Thanks to the author for this amazing repository. I am having problems training the model with multiple GPUs, and I wonder if anyone else is seeing the same thing. Training is fine on a single RTX3090, but whenever I try to use 2 GPUs with the following command:
python main.py configs/resa/resa34_openlane.py --gpus 0 1
The following error occurs:
/home/anaconda3/envs/lanedet/lib/python3.8/site-packages/torch/nn/parallel/_functions.py:65: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all '
Traceback (most recent call last):
File "main.py", line 66, in <module>
main()
File "main.py", line 36, in main
runner.train()
File "/home/Documents/git/lanedet/lanedet/engine/runner.py", line 99, in train
self.train_epoch(epoch, train_loader)
File "/home/Documents/git/lanedet/lanedet/engine/runner.py", line 75, in train_epoch
loss.backward()
File "/home/anaconda3/envs/lanedet/lib/python3.8/site-packages/torch/tensor.py", line 245, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/home/anaconda3/envs/lanedet/lib/python3.8/site-packages/torch/autograd/__init__.py", line 141, in backward
grad_tensors_ = _make_grads(tensors, grad_tensors_)
File "/home/anaconda3/envs/lanedet/lib/python3.8/site-packages/torch/autograd/__init__.py", line 50, in _make_grads
raise RuntimeError("grad can be implicitly created only for scalar outputs")
RuntimeError: grad can be implicitly created only for scalar outputs
After searching online, I found that this error can be avoided by changing loss.backward() to loss.sum().backward(). However, that change then causes the recorder and logging function to fail:
--- Logging error ---
Traceback (most recent call last):
File "/home/anaconda3/envs/lanedet/lib/python3.8/logging/__init__.py", line 1085, in emit
msg = self.format(record)
File "/home/anaconda3/envs/lanedet/lib/python3.8/logging/__init__.py", line 929, in format
return fmt.format(record)
File "/home/anaconda3/envs/lanedet/lib/python3.8/logging/__init__.py", line 668, in format
record.message = record.getMessage()
File "/home/anaconda3/envs/lanedet/lib/python3.8/logging/__init__.py", line 371, in getMessage
msg = str(self.msg)
File "/home/Documents/git/lanedet/lanedet/utils/recorder.py", line 116, in __str__
loss_state.append('{}: {:.4f}'.format(k, v.avg))
File "/home/Documents/git/lanedet/lanedet/utils/recorder.py", line 32, in avg
d = torch.tensor(list(self.deque))
ValueError: only one element tensors can be converted to Python scalars
Call stack:
File "main.py", line 66, in <module>
main()
File "main.py", line 36, in main
runner.train()
File "/home/Documents/git/lanedet/lanedet/engine/runner.py", line 99, in train
self.train_epoch(epoch, train_loader)
File "/home/Documents/git/lanedet/lanedet/engine/runner.py", line 89, in train_epoch
self.recorder.record('train')
File "/home/Documents/git/lanedet/lanedet/utils/recorder.py", line 97, in record
self.logger.info(self)
Message: <lanedet.utils.recorder.Recorder object at 0x7fd865ac7eb0>
Arguments: ()
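For anyone who wants to reproduce the shape problem without two GPUs, here is a minimal sketch of what I think is happening. It rests on assumptions I have not verified against the lanedet source: that the model is wrapped in nn.DataParallel, that each replica returns a 0-dim loss (which the gather warning above suggests), and that the recorder stores whatever loss tensor it is handed. The names `per_gpu_losses` and `history` are made up for illustration, and `torch.stack` only stands in for the gather step.

```python
from collections import deque

import torch

# Stand-in for DataParallel's gather when every replica returns a 0-dim
# loss: the scalars end up in one vector with one entry per GPU.
w = torch.ones(3, requires_grad=True)
per_gpu_losses = [(w * 1.0).sum(), (w * 2.0).sum()]  # two hypothetical replicas
loss = torch.stack(per_gpu_losses)                   # shape (2,), not a scalar

# backward() on a non-scalar tensor without an explicit gradient fails.
try:
    loss.backward()
    backward_error = None
except RuntimeError as exc:
    backward_error = str(exc)

# Reduce once, then use the same 0-dim value for backward and for logging.
scalar_loss = loss.mean()
scalar_loss.backward()

# Hypothetical recorder buffer: store Python floats, never per-GPU vectors.
history = deque(maxlen=20)
history.append(scalar_loss.item())
avg = torch.tensor(list(history)).mean().item()
```

If this picture is right, calling only loss.sum().backward() fixes the backward pass but still leaves a per-GPU vector in the recorder, which would explain the ValueError in avg. Note that mean() and sum() differ by a factor of the GPU count, so sum() effectively scales the gradients (and the logged loss) by the number of GPUs.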
Does anyone have an idea how to solve this? Any help is appreciated! Thank you.