Turoad / lanedet

An open source lane detection toolbox based on PyTorch, including SCNN, RESA, UFLD, LaneATT, CondLane, etc.
Apache License 2.0
561 stars, 93 forks

Problems when training with multiple GPUs #78

Open EthanLeong opened 1 year ago

EthanLeong commented 1 year ago

Hi,

Thanks to the author for this amazing repository. I am having problems training the model with multiple GPUs, and I wonder if anyone else has run into the same issue. Training works fine on a single RTX3090, but whenever I try to use 2 GPUs with the following command:

python main.py configs/resa/resa34_openlane.py --gpus 0 1

the following error occurs:

/home/anaconda3/envs/lanedet/lib/python3.8/site-packages/torch/nn/parallel/_functions.py:65: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '
Traceback (most recent call last):
  File "main.py", line 66, in <module>
    main()
  File "main.py", line 36, in main
    runner.train()
  File "/home/Documents/git/lanedet/lanedet/engine/runner.py", line 99, in train
    self.train_epoch(epoch, train_loader)
  File "/home/Documents/git/lanedet/lanedet/engine/runner.py", line 75, in train_epoch
    loss.backward()
  File "/home/anaconda3/envs/lanedet/lib/python3.8/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/anaconda3/envs/lanedet/lib/python3.8/site-packages/torch/autograd/__init__.py", line 141, in backward
    grad_tensors_ = _make_grads(tensors, grad_tensors_)
  File "/home/anaconda3/envs/lanedet/lib/python3.8/site-packages/torch/autograd/__init__.py", line 50, in _make_grads
    raise RuntimeError("grad can be implicitly created only for scalar outputs")
RuntimeError: grad can be implicitly created only for scalar outputs
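For anyone reproducing this without two GPUs: the UserWarning suggests that each GPU replica returns a scalar loss and that the gather step stacks those scalars into a vector with one entry per GPU, so `loss.backward()` is called on a non-scalar tensor. The sketch below is only an illustration of that reading and runs on CPU. The parameter `w` and the two per-replica losses are made-up stand-ins, not lanedet code.

```python
import torch

# Hypothetical stand-in for a model parameter; the real network is not
# needed to show the effect.
w = torch.tensor(2.0, requires_grad=True)

# Each replica produces a 0-dim loss. Gathering unsqueezes and concatenates
# them, which is emulated here with torch.stack.
per_replica_losses = [w * 1.0, w * 3.0]
gathered = torch.stack(per_replica_losses)  # shape: (2,), one entry per "GPU"

# backward() on a non-scalar tensor without an explicit gradient fails.
try:
    gathered.backward()
    backward_error = None
except RuntimeError as exc:
    backward_error = str(exc)

print(backward_error)

# Reducing to a scalar first lets autograd create the implicit gradient.
gathered.mean().backward()
print(w.grad)
```

Reducing with `mean()` rather than `sum()` keeps the gradient scale independent of the number of GPUs, whereas `sum()` scales it by the GPU count. Which one is appropriate here is a judgment call that depends on how the learning rate was tuned.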

After searching online, I found that this error can be avoided by changing loss.backward() to loss.sum().backward(). However, this causes the recorder and logging function to fail:

--- Logging error ---
Traceback (most recent call last):
  File "/home/anaconda3/envs/lanedet/lib/python3.8/logging/__init__.py", line 1085, in emit
    msg = self.format(record)
  File "/home/anaconda3/envs/lanedet/lib/python3.8/logging/__init__.py", line 929, in format
    return fmt.format(record)
  File "/home/anaconda3/envs/lanedet/lib/python3.8/logging/__init__.py", line 668, in format
    record.message = record.getMessage()
  File "/home/anaconda3/envs/lanedet/lib/python3.8/logging/__init__.py", line 371, in getMessage
    msg = str(self.msg)
  File "/home/Documents/git/lanedet/lanedet/utils/recorder.py", line 116, in __str__
    loss_state.append('{}: {:.4f}'.format(k, v.avg))
  File "/home/Documents/git/lanedet/lanedet/utils/recorder.py", line 32, in avg
    d = torch.tensor(list(self.deque))
ValueError: only one element tensors can be converted to Python scalars
Call stack:
  File "main.py", line 66, in <module>
    main()
  File "main.py", line 36, in main
    runner.train()
  File "/home/Documents/git/lanedet/lanedet/engine/runner.py", line 99, in train
    self.train_epoch(epoch, train_loader)
  File "/home/Documents/git/lanedet/lanedet/engine/runner.py", line 89, in train_epoch
    self.recorder.record('train')
  File "/home/Documents/git/lanedet/lanedet/utils/recorder.py", line 97, in record
    self.logger.info(self)
Message: <lanedet.utils.recorder.Recorder object at 0x7fd865ac7eb0>
Arguments: ()
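The second traceback points at `torch.tensor(list(self.deque))`, which would fail if the recorder's deque holds per-GPU loss vectors instead of scalars. A possible workaround, sketched below under that assumption, is to reduce each loss to a Python float before it is recorded. `SmoothedValue` and `to_scalar` are hypothetical names written for this illustration and are not copied from lanedet's recorder.py.

```python
from collections import deque

import torch


class SmoothedValue:
    """Hypothetical stand-in for the recorder's running-average helper."""

    def __init__(self, window_size=20):
        self.deque = deque(maxlen=window_size)

    def update(self, value):
        self.deque.append(value)

    @property
    def avg(self):
        # Mirrors the failing line in the traceback.
        return torch.tensor(list(self.deque)).mean().item()


def to_scalar(loss):
    """Collapse a per-GPU loss vector (or a 0-dim tensor) to a Python float."""
    return loss.detach().mean().item()


# Made-up loss values shaped like a 2-GPU gather result.
vector_losses = [torch.tensor([0.5, 1.5]), torch.tensor([1.0, 3.0])]

# Recording the raw vectors reproduces the failure when avg is read.
broken = SmoothedValue()
for loss in vector_losses:
    broken.update(loss)
try:
    broken.avg
    avg_error = None
except (ValueError, TypeError, RuntimeError) as exc:
    avg_error = type(exc).__name__
print(avg_error)

# Recording reduced floats keeps the running average well defined.
fixed = SmoothedValue()
for loss in vector_losses:
    fixed.update(to_scalar(loss))
print(fixed.avg)
```

If this matches the real recorder, the same reduction would have to be applied to every entry that gets logged, not only to the total loss.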

Does anyone have an idea of how to solve this? Any help is appreciated! Thank you.