Open yang-yi-fan opened 3 years ago
I should also add that I use the PyTorch implementation to train BTS. Could you please tell me which PyTorch version you used for training? Thanks!
Have you resolved the NCCL error? I had the same NCCL problem.
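When comparing versions, it may help to report the CUDA and NCCL versions alongside the PyTorch one, since NCCL errors can depend on all three. The following is a minimal sketch, not part of the BTS repository: `collect_versions` is a hypothetical helper name, and it assumes only `torch.__version__`, `torch.version.cuda`, and `torch.cuda.nccl.version()`, falling back gracefully when PyTorch or NCCL is unavailable.

```python
import importlib
import platform


def collect_versions():
    """Return a dict of version strings useful in an NCCL bug report.

    Hypothetical helper for illustration; not part of bts_main.py.
    """
    info = {"python": platform.python_version()}
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        # PyTorch is not installed in this environment.
        info["torch"] = "not installed"
        return info

    info["torch"] = torch.__version__
    # torch.version.cuda is None on CPU-only builds.
    info["cuda"] = str(torch.version.cuda)
    try:
        # The return type differs between PyTorch releases, so store it as text.
        info["nccl"] = str(torch.cuda.nccl.version())
    except Exception as exc:
        info["nccl"] = "unavailable ({})".format(type(exc).__name__)
    return info


if __name__ == "__main__":
    for name, value in collect_versions().items():
        print("{}: {}".format(name, value))
```

Running the script inside the training environment and pasting its output into the thread makes it easier to compare setups.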
Dear authors,
I have the same NCCL error. Could you help fix this problem and give your advice?
Traceback (most recent call last):
File "bts_main.py", line 614, in
@cogaplex-bts
Thanks, Dong
Thanks for your excellent work! I ran into a problem while training on the KITTI dataset. I used two NVIDIA GeForce RTX 2080 Ti GPUs and set --multiprocessing_distributed and --do_online_eval to True; in other words, the parameter settings are consistent with the file arguments_train_eigen.txt. The program runs normally for a while (including training and evaluating the network), and then NCCL errors appear, as follows:
Exception in thread Thread-1:
Traceback (most recent call last):
File "/home/xxxx/anaconda3/envs/pytorch120/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/home/xxxx/anaconda3/envs/pytorch120/lib/python3.6/site-packages/tensorboardX/event_file_writer.py", line 202, in run
data = self._queue.get(True, queue_wait_duration)
File "/home/xxxx/anaconda3/envs/pytorch120/lib/python3.6/multiprocessing/queues.py", line 108, in get
res = self._recv_bytes()
File "/home/xxxx/anaconda3/envs/pytorch120/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
buf = self._recv_bytes(maxlength)
File "/home/xxxx/anaconda3/envs/pytorch120/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
buf = self._recv(4)
File "/home/xxxx/anaconda3/envs/pytorch120/lib/python3.6/multiprocessing/connection.py", line 383, in _recv
raise EOFError
EOFError
Exception in thread Thread-2:
Traceback (most recent call last):
File "/home/xxxx/anaconda3/envs/pytorch120/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/home/xxxx/anaconda3/envs/pytorch120/lib/python3.6/site-packages/tensorboardX/event_file_writer.py", line 202, in run
data = self._queue.get(True, queue_wait_duration)
File "/home/xxxx/anaconda3/envs/pytorch120/lib/python3.6/multiprocessing/queues.py", line 108, in get
res = self._recv_bytes()
File "/home/xxxx/anaconda3/envs/pytorch120/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
buf = self._recv_bytes(maxlength)
File "/home/xxxx/anaconda3/envs/pytorch120/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
buf = self._recv(4)
File "/home/xxxx/anaconda3/envs/pytorch120/lib/python3.6/multiprocessing/connection.py", line 383, in _recv
raise EOFError
EOFError
Traceback (most recent call last):
File "/home/xxxx/Downloads/BTS-all/bts-master0/pytorch/bts_main.py", line 626, in
main()
File "/home/xxxx/Downloads/BTS-all/bts-master0/pytorch/bts_main.py", line 620, in main
mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
File "/home/xxxx/anaconda3/envs/pytorch120/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
while not spawn_context.join():
File "/home/xxxx/anaconda3/envs/pytorch120/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 118, in join
raise Exception(msg)
Exception:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/xxxx/anaconda3/envs/pytorch120/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/home/xxxx/Downloads/BTS-all/bts-master0/pytorch/bts_main.py", line 526, in main_worker
eval_measures = online_eval(model, dataloader_eval, gpu, ngpus_per_node)
File "/home/xxxx/Downloads/BTS-all/bts-master0/pytorch/bts_main.py", line 314, in online_eval
dist.all_reduce(tensor=eval_measures, op=dist.ReduceOp.SUM, group=group)
File "/home/xxxx/anaconda3/envs/pytorch120/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 902, in all_reduce
work = group.allreduce([tensor], opts)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1565272279342/work/torch/lib/c10d/ProcessGroupNCCL.cpp:264, unhandled system error
The error seems to occur right after the evaluation set has been processed, when the evaluation metrics are summed across processes (the dist.all_reduce call in online_eval). I hope you can help. Thank you very much!
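The message "unhandled system error" does not say which underlying call failed, so more verbose NCCL logging may help narrow it down. Below is a minimal sketch, assuming the standard NCCL environment variables `NCCL_DEBUG` and `NCCL_DEBUG_SUBSYS`; the helper name `enable_nccl_debug` and the chosen values are illustrative assumptions, not part of bts_main.py, and this only adds diagnostics rather than fixing the error.

```python
import os

# Standard NCCL environment variables; the values are only a starting point.
NCCL_DEBUG_SETTINGS = {
    "NCCL_DEBUG": "INFO",              # print NCCL version, topology and error details
    "NCCL_DEBUG_SUBSYS": "INIT,COLL",  # limit output to init and collective calls
}


def enable_nccl_debug(env=None):
    """Set NCCL debug variables unless the user already chose values.

    Hypothetical helper: call it before the worker processes are spawned
    so that they inherit the variables from the parent process.
    """
    if env is None:
        env = os.environ
    for key, value in NCCL_DEBUG_SETTINGS.items():
        # setdefault keeps any value exported in the shell beforehand.
        env.setdefault(key, value)
    return {key: env[key] for key in NCCL_DEBUG_SETTINGS}


if __name__ == "__main__":
    print(enable_nccl_debug())
```

In a script like the one in the traceback, such a call would go before the `mp.spawn(...)` line in `main()`; the extra log lines printed just before the failing all_reduce are the ones worth posting here.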