cleinc / bts

From Big to Small: Multi-Scale Local Planar Guidance for Monocular Depth Estimation
GNU General Public License v3.0

Errors with distributed training during training #100

Open yang-yi-fan opened 3 years ago

yang-yi-fan commented 3 years ago

Thanks for your excellent work! However, I ran into some problems when training on the KITTI dataset. I used two NVIDIA GeForce RTX 2080 Ti GPUs for training and enabled --multiprocessing_distributed and --do_online_eval; in other words, the parameter settings are consistent with the file arguments_train_eigen.txt. The program runs normally for a while (both training and evaluating the network), and then errors related to NCCL appear, as follows:

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/home/xxxx/anaconda3/envs/pytorch120/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/xxxx/anaconda3/envs/pytorch120/lib/python3.6/site-packages/tensorboardX/event_file_writer.py", line 202, in run
    data = self._queue.get(True, queue_wait_duration)
  File "/home/xxxx/anaconda3/envs/pytorch120/lib/python3.6/multiprocessing/queues.py", line 108, in get
    res = self._recv_bytes()
  File "/home/xxxx/anaconda3/envs/pytorch120/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/home/xxxx/anaconda3/envs/pytorch120/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/xxxx/anaconda3/envs/pytorch120/lib/python3.6/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError

Exception in thread Thread-2:
Traceback (most recent call last):
  (identical traceback to Thread-1, ending in the same EOFError)

Traceback (most recent call last):
  File "/home/xxxx/Downloads/BTS-all/bts-master0/pytorch/bts_main.py", line 626, in <module>
    main()
  File "/home/xxxx/Downloads/BTS-all/bts-master0/pytorch/bts_main.py", line 620, in main
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
  File "/home/xxxx/anaconda3/envs/pytorch120/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/home/xxxx/anaconda3/envs/pytorch120/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/xxxx/anaconda3/envs/pytorch120/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/xxxx/Downloads/BTS-all/bts-master0/pytorch/bts_main.py", line 526, in main_worker
    eval_measures = online_eval(model, dataloader_eval, gpu, ngpus_per_node)
  File "/home/xxxx/Downloads/BTS-all/bts-master0/pytorch/bts_main.py", line 314, in online_eval
    dist.all_reduce(tensor=eval_measures, op=dist.ReduceOp.SUM, group=group)
  File "/home/xxxx/anaconda3/envs/pytorch120/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 902, in all_reduce
    work = group.allreduce([tensor], opts)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1565272279342/work/torch/lib/c10d/ProcessGroupNCCL.cpp:264, unhandled system error

It seems the problem occurs right after the evaluation set has been processed, at the point where the evaluation metrics are aggregated across GPUs. Hoping for an answer, thank you very much!
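For anyone hitting the same crash: the traceback points at the dist.all_reduce over eval_measures inside online_eval. If the script builds a new process group with dist.new_group() on every evaluation pass, one thing worth trying is reducing over the default world group instead, so a fresh NCCL communicator is not created each time. The sketch below is only that idea, not the authors' fix, and reduce_eval_measures is a hypothetical helper name:

import torch
import torch.distributed as dist

# Hypothetical helper, not the code from bts_main.py: sums the per-GPU
# evaluation measures over the default world group created once by
# dist.init_process_group(), instead of building a new group per call.
def reduce_eval_measures(eval_measures):
    # eval_measures is assumed to be a CUDA tensor on this rank's GPU,
    # as in the traceback above.
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(tensor=eval_measures, op=dist.ReduceOp.SUM)
    return eval_measures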

yang-yi-fan commented 3 years ago

I should also add that I am using the PyTorch implementation to train BTS. Could you please tell me which version of PyTorch was used at the time of training? Thanks!
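In case it helps narrow this down, the exact versions in play can be printed with standard PyTorch calls (the conda env name above suggests PyTorch 1.2.x, but it is worth confirming):

import torch

print(torch.__version__)          # PyTorch version
print(torch.version.cuda)         # CUDA version PyTorch was built against
print(torch.cuda.nccl.version())  # version of the NCCL library bundled with PyTorch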

everythoughthelps commented 3 years ago

Have you resolved the NCCL error? I ran into the same NCCL problem.

dongli12 commented 3 years ago

Dear authors,

I am running into the same NCCL error. Could you help with this problem and share your advice?

Traceback (most recent call last):
  File "bts_main.py", line 614, in <module>
    main()
  File "bts_main.py", line 608, in main
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
  File "/scratch/workspace/dongl/anaconda3/envs/bts/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/scratch/workspace/dongl/anaconda3/envs/bts/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/scratch/workspace/dongl/anaconda3/envs/bts/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/var/lib/docker/scratch/workspace/dongl/BTS/bts-master/pytorch/bts_main.py", line 518, in main_worker
    eval_measures = online_eval(model, dataloader_eval, gpu, ngpus_per_node)
  File "/var/lib/docker/scratch/workspace/dongl/BTS/bts-master/pytorch/bts_main.py", line 304, in online_eval
    dist.all_reduce(tensor=eval_measures, op=dist.ReduceOp.SUM, group=group)
  File "/scratch/workspace/dongl/anaconda3/envs/bts/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 902, in all_reduce
    work = group.allreduce([tensor], opts)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1565272271120/work/torch/lib/c10d/ProcessGroupNCCL.cpp:290, unhandled system error

/scratch/workspace/dongl/anaconda3/envs/bts/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 18 leaked semaphores to clean up at shutdown
  len(cache))

@cogaplex-bts

Thanks, Dong
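A general note for the "unhandled system error" message in both tracebacks above: NCCL hides the underlying cause unless it is asked to log. The environment variables below are standard NCCL settings, not options of this repository; setting them in the parent process before mp.spawn() is called (e.g. near the top of main() in bts_main.py) usually makes the real error visible, and disabling the InfiniBand or peer-to-peer transports is a common thing to try on desktop multi-GPU machines. This is a sketch of a diagnostic step, not a confirmed fix.

import os

# Standard NCCL environment variables (not BTS-specific); set them in the
# parent process before mp.spawn() so every worker inherits them.
os.environ["NCCL_DEBUG"] = "INFO"      # print per-rank NCCL logs showing the underlying failure
os.environ["NCCL_IB_DISABLE"] = "1"    # optional: skip the InfiniBand transport
os.environ["NCCL_P2P_DISABLE"] = "1"   # optional: disable GPU peer-to-peer copies

# ... then launch the workers as bts_main.py already does:
# mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))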