tushardmaske opened 2 years ago
I have used your command in my environment and it works well. My hardware setup is two GPUs on one node.
Thank you very much for your response.
The training started after I set "export NCCL_P2P_DISABLE=1" as suggested on this site.
But now the problem is that it is not training in a combined way. I mean, the 2 GPUs are supposed to share one task, but it is training like two copies of the same task on the 2 GPUs. I get 2 logs showing the same "time left" from both GPUs with --nproc_per_node=2.
Same code with --nproc_per_node=1
In fact, with 1 GPU it shows less "time left" (whereas it should show more "time left" with 1 GPU and less with 2 GPUs).
I do not understand whether it is something I am setting wrong or whether I am using the wrong version of some library.
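For what it's worth, two identical logs with the same "time left" usually mean that each process started by torch.distributed.launch is running an unmodified single-GPU training loop, so nothing is actually being shared between the GPUs. For the work to be split, train.py has to join the process group, wrap the model in DistributedDataParallel, and shard the data with DistributedSampler. Below is a minimal, self-contained sketch of that wiring, assuming the legacy torch.distributed.launch entry point (which passes --local_rank to each process); the linear model and random dataset are toy stand-ins, not the repo's actual monodepth2 code.

```python
import argparse
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # filled in by torch.distributed.launch
args = parser.parse_args()

dist.init_process_group(backend="nccl")  # one process group spanning both GPUs
torch.cuda.set_device(args.local_rank)   # each process pins its own GPU
device = torch.device(f"cuda:{args.local_rank}")

# Toy stand-ins for the real model and KITTI dataset.
model = nn.Linear(10, 1).to(device)
model = DDP(model, device_ids=[args.local_rank])  # gradients are averaged across ranks
dataset = TensorDataset(torch.randn(64, 10), torch.randn(64, 1))

sampler = DistributedSampler(dataset)  # each rank gets a disjoint shard of the data
loader = DataLoader(dataset, batch_size=2, sampler=sampler)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for epoch in range(2):
    sampler.set_epoch(epoch)  # reshuffle the shards each epoch
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        loss = nn.functional.mse_loss(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    if dist.get_rank() == 0:  # log from one rank only, so there is a single "time left"
        print(f"epoch {epoch} done, last loss {loss.item():.4f}")
```

With the sampler in place, each GPU sees half of the batches per epoch, which is also why the "time left" estimate should drop with 2 GPUs rather than stay the same.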
Hello, I am also experiencing the same problem now. Have you solved it? Can you please tell me the solution?
Best, Rui
Hello, when I start multi-GPU training, I run the following command:

python -m torch.distributed.launch --nproc_per_node=2 train.py --split eigen_zhou --learning_rate 1e-4 --height 320 --width 1024 --scheduler_step_size 14 --batch_size 2 --model_name mono_model --png --data_path ../4_monodepth2/data/KITTI/ --num_epochs 40 --log_dir weights_logs
If I set --nproc_per_node=1, it runs fine on a single GPU, but if I set --nproc_per_node=2, it just prints the messages that come before distributed training is initialized, and after that it gets stuck. From nvidia-smi, I can see the GPUs are 100% utilized, but training does not start (weights_logs also does not get created).
I have attached a screenshot of where it gets stuck. Can you please help me figure out what this might be?
Thank you for your time.
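One way to narrow down a hang like this is to turn on NCCL's debug logging and disable peer-to-peer transfers before the process group is created, and then run a single tiny collective to check whether the two ranks can talk to each other at all. Below is a small standalone smoke test along those lines (smoke_test.py is a hypothetical file name, not part of this repo), launched the same way as train.py. If the all_reduce here also hangs, the problem is in the NCCL/GPU communication setup rather than in the training code.

```python
# Hypothetical smoke_test.py, not part of this repo. Run with:
#   python -m torch.distributed.launch --nproc_per_node=2 smoke_test.py
import os

# NCCL reads these when the communicator is created, so set them before init.
os.environ.setdefault("NCCL_DEBUG", "INFO")     # verbose NCCL setup/transport logs
os.environ.setdefault("NCCL_P2P_DISABLE", "1")  # work around broken peer-to-peer links

import argparse
import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # filled in by the launcher
args = parser.parse_args()

dist.init_process_group(backend="nccl")
torch.cuda.set_device(args.local_rank)

# One tiny collective: each rank contributes its rank id (0 and 1),
# so both ranks should print 1.0 if communication works.
t = torch.tensor([float(dist.get_rank())], device=f"cuda:{args.local_rank}")
dist.all_reduce(t)
print(f"rank {dist.get_rank()}: all_reduce result = {t.item()}")
```

If both ranks print a result, the communication layer is fine and the hang is more likely in the training script's own distributed setup; if the test hangs too, the NCCL logs it produces usually point at the failing transport.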