brandleyzhou / DIFFNet

[BMVC 2021] "Self-Supervised Monocular Depth Estimation with Internal Feature Fusion"

Multi-GPU training hangs #10


tushardmaske commented 2 years ago

Hello, when I start multi-GPU training I run the following command:

python -m torch.distributed.launch --nproc_per_node=2 train.py --split eigen_zhou --learning_rate 1e-4 --height 320 --width 1024 --scheduler_step_size 14 --batch_size 2 --model_name mono_model --png --data_path ../4_monodepth2/data/KITTI/ --num_epochs 40 --log_dir weights_logs

If I set --nproc_per_node=1, it runs fine on a single GPU. But if I set --nproc_per_node=2, it only prints the messages that come before distributed training is initialized and then gets stuck. From nvidia-smi I can see both GPUs at 100% utilization, but training never starts (the weights_logs directory is not created either).
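For reference, here is a minimal sketch of the initialization that torch.distributed.launch expects from the launched script (illustrative only; DIFFNet's actual train.py may differ). The init_process_group call blocks until every spawned process has joined, so GPUs pinned at 100% with no training output usually means the NCCL process group never finishes forming:

```python
# Minimal DDP bootstrap sketch (illustrative; not DIFFNet's actual train.py).
import argparse
import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
# torch.distributed.launch passes --local_rank to every process it spawns
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
# Blocks until all --nproc_per_node processes have joined the NCCL group;
# if this never returns, the problem is communication setup, not the model code.
dist.init_process_group(backend="nccl", init_method="env://")
print(f"rank {dist.get_rank()}/{dist.get_world_size()} initialized")
```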

I have attached a screenshot of where it gets stuck. Can you please help me figure out what might be causing this? (screenshot: diffnet_multigpuStuck)

Thank you for your time.

brandleyzhou commented 2 years ago

I have used your command in my environment and it works well. My hardware setup is two GPUs on one node.

tushardmaske commented 2 years ago

Thank you very much for your response.

The training started after I set "export NCCL_P2P_DISABLE=1" as suggested on this site.
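For anyone hitting the same hang: NCCL_P2P_DISABLE and NCCL_DEBUG are the standard NCCL environment variables for this situation. A hedged sketch of setting them from Python before the process group is created (exporting them in the shell, as above, works just as well):

```python
# Sketch: disable NCCL peer-to-peer transfers and enable NCCL logging.
# Both are standard NCCL environment variables; they must be set before
# dist.init_process_group() is called.
import os

os.environ.setdefault("NCCL_P2P_DISABLE", "1")  # avoid GPU P2P paths that can stall on some systems
os.environ.setdefault("NCCL_DEBUG", "INFO")     # print NCCL setup details to help diagnose hangs
```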

But now the problem is that the two GPUs are not training cooperatively. I would expect them to share one task, but instead they appear to run two identical copies of the same task: I get two logs showing the same "time left", one from each GPU.

With --nproc_per_node=2: (screenshot)

Same code with --nproc_per_node=1: (screenshot)

In fact, with 1 GPU it shows less "time left" than with 2, whereas it should be the other way around (more "time left" with 1 GPU, less with 2).

I do not understand whether I am setting something wrong or whether one of the library versions I am using is the problem.
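The symptom described here (two logs with the same "time left") typically appears when each process iterates the full dataset instead of its own shard. As a hedged illustration (toy data and model, not DIFFNet's code), these are the two pieces that make the launched processes cooperate on one job:

```python
# Sketch of the two pieces that make N processes share one training job
# (illustrative toy model/data, not DIFFNet's actual train.py).
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl", init_method="env://")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

dataset = TensorDataset(torch.randn(1000, 8), torch.randn(1000, 1))  # toy data
# DistributedSampler gives each rank a disjoint 1/world_size slice of the data;
# without it, every GPU iterates the full dataset and "time left" does not shrink.
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, batch_size=2, sampler=sampler)

model = nn.Linear(8, 1).cuda(local_rank)  # toy model
# DDP averages gradients across ranks each step, so the GPUs train one model together.
model = DDP(model, device_ids=[local_rank])
```

If the sampler or the DDP wrapper is missing, each rank simply trains its own copy of the model on the full data, which matches the behavior reported above.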

Hyyzhangrui commented 1 year ago

Hello, I am now running into the same problem. Have you solved it? If so, could you please share the solution?

Best, Rui