ZM-Zhou / SDFA-Net_pytorch

Apache License 2.0
17 stars · 5 forks

multi gpu training #1

Closed ksh11023 closed 1 year ago

ksh11023 commented 1 year ago

Hello,

Thank you for the nice work!

Can this code be trained with multiple GPUs? `CUDA_VISIBLE_DEVICES=0,1,2,3 python train_dist.py` doesn't work.

Thank you.

ZM-Zhou commented 1 year ago

Thanks for your attention! This code can be trained with multiple GPUs using DistributedDataParallel. For instance, you could try the following command for training (please note that the `--batch_size` argument means the batch size on ONE GPU):

# train SDFA-Net at stage1 on 4 GPUs
CUDA_VISIBLE_DEVICES=0,1,2,3 python\
 -m torch.distributed.launch --nproc_per_node=4 --master_port 12345\
 train_dist.py\
 --name SDFA-Net-SwinT-M_192Crop_KITTI_S_St1_B12\
 --exp_opts options/SDFA-Net/train/sdfa_net-swint-m_192crop_kitti_stereo_stage1.yaml\
 --batch_size 3\
 --epoch 25\
 --visual_freq 2000\
 --save_freq 5
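
Since `--batch_size` is per GPU, the effective batch size is the per-GPU value times the number of processes. A trivial sketch of the arithmetic (the mapping to the `B12` suffix in the experiment name is my assumption, not stated by the author):

```python
# Effective batch size when --batch_size is specified per GPU.
per_gpu_batch = 3   # --batch_size from the command above
num_gpus = 4        # --nproc_per_node

effective_batch = per_gpu_batch * num_gpus
print(effective_batch)  # → 12, presumably the "B12" in SDFA-Net-SwinT-M_192Crop_KITTI_S_St1_B12
```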
ksh11023 commented 1 year ago

Hello, Thank you for the swift reply!

I have tried the way you mentioned; however, I got this error.

[screenshot of the error message]

I am running my code on an RTX 3090 with CUDA 11.2 and PyTorch 1.9.0. Could you please give me some advice?

Thank you.

ZM-Zhou commented 1 year ago

From the error message, it seems that the NCCL arguments are invalid. You may check that torch can recognize all GPUs (I guess this is the main problem, given the message `.. NCCL INFO NET/IB No device found`) and that port 12345 is free. This repo worked well on a V100 with CUDA 10.2 and PyTorch 1.7.0, but I'm not sure whether the code supports parallel training on higher versions. I hope the above information helps you.
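
As a quick sanity check before launching, a generic diagnostic sketch (not part of this repo) can confirm that PyTorch sees all four GPUs, that the NCCL backend is available, and that the rendezvous port is unoccupied:

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if nothing is currently listening on host:port.

    connect_ex returns 0 when a connection succeeds, i.e. the port is busy.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        return s.connect_ex((host, port)) != 0


if __name__ == "__main__":
    try:
        import torch

        # With CUDA_VISIBLE_DEVICES=0,1,2,3 this should print 4.
        print("visible GPUs:", torch.cuda.device_count())
        print("NCCL available:", torch.distributed.is_nccl_available())
    except ImportError:
        print("torch is not installed in this environment")

    # 12345 is the --master_port used in the launch command above.
    print("port 12345 free:", port_is_free(12345))
```

If `device_count()` reports fewer GPUs than expected or NCCL is unavailable, the launcher will fail before training starts; picking a different `--master_port` is the usual fix when the port check fails.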

ksh11023 commented 1 year ago

Okay, thanks!