[Closed] ksh11023 closed this issue 1 year ago
Thanks for your attention!
This code can be trained on multiple GPUs using DistributedDataParallel.
For instance, you could try the following command for training (please note that the --batch_size
argument means the batch size on ONE GPU):
```shell
# train SDFA-Net at stage1 on 4 GPUs
CUDA_VISIBLE_DEVICES=0,1,2,3 python \
    -m torch.distributed.launch --nproc_per_node=4 --master_port 12345 \
    train_dist.py \
    --name SDFA-Net-SwinT-M_192Crop_KITTI_S_St1_B12 \
    --exp_opts options/SDFA-Net/train/sdfa_net-swint-m_192crop_kitti_stereo_stage1.yaml \
    --batch_size 3 \
    --epoch 25 \
    --visual_freq 2000 \
    --save_freq 5
```
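Since --batch_size is per GPU, the effective global batch size is batch_size × nproc_per_node. A quick sanity check (the `_B12` suffix in the run name above presumably reflects this total, though that naming convention is an assumption):

```python
# Effective global batch size under DistributedDataParallel:
# each of the 4 launched processes loads its own batch of 3 samples.
per_gpu_batch = 3   # value passed via --batch_size
num_gpus = 4        # value passed via --nproc_per_node

global_batch = per_gpu_batch * num_gpus
print(global_batch)  # → 12
```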
Hello, thank you for the swift reply!
I have tried the way you mentioned; however, I got this error.
I am running the code on an RTX 3090 with CUDA 11.2 and PyTorch 1.9.0. Could you please give me some advice?
Thank you.
From the error message, it seems that the arguments passed to NCCL are invalid. You may check that torch
can recognize all GPUs (I guess this is the main problem, given the message .. NCCL INFO NET/IB No device found.
) and that port 12345 is free. This repo worked well on a V100 with CUDA 10.2 and PyTorch 1.7.0, but I'm not sure whether the code supports parallel training on higher versions.
I hope the above information helps.
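To rule out the two causes mentioned above, a minimal stdlib-only sketch for checking whether the rendezvous port is already in use (the GPU-visibility check via `torch.cuda.device_count()` is left as a comment, since it requires a CUDA-enabled machine):

```python
import socket

def port_is_free(port, host="127.0.0.1"):
    """Return True if nothing is listening on the given TCP port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # connect_ex returns 0 only if a connection succeeds,
        # i.e. something is already bound to and listening on the port.
        return s.connect_ex((host, port)) != 0

# Check the --master_port value used by torch.distributed.launch.
print(port_is_free(12345))

# To confirm torch sees every GPU exposed by CUDA_VISIBLE_DEVICES, one
# would additionally run on the training machine:
#   python -c "import torch; print(torch.cuda.device_count())"   # expect 4
```

If the port is taken, passing any other free port to --master_port works just as well.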
okay thanks!
Hello,
Thank you for the nice work!
Can this code be trained with multiple GPUs? `CUDA_VISIBLE_DEVICES=0,1,2,3 python train_dist.py` doesn't work.
Thank you.