Ree1s / IDM


About training on multi-gpus #13

Open stayhungry1 opened 1 year ago

stayhungry1 commented 1 year ago

Hello! Thanks for sharing your excellent work! I tried the training code and ran into the following two problems.

1) I tried training with 2 GPUs using the command in run.sh, but training stopped after several hours with an error raised from l_pix.backward():

    File "/IDM-main/model/model.py", line 86, in optimize_parameters
        l_pix.backward()
    File "anaconda3/envs/idm3/lib/python3.9/site-packages/torch/autograd/__init__.py", line 200, in backward
        Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:598] Connection closed by peer [127.0.0.1]:47290
    ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2176380) of binary:

Have you ever met this problem?
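
For context, the training-loop structure that crashes for me boils down to something like the minimal DistributedDataParallel sketch below (placeholder model, data, loss, and script name; not the actual IDM code):

    import os
    import torch
    import torch.distributed as dist
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP

    def main():
        # torchrun sets LOCAL_RANK / RANK / WORLD_SIZE for every spawned process.
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)
        # nccl is the usual backend for multi-GPU training; the traceback above
        # mentions gloo, so the backend choice may matter here.
        dist.init_process_group(backend="nccl")

        model = nn.Linear(64, 64).cuda(local_rank)   # stand-in for the diffusion model
        model = DDP(model, device_ids=[local_rank])
        opt = torch.optim.Adam(model.parameters(), lr=1e-4)

        for _ in range(100):
            x = torch.randn(8, 64, device=local_rank)
            l_pix = model(x).abs().mean()            # stand-in for the pixel loss
            opt.zero_grad()
            l_pix.backward()                          # the call where my run crashes
            opt.step()

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

I launch this kind of script with torchrun --nproc_per_node=2 train_sketch.py (the script name is a placeholder).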

2) Regarding the main function: since I ran into problem 1), I tried using sr.py (from the SR3 repository) instead of idm_main.py. idm_main.py uses nn.parallel.DistributedDataParallel, while SR3 uses nn.DataParallel. However, when I ran sr.py with 2 GPUs, only a single GPU was actually used. I wonder whether you also encountered this problem and therefore switched to nn.parallel.DistributedDataParallel. Could it be caused by basicsr?
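
For clarity, the two wrapping styles I am comparing are roughly the following (placeholder network, assuming two visible GPUs; not the actual SR3/IDM code):

    import os
    import torch
    import torch.distributed as dist
    import torch.nn as nn

    model = nn.Linear(64, 64)  # stand-in for the SR network

    # SR3-style (sr.py): a single process; nn.DataParallel splits each batch
    # across the listed GPUs and gathers the outputs back onto GPU 0.
    dp_model = nn.DataParallel(model.cuda(0), device_ids=[0, 1])

    # IDM-style (idm_main.py): one process per GPU, launched with torchrun;
    # each process wraps its own replica and gradients are all-reduced during
    # backward(). Requires an initialized process group.
    if "LOCAL_RANK" in os.environ:
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)
        dist.init_process_group(backend="nccl")
        ddp_model = nn.parallel.DistributedDataParallel(
            model.cuda(local_rank), device_ids=[local_rank]
        )

As far as I understand, nn.DataParallel only splits work along the batch dimension and keeps the gathered outputs and loss on GPU 0, so with a small per-step batch the second GPU can sit almost idle; I am not sure whether that, or something inside basicsr, explains what I am seeing.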

Thanks for your patience in reading this issue! Looking forward to your reply.

SijieLiu518 commented 11 months ago

I am running into the same problem.