Ree1s / IDM


About training on multi-gpus #13

Open stayhungry1 opened 1 year ago

stayhungry1 commented 1 year ago

Hello! Thanks for sharing your excellent work! I tried the training code and ran into the following two problems.

1) I tried training with 2 GPUs using the command in run.sh, but training stopped after several hours with an error raised from l_pix.backward():

    File "/IDM-main/model/model.py", line 86, in optimize_parameters
        l_pix.backward()
    File "anaconda3/envs/idm3/lib/python3.9/site-packages/torch/autograd/__init__.py", line 200, in backward
        Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:598] Connection closed by peer [127.0.0.1]:47290
    ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2176380) of binary:

Have you ever met this problem?
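
For context, the training-loop structure that crashes for me boils down to something like the minimal DistributedDataParallel sketch below (placeholder model, data, loss, and script name; not the actual IDM code):

    import os
    import torch
    import torch.distributed as dist
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP

    def main():
        # torchrun sets LOCAL_RANK / RANK / WORLD_SIZE for every spawned process.
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)
        # nccl is the usual backend for multi-GPU training; the traceback above
        # mentions gloo, so the backend choice may matter here.
        dist.init_process_group(backend="nccl")

        model = nn.Linear(64, 64).cuda(local_rank)   # stand-in for the diffusion model
        model = DDP(model, device_ids=[local_rank])
        opt = torch.optim.Adam(model.parameters(), lr=1e-4)

        for _ in range(100):
            x = torch.randn(8, 64, device=local_rank)
            l_pix = model(x).abs().mean()            # stand-in for the pixel loss
            opt.zero_grad()
            l_pix.backward()                          # the call where my run crashes
            opt.step()

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

I launch this kind of script with torchrun --nproc_per_node=2 train_sketch.py (the script name is a placeholder).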

2) Regarding the main function: since I ran into problem 1), I tried using sr.py (from the SR3 repository) instead of idm_main.py. idm_main.py uses nn.parallel.DistributedDataParallel, while SR3 uses nn.DataParallel. However, when I ran sr.py with 2 GPUs, only a single GPU was actually used. I wonder whether you also encountered this problem and therefore switched to nn.parallel.DistributedDataParallel. Could it be caused by basicsr?
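
For clarity, the two wrapping styles I am comparing are roughly the following (placeholder network, assuming two visible GPUs; not the actual SR3/IDM code):

    import os
    import torch
    import torch.distributed as dist
    import torch.nn as nn

    model = nn.Linear(64, 64)  # stand-in for the SR network

    # SR3-style (sr.py): a single process; nn.DataParallel splits each batch
    # across the listed GPUs and gathers the outputs back onto GPU 0.
    dp_model = nn.DataParallel(model.cuda(0), device_ids=[0, 1])

    # IDM-style (idm_main.py): one process per GPU, launched with torchrun;
    # each process wraps its own replica and gradients are all-reduced during
    # backward(). Requires an initialized process group.
    if "LOCAL_RANK" in os.environ:
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)
        dist.init_process_group(backend="nccl")
        ddp_model = nn.parallel.DistributedDataParallel(
            model.cuda(local_rank), device_ids=[local_rank]
        )

As far as I understand, nn.DataParallel only splits work along the batch dimension and keeps the gathered outputs and loss on GPU 0, so with a small per-step batch the second GPU can sit almost idle; I am not sure whether that, or something inside basicsr, explains what I am seeing.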

Thanks for your patience in reading this issue! Looking forward to your reply.

SijieLiu518 commented 11 months ago

I am running into the same problem.