Hello! Thanks for sharing your excellent work! I tried the training code and ran into the following two problems.
1) I tried training with 2 GPUs using the command in run.sh, but the training stopped after several hours with an error raised from l_pix.backward():
/IDM-main/model/model.py", line 86,in optimize parameters:
l pix.backward()
anaconda3/envs/idm3/lib/python3.9/site-packages/torch/autograd/init .py", in line 200, in backward
Variable. execution engine.run backward #Calls into C++ engine to run the backward pass
RuntimeError:[../third party/gloo/gloo/transport/tcp/pair.cc:598] Connection closed by peer [127.0.0.1]: 47290
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid:2176380) of binary:
Have you ever encountered this problem?
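In case it helps narrow things down, the workaround I am planning to try is to initialize the process group with the NCCL backend and a longer timeout instead of Gloo, roughly as in the sketch below (the timeout value, the env:// init method, and the helper name `init_distributed` are my own assumptions, not code taken from idm_main.py):

```python
# A minimal sketch of the workaround I am considering: NCCL backend with a
# longer timeout instead of Gloo. The timeout, env:// init, and helper name
# are my guesses, not the actual idm_main.py code.
import datetime
import os

import torch
import torch.distributed as dist

def init_distributed():
    # torchrun / torch.distributed.launch exports LOCAL_RANK for every process
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(
        backend="nccl",                          # GPU collectives, avoiding the Gloo TCP pair
        init_method="env://",
        timeout=datetime.timedelta(minutes=60),  # tolerate long validation / checkpoint stalls
    )
    return local_rank

# local_rank = init_distributed()
# model = torch.nn.parallel.DistributedDataParallel(model.cuda(), device_ids=[local_rank])
```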
2) Regarding the main function: since I hit the problem in 1), I tried using sr.py (from the SR3 repository) instead of idm_main.py. idm_main.py uses nn.parallel.DistributedDataParallel, while SR3 uses nn.DataParallel. However, when I ran sr.py for training on 2 GPUs, only a single GPU was used. I wonder whether you also observed this behaviour and that is why you switched to nn.parallel.DistributedDataParallel. Is it caused by basicsr?
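For reference, this is roughly how I would expect both GPUs to be occupied with nn.DataParallel; the toy network and the explicit device_ids are placeholders of mine, not the actual code in sr.py:

```python
# Minimal sketch of the nn.DataParallel setup I would expect to use both GPUs.
# The toy network and explicit device_ids are placeholders, not the real sr.py model.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())  # placeholder network

if torch.cuda.device_count() > 1:
    # DataParallel scatters each batch across device_ids and gathers outputs on GPU 0;
    # if CUDA_VISIBLE_DEVICES exposes only one device, it silently runs on a single GPU.
    model = nn.DataParallel(model, device_ids=[0, 1])
model = model.cuda()

x = torch.randn(8, 3, 64, 64).cuda()  # an 8-image batch is split 4 + 4 across the two GPUs
y = model(x)
print(y.shape, torch.cuda.device_count())
```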
Thanks for your patience in reading this issue! Looking forward to your reply.