Dear authors:
Thank you for your great work. Recently I tried to reproduce your paper and run a complete train & search. By modifying the `dist_train.sh` file and changing `nproc_per_node` to 4 to suit my machine (4x3090), I managed to finish training stages 0-2, but upon entering stage 3 the code hangs after printing the info for stage 3, epoch 1, step 0.
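For reference, the only change I made was the per-node process count in the launch script. A minimal sketch, assuming `dist_train.sh` wraps `torch.distributed.launch` (the training entry point and argument passing here are placeholders, not the script's actual contents):

```bash
# Illustrative sketch only: the real dist_train.sh may pass different arguments.
# The only edit was setting the per-node process count to 4.
python -m torch.distributed.launch --nproc_per_node=4 train.py "$@"
```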
After doing some research, I found several interesting & strange details:
- The code doesn't hang immediately after entering stage 3: it successfully performs several complete steps of forward, backward, reduce, and step, and hangs around step 6-7 (7 in most cases).
- The (apparent) reason the program hangs is that one of the processes gets stuck inside `optimizer.step()` after successfully calling `loss.backward()`. This is strange, as I can't imagine how `optimizer.step()` could fail when the gradients were propagated backward successfully, but that is exactly what happens: the stuck process never prints any of the debug logs I placed after `optimizer.step()` (see the sketch after this list). The other processes just wait for it and the whole program hangs.
- A process of any rank (including rank 0) can get stuck, and only one process gets stuck each time. The other three processes run just fine until the next `all_reduce`.
- The above points, although seemingly strange and random, can be reproduced reliably on our machine.
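For context, this is roughly how I localized where the hang happens: a minimal sketch of the rank-tagged debug prints I added around the training step (the helper name and messages are mine, not from the repository; `loss` and `optimizer` come from the repo's own stage-3 loop):

```python
# Minimal sketch of the rank-tagged debug prints added around the stage-3
# training step. Assumes dist.init_process_group() has already been called
# by the launcher, as in the repo's distributed setup.
import torch.distributed as dist

def debug_step(step, loss, optimizer):
    rank = dist.get_rank()
    print(f"[rank {rank}] step {step}: before backward", flush=True)
    loss.backward()
    # The stuck process still prints this line ...
    print(f"[rank {rank}] step {step}: backward done", flush=True)
    optimizer.step()
    # ... but never reaches this one; the other ranks then block at the
    # next collective (all_reduce) waiting for it.
    print(f"[rank {rank}] step {step}: optimizer.step done", flush=True)
    optimizer.zero_grad()
```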
We have tried several different versions of the PyTorch Docker images from NVIDIA (https://ngc.nvidia.com/catalog/containers/nvidia:pytorch) (1.7.0, 1.8.0, and newer), and the problem persists. As both the code and the containers are vanilla, I can't tell which side the bug is coming from.
Update: just as I was typing these lines, a classmate of mine told me that after switching to another container with PyTorch 1.6.1 and CUDA 11.0 (tag 20.06, the oldest NVIDIA image that supports CUDA 11.0), the problem mysteriously disappears. I'm still posting this issue to warn future researchers: don't run this code on PyTorch > 1.6.1.
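For anyone who wants the working setup, this is roughly how we pull and run that container (the mount path is a placeholder, and the flags other than the tag are just our usual defaults, not requirements):

```bash
# NGC PyTorch container, tag 20.06 (the image that made the hang disappear for us)
docker pull nvcr.io/nvidia/pytorch:20.06-py3
docker run --gpus all -it --ipc=host \
    -v /path/to/this/repo:/workspace/repo \
    nvcr.io/nvidia/pytorch:20.06-py3
```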