Open RulinShao opened 2 years ago
I can load the saved checkpoint and resume training, but the NaN does not reappear at the same iteration; instead, it appears every 16900 iterations. That is, after resuming training from the checkpoint saved at the 10000th iteration, NaN was reported at the 26900th iteration. Any insight on this?
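To check whether the NaN onset is tied to a fixed iteration count after resume (as observed above) rather than to a specific data sample, it can help to log the absolute iteration at which the loss first turns NaN. The helper below is a hypothetical sketch (not part of MMF) using only the standard library:

```python
import math

def first_nan_iteration(losses, start_iter=0):
    """Return the absolute iteration index of the first NaN loss, or None.

    `losses` is the sequence of per-iteration loss values since resuming;
    `start_iter` is the iteration the checkpoint was saved at.
    """
    for offset, loss in enumerate(losses):
        if math.isnan(loss):
            return start_iter + offset
    return None

# Toy example matching the report: resume at iteration 10000,
# NaN appears 16900 iterations later.
losses = [1.0] * 16900 + [float("nan")]
print(first_nan_iteration(losses, start_iter=10000))  # 26900
```

If the returned value minus `start_iter` is constant across different checkpoints, that points at something periodic in the optimization or data pipeline rather than a single corrupt example.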
Instructions To Reproduce the Issue:
Hi, thanks for the code! I tried to reproduce the UniT vqa2 single-task training example given in the docs: https://mmf.sh/docs/projects/unit/ The default setting uses a batch size of 64 on 64 GPUs. I want to reproduce the same result on 8 GPUs combined with gradient accumulation, by setting the update frequency to 8. My script:
However, after setting
training.update_frequency
to 8 or 4, NaN appeared in the loss and the training terminated. The log is below:
Expected behavior:
It should finish training normally and get the expected results, as in the UniT paper, Table 1, line 1.
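For reference, the intent of the setup above is that 8 GPUs with update frequency 8 should be equivalent (for the gradient direction) to 64 GPUs at the original batch size. The pure-Python sketch below illustrates this for a simple linear model; the model and loss are illustrative assumptions, not MMF code:

```python
# Sketch: accumulating gradients over update_frequency micro-batches and
# averaging gives the same update as one large batch, for a linear model.
import random

random.seed(0)
w = 0.5
data = [(x, 2.0 * x) for x in (random.uniform(-1, 1) for _ in range(64))]

def grad(w, batch):
    # d/dw of the mean squared error 0.5 * (w*x - y)**2 over the batch
    return sum((w * x - y) * x for x, y in batch) / len(batch)

# One big batch of 64 (the 64-GPU setting)
g_big = grad(w, data)

# Eight micro-batches of 8, gradients accumulated then averaged
# (the 8-GPU setting with training.update_frequency = 8)
update_frequency = 8
micro = [data[i * 8:(i + 1) * 8] for i in range(update_frequency)]
g_acc = sum(grad(w, b) for b in micro) / update_frequency

print(abs(g_big - g_acc) < 1e-12)  # True: the two updates match
```

So in exact arithmetic the schedules match; any divergence in practice would come from floating-point accumulation order, learning-rate scheduling, or batch-dependent components such as batch norm, which may be where the NaN originates.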
Environment:
I use an AWS p3.16xlarge instance with 8 V100s of 16 GB memory each. My environment is built strictly following the MMF installation instructions: