joeforan76 opened this issue 3 years ago
This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!
bump
What is your question?
[table defining Scenarios A, B, and C not recovered; only a "Frequency" column header survived extraction]
My understanding is that all three should be equivalent, but while I am able to train to convergence with similar results under Scenarios A and B, under Scenario C I get exploding losses on the third epoch. I tried varying the random seed to rule out chance, but I consistently got the same result; all other hyperparameters are kept the same. As Scenario C gives me better utilisation of GPU memory, I would like to be able to use it. Perhaps I can get it to converge by adjusting the learning rate and other hyperparameters, but first I'd like to understand why it behaves differently from the other two setups.
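For reference, here is a minimal sketch (plain PyTorCh fp32, with hypothetical toy shapes rather than the reporter's actual configuration) of the equivalence being assumed: accumulating gradients over k micro-batches, with each micro-batch loss divided by k, should reproduce the gradient of one update over the full batch.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
model_a = torch.nn.Linear(4, 1)
model_b = torch.nn.Linear(4, 1)
model_b.load_state_dict(model_a.state_dict())  # identical starting weights

x = torch.randn(8, 4)
y = torch.randn(8, 1)

# One update over the full batch (roughly Scenarios A/B).
F.mse_loss(model_a(x), y).backward()

# The same data as 2 micro-batches with gradient accumulation
# (roughly Scenario C, i.e. a higher update frequency).
for xb, yb in zip(x.chunk(2), y.chunk(2)):
    # Divide each micro-batch loss by the number of micro-batches so the
    # accumulated (summed) gradients match the full-batch mean.
    (F.mse_loss(model_b(xb), yb) / 2).backward()

for p_a, p_b in zip(model_a.parameters(), model_b.parameters()):
    assert torch.allclose(p_a.grad, p_b.grad, atol=1e-6)
```

In fp32 the two paths agree to numerical precision; under fp16 with dynamic loss scaling they need not, which is consistent with what is reported below.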
The log output just before the failure under Scenario C is:

[Scenario C log snippet not recovered]

whereas for approximately the equivalent step under Scenario B it is:

[Scenario B log snippet not recovered]
Note that the loss_scale and gnorm values are vastly different: under Scenario C the loss scale drops to a very low value very quickly, whereas in the other scenarios it oscillates between 8.0 and 16.0.
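For context, dynamic loss scaling (the scheme used by fp16 trainers such as fairseq's) typically halves the scale whenever an overflow (inf/NaN gradient) is detected and grows it again after a window of overflow-free updates. A toy sketch of the mechanism, with illustrative values rather than fairseq's actual defaults:

```python
class DynamicLossScaler:
    """Toy sketch of dynamic loss scaling; init_scale and scale_window
    are illustrative, not fairseq's actual defaults."""

    def __init__(self, init_scale=2.0 ** 7, scale_window=2000, scale_factor=2.0):
        self.scale = init_scale
        self.scale_window = scale_window
        self.scale_factor = scale_factor
        self._good_steps = 0

    def update(self, found_overflow: bool) -> None:
        if found_overflow:
            # Gradients overflowed: the update is skipped and the scale shrinks.
            self.scale /= self.scale_factor
            self._good_steps = 0
        else:
            # After a run of overflow-free updates, grow the scale again.
            self._good_steps += 1
            if self._good_steps % self.scale_window == 0:
                self.scale *= self.scale_factor


scaler = DynamicLossScaler()
for step_overflowed in (True, True, True):
    scaler.update(step_overflowed)
print(scaler.scale)  # 16.0: three consecutive overflows halve 128 three times
```

A scale that keeps collapsing toward very small values means nearly every update is overflowing; since with gradient accumulation the overflow check runs on the summed micro-batch gradients, the frequency of detected overflows can plausibly differ from the non-accumulated runs, though that is speculation about the cause here.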
What's your environment?
- fairseq Version: v0.10.2
- PyTorch Version: 1.6.0 (pytorch/pytorch:1.6.0-cuda10.1-cudnn7-devel Docker image)
- How you installed fairseq (pip, source): source
- Build command you used (if compiling from source): pip install --editable ./