allenai / vampire

Variational Methods for Pretraining in Resource-limited Environments
Apache License 2.0

Document training instability #51

Closed; kernelmachine closed this issue 4 years ago

kernelmachine commented 4 years ago

I've picked up a few insights from playing around with the model since publication, including some ways to work around training instability, especially when training on larger corpora.

Training instability usually manifests as NaN loss errors. To circumvent this, here are some easy things to try:

1) Increase the batch size to at least 256.
2) Reduce the learning rate to 1e-4 or 1e-5. If you are training over a very large corpus, this shouldn't affect representation quality much.
3) Use a learning rate scheduler; a slanted triangular scheduler has worked well for me. Make sure you tinker with the total number of epochs you train over.
4) Clamp the KLD to some max value (e.g. 1000) so it doesn't diverge.
5) Use a different KLD annealing schedule (e.g. sigmoid). A rough sketch of 3)–5) follows below.
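For future reference, here is a rough PyTorch sketch of what 3)–5) could look like. This is not VAMPIRE's actual implementation; the function names and the default constants (`sharpness`, `cut_frac`, `ratio`, the 1000 clamp) are illustrative assumptions.

```python
# Illustrative sketch only (not VAMPIRE's code): clamped KLD, sigmoid KLD annealing,
# and a slanted triangular learning rate schedule, in plain PyTorch.
import math

import torch
from torch import optim
from torch.optim.lr_scheduler import LambdaLR


def kld_weight_sigmoid(step: int, total_steps: int, sharpness: float = 10.0) -> float:
    """Sigmoid KLD annealing: the weight rises smoothly from ~0 to ~1 over training."""
    midpoint = total_steps / 2.0
    return 1.0 / (1.0 + math.exp(-sharpness * (step - midpoint) / total_steps))


def clamped_kld(mu: torch.Tensor, logvar: torch.Tensor, max_kld: float = 1000.0) -> torch.Tensor:
    """KL divergence of a diagonal Gaussian against N(0, I), clamped so it cannot diverge."""
    kld = -0.5 * torch.sum(1.0 + logvar - mu.pow(2) - logvar.exp(), dim=-1)
    return torch.clamp(kld, max=max_kld).mean()


def slanted_triangular(optimizer: optim.Optimizer,
                       total_steps: int,
                       cut_frac: float = 0.1,
                       ratio: float = 32.0) -> LambdaLR:
    """LR warms up linearly for the first `cut_frac` of steps, then decays linearly
    (the slanted triangular schedule of Howard & Ruder, 2018)."""
    cut = max(1, int(total_steps * cut_frac))

    def lr_lambda(step: int) -> float:
        if step < cut:
            p = step / cut
        else:
            p = max(0.0, 1.0 - (step - cut) / max(1, total_steps - cut))
        return (1.0 + p * (ratio - 1.0)) / ratio

    return LambdaLR(optimizer, lr_lambda)


# Example wiring inside a training loop (model / data loader names are placeholders):
# optimizer = optim.Adam(model.parameters(), lr=1e-4)
# scheduler = slanted_triangular(optimizer, total_steps=num_epochs * len(loader))
# for step, batch in enumerate(loader):
#     recon_loss, mu, logvar = model(batch)
#     loss = recon_loss + kld_weight_sigmoid(step, total_steps) * clamped_kld(mu, logvar)
#     loss.backward()
#     optimizer.step()
#     scheduler.step()
#     optimizer.zero_grad()
```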

Document these insights in the repo for future use.

talolard commented 4 years ago

Would be great to add this to the main README. I just followed along and scratched my head for a while about NaNs; a learning rate of 1e-4 helped.

kernelmachine commented 4 years ago

This is addressed in #62, thanks for the suggestion!