Since publication, experimenting with the model has yielded a few insights, including some methods for avoiding training instability, especially when training on larger corpora.
Training instability usually manifests as NaN loss errors. If you hit this, some easy things to try:
1) Increase the batch size to at least 256
2) Reduce the learning rate to 1e-4 or 1e-5. If you are training over a very large corpus, this shouldn't affect representation quality much.
3) Use a learning rate scheduler; the slanted triangular scheduler has worked well for me. Make sure you tune the total number of epochs you train over.
4) Clamp the KLD to some maximum value (e.g. 1000) so it doesn't diverge
5) Use a different KLD annealing schedule (e.g. sigmoid)
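For tip 3, here is a minimal sketch of the slanted triangular schedule from ULMFiT (Howard & Ruder): a short linear warm-up followed by a longer linear decay. The `cut_frac` and `ratio` values are the paper's defaults, used here illustratively; tune them alongside your epoch count.

```python
def slanted_triangular_lr(step, total_steps, max_lr=1e-4, cut_frac=0.1, ratio=32):
    """Slanted triangular LR: linear warm-up over the first `cut_frac` of
    training, then linear decay. `ratio` sets how much smaller the lowest
    LR is than `max_lr`. All hyperparameter values are illustrative."""
    cut = int(total_steps * cut_frac)
    if step < cut:
        p = step / cut  # warm-up fraction
    else:
        p = 1 - (step - cut) / (total_steps - cut)  # decay fraction
    return max_lr * (1 + p * (ratio - 1)) / ratio
```

In practice you would wrap this in your framework's scheduler interface (e.g. a PyTorch `LambdaLR`) and call it once per optimizer step.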
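Tips 4 and 5 can be sketched together. This is an illustrative pure-Python version (in PyTorch you would use `torch.clamp` on the KLD tensor instead of `min`); the `midpoint` and `steepness` parameters of the sigmoid schedule are assumptions to tune, not values from the original setup.

```python
import math

def sigmoid_anneal(step, total_steps, midpoint=0.5, steepness=10.0):
    """Sigmoid KLD annealing weight, rising smoothly from ~0 to ~1.
    Crosses 0.5 at `midpoint * total_steps`; `steepness` controls how
    sharp the transition is. Both hyperparameters are illustrative."""
    x = step / total_steps
    return 1.0 / (1.0 + math.exp(-steepness * (x - midpoint)))

def stabilized_kld(kld, max_kld=1000.0):
    """Clamp the KL divergence term so a diverging KLD cannot blow up
    the loss into NaN (tip 4). 1000 is the example cap from above."""
    return min(kld, max_kld)
```

The total loss would then be something like `recon_loss + sigmoid_anneal(step, total_steps) * stabilized_kld(kld)`.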
Document these insights in the repo for future use.