Closed rondogency closed 3 years ago
At the moment there isn't a good way to enforce reproducibility in distributed training because compute order differs across workers. While complete reproducibility is hard to obtain in the current setting, two ideas for reducing run-to-run variance are: 1) initialize once, store the initial random weights, and always train from that snapshot; and 2) enforce sample ordering in the data loader by processing and storing the processed samples beforehand.
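A minimal sketch of these two ideas, using NumPy arrays as stand-ins for model weights and dataset indices (all names here are illustrative, not part of any GluonNLP API):

```python
import numpy as np

# Idea 1: draw the initial weights once with a fixed seed and reuse
# that array for every run (in practice you would save it to disk
# and load it at the start of each training job).
initial_weights = np.random.default_rng(0).standard_normal((4, 4))

# Idea 2: fix the order of samples seen by the data loader by
# shuffling dataset indices with a fixed seed before training starts.
sample_order = np.random.default_rng(0).permutation(10)
```

With both the starting point and the data order pinned, two runs differ only in non-deterministic compute order (e.g. floating-point reduction order), which is a much smaller source of variance.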
@rondogency Has the training been stabilized after you have fixed the random seed?
@szha thanks for the answer!
@sxjscience yes, I am using what we have in GluonCV (https://github.com/dmlc/gluon-cv/blob/master/gluoncv/utils/random.py), and the loss is more stable now (less variance across multiple runs).
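For reference, a minimal version of such a seed helper might look like the sketch below. The real GluonCV utility additionally seeds MXNet's RNG via mx.random.seed, which is omitted here so the snippet stays framework-free:

```python
import random

import numpy as np


def set_seed(seed=0):
    """Fix the Python and NumPy RNGs so repeated runs draw the
    same random numbers (the GluonCV helper also seeds MXNet)."""
    random.seed(seed)
    np.random.seed(seed)


# Seeding twice with the same value reproduces the same draws.
set_seed(42)
a = np.random.rand(3)
set_seed(42)
b = np.random.rand(3)
```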
I suggest adding the same utility to GluonNLP and allowing users to pass a random seed to the BERT run_pretraining.py script.
Thanks, we will ensure that all scripts use this set_seed function: https://github.com/dmlc/gluon-nlp/blob/09f343564e4f735df52e212df87ca073a824e829/src/gluonnlp/utils/misc.py#L187-L191
I'll close this issue for now as it should have been solved in the master version. Feel free to reopen.
Description
I am running 8-node Horovod BERT pretraining on gluonnlp==0.10.0 for 10k steps, and found that across runs the final loss and accuracy after 10k steps are not stable; the MLM loss in particular varies a lot.
So I am wondering if there is a way to obtain stable loss and accuracy numbers for pretraining.
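One common pattern for wiring a --seed flag into a multi-worker Horovod script is to offset the base seed by the worker rank, so each worker gets a deterministic but distinct RNG stream. This is a sketch of that pattern, not code from run_pretraining.py; the rank is passed explicitly here, whereas in a Horovod script it would come from hvd.rank():

```python
import random

import numpy as np


def set_worker_seed(base_seed, rank):
    """Hypothetical helper: derive a per-worker seed from the
    user-supplied --seed and the worker rank, then seed the
    Python and NumPy RNGs (an MXNet script would also call
    mx.random.seed with the same value)."""
    seed = base_seed + rank
    random.seed(seed)
    np.random.seed(seed)
    return seed
```

Offsetting by rank keeps data shuffling and dropout deterministic per worker while avoiding every worker drawing identical random numbers.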
Error Message
Here are the logs from 4 runs:
To Reproduce
I am using SageMaker to launch Horovod; the shared config is:
Steps to reproduce
What have you tried to solve it?
Environment
We recommend using our script to collect diagnostic information. Run the following command and paste the output below: