Closed: addf400 closed this issue 5 years ago
We used exactly the same corpus.
The main difference is that we trained the base model with much bigger batch sizes (8x larger) for fewer iterations (300k instead of 2.4M, i.e. 8x fewer). We also had to change the learning rates accordingly.
When we tried to do the same for the large models, we ran into many stability issues. It is quite possible that some more optimization tricks could improve the large models as well.
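The trade-off described above (8x larger batches, 8x fewer steps, learning rate adjusted to match) is commonly handled with a linear scaling heuristic. A minimal sketch, with hypothetical base values since the reply does not state the original batch size or learning rate:

```python
# Sketch of the linear LR scaling heuristic when the batch size grows.
# base_batch and base_lr are hypothetical placeholders, NOT the authors'
# actual settings; only the 8x factor and step counts come from the reply.
base_batch = 256        # hypothetical original batch size
base_lr = 1e-4          # hypothetical original learning rate
base_steps = 2_400_000  # original iteration count, per the reply

scale = 8                          # "8x larger" batches
new_batch = base_batch * scale     # bigger batches per step
new_lr = base_lr * scale           # linear scaling rule (a common heuristic)
new_steps = base_steps // scale    # 8x fewer iterations -> 300k

print(new_batch, new_lr, new_steps)
```

In practice a warmup schedule is usually needed on top of this, and as the reply notes, the same recipe destabilized the large models, so the scaling rule is not guaranteed to transfer across model sizes.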
Can you tell me what learning rate you used for the base model training? It would save us a lot of resources and make reproduction much easier. Thank you very much!
LR = 0.0005
The base model performs very well! What training steps, batch size, and learning rate did you use for it? Are they the same as for the large model? Did you use any corpus other than Wikipedia and BookCorpus to train the base model?