ghrua opened this issue 2 years ago
Hi @ghrua, have you found a solution? I am facing the same problem...
Hi @jamfly
Yes, actually I think there are two solutions:
The second one is somewhat tricky, but it works for me.
Hi @ghrua, thank you for your suggestions, I will at least try the second one. I have questions regarding the training:
- Have you tried using fp16 from scratch, and does it bring the inf loss back to a normal scale?
- I noticed that you set update-freq to 3, but based on the description in the paper, they use tokens-per-sample 4096 with 8 GPUs. I know they said they changed it to 3072 for better performance, but is update-freq always set to 3?

Thank you in advance.
Yes, I have tried FP16 from scratch with many hyper-parameters, e.g., different values of warmup updates and clip norm, but they didn't work for me.
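For anyone who wants to try the same kind of sweep, here is a rough sketch of what I mean. It is not a verified recipe: the base flags are my recollection of the adaptive-inputs README, I substitute an inverse_sqrt schedule just to keep the sketch short, and the fp16 values are only examples of the knobs fairseq exposes (none of them avoided the inf loss for me).

```bash
# Sketch only, not a known-good fp16 recipe. Vary --warmup-updates, --clip-norm and
# the fp16 loss-scaling flags, and watch whether the loss scaler keeps collapsing
# (repeated "gradient overflow" messages) before the loss goes to inf.
fairseq-train data-bin/wikitext-103 \
    --task language_modeling \
    --arch transformer_lm_wiki103 \
    --criterion adaptive_loss \
    --optimizer nag --lr 1.0 --clip-norm 0.1 \
    --lr-scheduler inverse_sqrt --warmup-updates 16000 --warmup-init-lr 1e-07 \
    --max-tokens 3072 --tokens-per-sample 3072 --update-freq 3 \
    --sample-break-mode none --max-update 286000 --seed 1 \
    --fp16 --fp16-init-scale 4 --fp16-scale-window 256 --min-loss-scale 1e-4 \
    --save-dir checkpoints/adaptive_inputs_fp16
```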
In the batching section, the authors say "This gives an effective batch size of 65K tokens for WIKITEXT-103", and 65,000 tokens / 8 GPUs / 3,072 tokens-per-sample is around 2.6. I think that's why they set update-freq to 3.
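Spelling that arithmetic out (assuming "65K" really means 2^16 = 65,536 tokens, and that effective batch size = tokens-per-sample × GPUs × update-freq):

```bash
# Back-of-the-envelope check; 65,536 is an assumption for what "65K" means.
python3 -c "print(65536 / (8 * 3072))"   # ~2.67, which rounds up to --update-freq 3
```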
Hi @ghrua, I got it: they are using 4096 (tokens) × 8 (GPUs) × 2 (update-freq) = 65,536 ≈ 65K tokens. Anyway, thank you for your kind help and suggestions, I really appreciate it.
Can you replicate the results in the paper? I ran the same recipe as yours and got a test ppl of 29.14, but the paper reports 18.7.
Hi @Psycoy, sorry for the late reply. Mine was 19.7, which is close to 18.7 but still leaves a gap. Were you able to reproduce their results?
Yes, I can, as long as the update frequency is set correctly for the number of GPUs and the batch size.
Which script did you use to evaluate your model?
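For reference, the evaluation command I would expect here is the one from the same fairseq language-model README. A sketch from memory (the checkpoint path is a placeholder, and older fairseq versions spell --batch-size as --max-sentences):

```bash
# Standard WikiText-103 evaluation from the fairseq README, reproduced from memory.
# checkpoints/transformer_wikitext-103 is a placeholder path.
fairseq-eval-lm data-bin/wikitext-103 \
    --path checkpoints/transformer_wikitext-103/checkpoint_best.pt \
    --batch-size 2 \
    --tokens-per-sample 512 \
    --context-window 400
```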
❓ Questions and Help
What is your question?
Got `inf` loss and gradient overflow when running the code example for adaptive input representations with `--fp16`. I am trying to reproduce the results of Baevski and Auli, 2018, and the code example provided by fairseq works fine with `fp32`. However, the model does not train well when I use `fp16` to reduce the training time, following Baevski and Auli, 2018. Are there any tips for preventing the `inf` loss?

Code
Almost the same as the code in this link, except for the `fp16` argument.

Results:
What's your environment?
How you installed fairseq (`pip`, source): source