facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

FloatingPointError: gradients are Nan/Inf #4118

Open hjc3613 opened 2 years ago

hjc3613 commented 2 years ago

❓ Questions and Help

What is your question?

When I train an English-to-Chinese model using the transformer_iwslt_de_en architecture, the error occurs after 6 steps, as shown in the following picture: error

What have you tried?

The training args were copied from fairseq/examples/translation/readme.md. I only added two new args, which are marked in the following picture: train_shell

What's your environment?

zhanchey commented 2 years ago

Same question. Have you solved it?

hjc3613 commented 2 years ago

I re-tokenized the training data, making sure punctuation is separated from its adjacent words, and this resolved the problem. Now I face another problem: the ppl and BLEU are disappointing. BLEU is about 8, much worse than the 20 I get from opennmt-tf.
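
For readers hitting the same error, here is a minimal sketch of the kind of re-tokenization described above. It is not the poster's actual script: the helper name `separate_punctuation` and the chosen punctuation set are assumptions made for illustration, and it is not part of fairseq.

```python
import re

# Hypothetical helper (not a fairseq API): the punctuation set below is an
# assumption covering common ASCII and full-width Chinese marks.
_PUNCT = re.compile(r'([,.!?;:()"，。！？；：（）《》“”、])')


def separate_punctuation(line: str) -> str:
    """Pad each punctuation mark with spaces so it becomes its own token."""
    spaced = _PUNCT.sub(r" \1 ", line)
    # Collapse the runs of whitespace introduced by the padding.
    return " ".join(spaced.split())


if __name__ == "__main__":
    print(separate_punctuation("Hello, world!"))
    print(separate_punctuation("你好，世界。"))
```

This naive rule also splits decimals such as "3.14" and abbreviations, so a dedicated tokenizer is likely a better choice for real training data. Whatever is used, the same preprocessing should be applied to the train, validation, and test sets.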

NTR0314 commented 1 year ago

I had the same problem when using an Apex install that was built for a different GPU (A100 vs. V100).

My solution was therefore to create separate conda environments for the different GPUs.