FloatingPointError: gradients are Nan/Inf

hjc3613 commented 2 years ago

❓ Questions and Help

Before asking:

search the issues.
search the docs.

What is your question?

When I train an English-to-Chinese model using the transformer_iwslt_de_en architecture, the error occurs after 6 steps, as shown in the following picture: error

What have you tried?

The training args were copied from fairseq/examples/translation/readme.md. I only added two new args, which are marked as follows: train_shell

What's your environment?

fairseq Version (latest):
PyTorch Version (1.8.1+cu111)
OS (e.g., Linux):
How you installed fairseq ( git clone & pip install ...):
Python version: 3.7
GPU models and configuration: K80, cuda11.1

zhanchey commented 2 years ago

Same problem here. Have you solved it?

hjc3613 commented 2 years ago

I re-tokenized the training data so that punctuation is separated from its adjacent word, and this resolved the problem. Now I face another problem: the ppl and BLEU are disappointing. BLEU is about 8, much worse than the 20 obtained with opennmt-tf.
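
A minimal sketch of that kind of re-tokenization, assuming a simple regex rule is enough. The helper name `separate_punctuation` and the punctuation set are illustrative assumptions, not the actual script used for this data:

```python
import re

# Illustrative punctuation set (an assumption): common ASCII and
# full-width Chinese punctuation marks.
_PUNCT = re.compile(r'([.,!?;:()"，。！？；：（）])')


def separate_punctuation(line: str) -> str:
    """Pad each punctuation mark with spaces so it becomes its own token."""
    spaced = _PUNCT.sub(r" \1 ", line)
    # Collapse the extra whitespace introduced by the padding.
    return " ".join(spaced.split())


if __name__ == "__main__":
    print(separate_punctuation("Hello, world!"))   # Hello , world !
    print(separate_punctuation("你好，世界。"))      # 你好 ， 世界 。
```

Note that a rule this simple also splits decimals such as "3.14" into "3 . 14", so a real tokenizer is probably a better choice for actual training data.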

NTR0314 commented 1 year ago

I had the same problem when using an Apex install that was built for a different GPU (A100 vs. V100).

My solution was therefore to create separate conda environments for the different GPUs.

facebookresearch / fairseq