facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

Training fails when using left target padding ("--left-pad-target True") #2640

Open · alphadl opened 4 years ago

alphadl commented 4 years ago

❓ Questions and Help

Before asking:

  1. search the issues.
  2. search the docs.

What is your question?

When I train the vanilla Transformer base and big models with `--left-pad-target` set to True, fairseq reports an error: `FloatingPointError: Minimum loss scale reached (0.0001). Your loss is probably exploding. Try lowering the learning rate, using gradient clipping or increasing the batch size.`
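As I understand it, with `--fp16` fairseq uses a dynamic loss scaler that halves the scale whenever gradients overflow, and aborts once the scale falls below `--min-loss-scale` (the 0.0001 in the message). A minimal sketch of that mechanism, with illustrative names rather than fairseq's actual classes:

```python
# Minimal sketch of dynamic fp16 loss scaling (illustrative names only,
# not fairseq's actual implementation).
class DynamicLossScaler:
    def __init__(self, init_scale=128.0, min_scale=1e-4):
        self.scale = init_scale      # the loss is multiplied by this before backward()
        self.min_scale = min_scale   # corresponds to fairseq's --min-loss-scale

    def update(self, grads_overflowed: bool):
        if grads_overflowed:
            # Inf/NaN gradients: skip the optimizer step and halve the scale.
            self.scale /= 2.0
            if self.scale < self.min_scale:
                # This is where the reported error fires: the gradients keep
                # overflowing even at the smallest allowed scale, which usually
                # means the loss itself is diverging.
                raise FloatingPointError(
                    f"Minimum loss scale reached ({self.min_scale}). "
                    "Your loss is probably exploding."
                )
        # (The real scaler also grows the scale back up after a window of
        # overflow-free updates; omitted here.)
```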

Code

The training script for the base model I used is:

```bash
python train.py databin/ende/wmt14/ \
    -a transformer \
    --share-all-embeddings \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --lr 1e-3 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --optimizer adam --adam-betas '(0.9, 0.98)' \
    --task translation \
    --max-tokens 8192 --update-freq 2 \
    --dropout 0.3 \
    --encoder-layers 6 --encoder-embed-dim 512 \
    --decoder-layers 6 --decoder-embed-dim 512 \
    --fp16 \
    --ddp-backend=no_c10d \
    --max-source-positions 10000 --max-target-positions 10000 \
    --max-update 100000 --seed 1 \
    --save-dir checkpoint/ende/wmt14/ \
    --left-pad-target True
```
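The only stopgap I can think of is the knobs the error message itself points at, plus fairseq's fp16 scaler options. A hedged sketch (flag values are illustrative, and this does not address the root cause of the left-padding interaction):

```bash
# Hedged workaround sketch (not a confirmed fix):
#   --min-loss-scale    lower the abort threshold (default 0.0001)
#   --fp16-scale-window grow the scale back only after more overflow-free steps
#   --clip-norm         gradient clipping, as the error message suggests
#   --lr                lower learning rate, also suggested
python train.py databin/ende/wmt14/ \
    ...same model/task flags as above... \
    --fp16 --min-loss-scale 1e-6 --fp16-scale-window 512 \
    --clip-norm 0.1 --lr 5e-4
```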

#### What have you tried?

I also hit this issue when training the big model and when training with large batches (458k tokens).

#### What's your environment?

- fairseq Version (e.g., 1.0 or master): 0.9
- PyTorch Version (e.g., 1.0): 1.4
- OS (e.g., Linux): Linux
- How you installed fairseq (`pip`, source): source
- Build command you used (if compiling from source): `pip install --editable $fairseq_path`
- Python version: 3.7
alphadl commented 4 years ago

CC @NonvolatileMemory

NonvolatileMemory commented 4 years ago

Got the same bug when setting `--left-pad-target True`.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!

felixkreuk commented 2 years ago

@alphadl @NonvolatileMemory Have you managed to solve this issue?

stale[bot] commented 2 years ago

This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!