OpenNMT / OpenNMT-py

Open Source Neural Machine Translation and (Large) Language Models in PyTorch
https://opennmt.net/
MIT License

NaN values when training big transformer model #2559

Closed · PC91 closed this issue 5 months ago

PC91 commented 5 months ago

Hello,

I am training a vanilla Transformer on 4 GPUs. Training runs in FP16 with the following configuration:

# Training parameters
train_steps: 900000
save_checkpoint_steps: 1000
report_every: 1000
keep_checkpoint: 10
world_size: 4
gpu_ranks: [0, 1, 2, 3]
num_workers: 2

# Model parameters
decoder_type: transformer
encoder_type: transformer
word_vec_size: 1024
hidden_size: 1024
layers: 6
transformer_ff: 4096
heads: 16
dropout: 0.0
attention_dropout: 0.0
dropout_steps: 0
accum_count: [1]
accum_steps: [0]
label_smoothing: 0.1
share_decoder_embeddings: True
share_embeddings: True
model_dtype: fp16

# Optimizer and learning rate scheduler
optim: adam
adam_beta1: 0.9
adam_beta2: 0.998
param_init: 0.0
param_init_glorot: 'true'
position_encoding: 'true'
decay_method: noam
warmup_steps: 16000
learning_rate: 6
max_grad_norm: 2.0

# Batch parameters
batch_size: 54000
batch_type: tokens
normalization: tokens
parallel_mode: data_parallel
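As a sanity check on the config above, the learning rate reported in the logs below can be reproduced from the noam schedule. This is an illustrative sketch assuming the usual noam formula (`lr = learning_rate * d_model**-0.5 * min(step**-0.5, step * warmup**-1.5)`), which is how OpenNMT-py's `decay_method: noam` is commonly described; the function name `noam_lr` is mine:

```python
def noam_lr(step, base_lr=6.0, d_model=1024, warmup=16000):
    """Noam schedule: linear warmup, then inverse-sqrt decay,
    scaled by the base learning rate and the model width."""
    return base_lr * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# With learning_rate=6, hidden_size=1024, warmup_steps=16000,
# the rate at step 487000 comes out to roughly 0.00027,
# matching the "lr: 0.00027" in the training logs.
print(round(noam_lr(487000), 5))
```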

The perplexity and cross-entropy become NaN after about 488,000 steps:

[2024-01-28 08:32:40,564 INFO] Step 487000/900000; acc: 81.4; ppl:   7.7; xent: 2.0; lr: 0.00027; sents: 5745567; bsz: 32590/36982/1436; 161906/183723 tok/s; 392485 sec;
[2024-01-28 08:33:01,690 INFO] Saving checkpoint /result/20240123-141317/model/model_step_487000.pt
[2024-01-28 08:46:03,824 INFO] Step 488000/900000; acc: 81.4; ppl:   7.7; xent: 2.0; lr: 0.00027; sents: 5760388; bsz: 32688/37163/1440; 162778/185060 tok/s; 393288 sec;
[2024-01-28 08:46:25,633 INFO] Saving checkpoint /result/20240123-141317/model/model_step_488000.pt
[2024-01-28 08:59:27,117 INFO] Step 489000/900000; acc: 81.4; ppl:   nan; xent: nan; lr: 0.00027; sents: 5797358; bsz: 32657/37117/1449; 162615/184824 tok/s; 394091 sec;
[2024-01-28 08:59:48,568 INFO] Saving checkpoint /result/20240123-141317/model/model_step_489000.pt
[2024-01-28 09:12:48,224 INFO] Step 490000/900000; acc: 81.4; ppl:   nan; xent: nan; lr: 0.00027; sents: 5758101; bsz: 32770/37249/1440; 163625/185988 tok/s; 394893 sec;

This error does not happen when I train in FP32. What could cause it with FP16 (exploding gradients, a conversion error, etc.)?

Thank you, Thai-Chau
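One plausible FP16 failure mode is plain numeric overflow: float16 tops out at 65504, so a large intermediate value (an activation, logit, or unscaled gradient) overflows to inf, and subsequent inf arithmetic yields nan, which then propagates through the loss. A minimal NumPy sketch of the mechanism (illustrative only, not OpenNMT code):

```python
import numpy as np

with np.errstate(over="ignore", invalid="ignore"):
    x = np.float16(300.0)
    y = x * x    # 300 * 300 = 90000 exceeds the float16 max (65504) -> inf
    z = y - y    # inf - inf -> nan, which poisons every later computation

print(np.isinf(y), np.isnan(z))
```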

vince62s commented 5 months ago

The most likely cause is a learning rate that is too high, but there could be other reasons. I would recommend using fusedadam instead of adam for fp16; you need to install apex (see the README).
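Following that suggestion, the change to the config above might look like this (a hypothetical sketch in the same YAML format; `fusedadam` only works once NVIDIA apex is installed):

```yaml
# hypothetical sketch of the suggested optimizer swap for fp16 training
optim: fusedadam     # requires NVIDIA apex (see the OpenNMT-py README)
# lowering learning_rate or increasing warmup_steps may also improve stability
```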