OpenNMT / OpenNMT-py

Open Source Neural Machine Translation and (Large) Language Models in PyTorch
https://opennmt.net/
MIT License

High training/validation PPL observed with v3 compared to v2 #2591

Closed: robertBrnnn closed this issue 3 weeks ago

robertBrnnn commented 3 weeks ago

Hi,

I've been migrating to v3 and have noticed extremely high training and validation perplexity with v3 compared to v2 when training on the same data/vocabs. I initially thought it might be a config difference between v2 and v3 that I missed, but after multiple attempts, no config change I've made has reduced the PPL. Accuracy scores are similar to those we get with v2.

For instance, here are logs for the same data trained with v2 and v3, at the same step. v3.5:

[2024-05-05 07:33:54,514 INFO] Step 385000/700000; acc: 80.9; ppl:   8.6; xent: 2.1; lr: 0.00014; sents:  450748; bsz: 2920/2955/141; 35454/35879 tok/s; 1028266 sec;
[2024-05-05 07:36:30,486 INFO] valid stats calculation
                           took: 155.96873307228088 s.
[2024-05-05 07:36:30,490 INFO] Train perplexity: 9.37828
[2024-05-05 07:36:30,490 INFO] Train accuracy: 79.2356
[2024-05-05 07:36:30,490 INFO] Sentences processed: 4.27918e+08
[2024-05-05 07:36:30,490 INFO] Average bsz: 2919/2948/139
[2024-05-05 07:36:30,490 INFO] Validation perplexity: 10.0459
[2024-05-05 07:36:30,491 INFO] Validation accuracy: 76.9249
[2024-05-05 07:36:30,501 INFO] Stalled patience: 1/5

v2.3:

[2024-05-20 08:08:51,682 INFO] Step 385000/700000; acc:  82.04; ppl:  2.03; xent: 0.71; lr: 0.00014; 36387/36815 tok/s; 1004006 sec
[2024-05-20 08:10:31,388 INFO] Validation perplexity: 2.81274
[2024-05-20 08:10:31,389 INFO] Validation accuracy: 77.7568
[2024-05-20 08:10:31,389 INFO] Stalled patience: 3/5

This is the model config:

accum_count: 8
accum_steps: 0
adam_beta1: 0.9
adam_beta2: 0.998
batch_size: 4096
batch_size_multiple: 1
batch_type: tokens
bucket_size: 32768
decay_method: noam
decoder_type: transformer
dropout: 0.1
early_stopping: 5
encoder_type: transformer
heads: 8
hidden_size: 512
keep_checkpoint: 20
label_smoothing: 0.1
layers: 6
learning_rate: 2.0
max_generator_batches: 0
max_grad_norm: 0.0
normalization: tokens
optim: adam
param_init: 0.0
param_init_glorot: 'true'
pool_factor: 8192
position_encoding: 'true'
queue_size: 1024
report_every: 100
save_checkpoint_steps: 5000
save_model: <SAVE_MODEL_PATH>
seed: 1234
self_attn_type: scaled-dot
share_vocab: true
src_seq_length: 200
src_vocab: <PATH_TO_VOCAB>
src_vocab_size: 38000
tgt_seq_length: 200
train_steps: 700000
transformer_ff: 2048
valid_batch_size: 16
valid_steps: 5000
warmup_steps: 8000
word_vec_size: 512

In the case of the v2 config, self_attn_type is removed and hidden_size is renamed to the v2 parameter rnn_size, so the v2 run differs only as in the snippet below. Have I missed some obvious configuration parameter? Or could there be something else that explains the difference between versions?
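Concretely, the v2 config file differs from the one above only in these lines:

# v2 parameter names for the same settings (everything else identical)
rnn_size: 512            # the v3 hidden_size param
# (self_attn_type removed for the v2 run)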

Thanks

vince62s commented 3 weeks ago

Hi @robertBrnnn. If I recall correctly, there were two aspects: 1) the normalization was incorrectly calculated, and 2) validation was computed without label smoothing, hence not comparable with training (but again, this was a long time ago, I may be wrong). Also, we are now switching to a spin-off of OpenNMT-py here: https://github.com/eole-nlp/eole
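To illustrate the second point, here is a minimal sketch in plain PyTorch (not OpenNMT-py's actual loss code; the shapes and values are made up) showing that the same predictions report a much higher perplexity once the label-smoothing term is included in the loss:

import torch
import torch.nn as nn

torch.manual_seed(0)
vocab = 38000                          # same as src_vocab_size in the config above
logits = torch.randn(64, vocab)        # stand-in decoder outputs for 64 tokens
tgt = torch.randint(vocab, (64,))
logits[torch.arange(64), tgt] += 20.0  # make every prediction confidently correct

plain = nn.CrossEntropyLoss()(logits, tgt)
smoothed = nn.CrossEntropyLoss(label_smoothing=0.1)(logits, tgt)
print(f"PPL without smoothing: {plain.exp():.2f}")     # close to 1
print(f"PPL with smoothing:    {smoothed.exp():.2f}")  # several times higher

# With eps = 0.1 the smoothed loss is bounded below by roughly
# 0.9*ln(1/0.9) + 0.1*ln(vocab/0.1) ~ 1.4 nats (PPL ~ 4) no matter how
# good the model is, so a smoothed PPL is not comparable to an unsmoothed one.

Under that reading, a v3-style ppl of 8.6 next to a v2-style ppl of 2.03 at similar accuracy would be expected rather than a regression.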

robertBrnnn commented 3 weeks ago

Hi @vince62s , Thanks for your reply.

Just to clarify the above: are you saying normalization was incorrectly calculated and validation was computed without label smoothing in v3, or in v2? We've continued with v2 for the time being, as we're getting the best results with it currently.

Eole looks great, I like the direction you're taking the project, looking forward to trying out the first release!

vince62s commented 3 weeks ago

We've continued with v2 for the time being as we're getting best results with it currently.

You should not get worse results with v3; I have always made sure we get the same results. The only thing I see above is the bucket_size: it is too small for v3. It should be > 200K (I use 262144) to make sure examples are properly shuffled, but otherwise you should get similar results.
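For example, in the config above:

bucket_size: 262144    # was 32768; > 200K so examples are properly shuffled in v3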