awslabs / sockeye

Sequence-to-sequence framework with a focus on Neural Machine Translation based on PyTorch
https://awslabs.github.io/sockeye/
Apache License 2.0

PPL decreasing very slowly #1088

Closed: muramaso closed this issue 1 year ago

muramaso commented 1 year ago

I am trying to switch from Sockeye 1 to Sockeye 3, but I found that with the same data and arguments, the PPL of Sockeye 3 decreases very slowly during training, much more slowly than with Sockeye 1. I am really confused. Here are the training commands I used.

Sockeye 1:

sockeye/train.py --source train_src.src --target train_zh.zh --validation-source valid_src.src --validation-target valid_zh.zh --device-ids 1 2 3 4 --output sockeye1_enzh --batch-type sentence --batch-size 600 --max-num-checkpoint-not-improved 8 --metrics perplexity --optimized-metric perplexity --checkpoint-interval 10000 --optimizer adam --decode-and-evaluate -1 --max-seq-len 100:100 --loss cross-entropy --seed 1 --shared-vocab

Sockeye 3:

torchrun --no_python --nproc_per_node 4 sockeye-train --prepared-data s3_prepare --validation-source valid_src.src --validation-target valid_zh.zh --output sockeye3_enzh --amp --batch-type sentence --batch-size 150 --update-interval 1 --checkpoint-interval 10000 --max-num-checkpoint-not-improved 8 --optimizer-betas 0.9:0.98 --dist --learning-rate-scheduler-type inv-sqrt-decay --seed 1 --weight-tying-type none --num-words 50000:50000 --quiet-secondary-workers

mjdenkowski commented 1 year ago

The WMT 2014 English-German tutorial includes commands for training a big transformer model using a standard recipe. We recommend this as a starting point for training Sockeye models on new data sets.
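For reference, the tutorial trains a big transformer with a command along the lines of the sketch below. This is only a sketch: the architecture, batching, and learning-rate numbers are assumed placeholders in the spirit of the recipe, so take the exact values from the WMT 2014 English-German tutorial itself. The data and output paths reuse the ones from the commands above.

```bash
# Sketch of a big-transformer recipe in the style of the Sockeye 3 WMT 2014 tutorial.
# All numeric values below are illustrative placeholders, not the tutorial's exact numbers.
torchrun --no_python --nproc_per_node 4 sockeye-train \
    --prepared-data s3_prepare \
    --validation-source valid_src.src --validation-target valid_zh.zh \
    --output sockeye3_enzh_big \
    --num-layers 6:6 \
    --transformer-model-size 1024 \
    --transformer-attention-heads 16 \
    --transformer-feed-forward-num-hidden 4096 \
    --amp --dist --quiet-secondary-workers \
    --batch-type max-word --batch-size 5000 --update-interval 2 \
    --optimizer-betas 0.9:0.98 \
    --initial-learning-rate 0.06325 \
    --learning-rate-scheduler-type inv-sqrt-decay --learning-rate-warmup 4000 \
    --checkpoint-interval 500 --max-num-checkpoint-not-improved 20 \
    --seed 1
```

The main differences from the Sockeye 3 command above are token-based batching with gradient accumulation, explicit big-transformer dimensions, and an explicit initial learning rate and warmup for the inv-sqrt-decay scheduler.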

muramaso commented 1 year ago

Got it, but I still want to know which arguments might affect how quickly the PPL decreases. Is there a common pattern where a big transformer's PPL always decreases more slowly because it has more parameters and larger internal settings?

mjdenkowski commented 1 year ago

Learning rate and scheduler can both affect the model's learning curve by changing the size of the parameter update at each training step. From the arguments, it looks like the Sockeye 1 command is using the default learning rate and scheduler while the Sockeye 3 command is using the default learning rate with the inv-sqrt-decay scheduler. It also looks like both commands are using the default model size and architecture.
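For intuition: with an inverse-square-root schedule of the kind introduced with the original Transformer, the per-step learning rate is roughly lr(t) = d_model^(-1/2) * min(t^(-1/2), t * warmup^(-3/2)), i.e. it ramps up linearly over the warmup steps and then decays as 1/sqrt(t). Sockeye's inv-sqrt-decay scheduler follows this general shape (its exact parameterization may differ), so a run with a small base rate or a long warmup can show a noticeably flatter PPL curve early in training than a run with a flat learning rate. If you want to rule the schedule out as the cause, you can set the learning-rate options explicitly instead of relying on defaults. A minimal sketch, with placeholder values rather than recommendations:

```bash
# Pin the learning-rate behavior explicitly so the runs are comparable.
# The numbers below are placeholders, not recommendations; "..." stands for
# the rest of the training arguments shown earlier in this thread.
torchrun --no_python --nproc_per_node 4 sockeye-train ... \
    --initial-learning-rate 0.06325 \
    --learning-rate-scheduler-type inv-sqrt-decay \
    --learning-rate-warmup 4000
```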

Defaults for all of these settings may be updated in newer versions of Sockeye. You can check the training logs to see details about the model and training setup for each run.
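If it helps, one quick way to compare the two setups side by side is to diff the resolved arguments that each run records in its output directory. The args.yaml filename below is an assumption about where Sockeye stores them; if your versions do not write that file, grep the training log in each output directory instead.

```bash
# Compare the resolved training arguments recorded by the two runs.
# args.yaml is an assumed filename; fall back to the training log if it is absent.
diff sockeye1_enzh/args.yaml sockeye3_enzh/args.yaml

# Or inspect just the learning-rate-related settings of one run:
grep -i "learning" sockeye3_enzh/args.yaml
```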