facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

Strange behaviour in translation task on TPU #4324

Open steinst opened 2 years ago

steinst commented 2 years ago

πŸ› Bug

I'm training on TPU using the km-en and ps-en datasets from the WMT 2020 shared task on parallel corpus filtering. I am using the hyperparameters from the paper, but the models don't seem to train properly on the TPUs. The translation output is just a sequence of the same symbols repeated multiple times.

Example:

S-2198 αžœαžΆαž›αžαŸ’αžŸαžΆαž…αŸ‹αž“αŸ…αžαŸ†αž”αž“αŸ‹αž”αŸ‰αžΌαž› (αžαŸ’αžšαžΌαžœαž”αžΆαž“αž‚αŸαž˜αžΎαž›αžƒαžΎαž‰αžαžΆαž‡αžΆ "αžœαžΆαž›αžαŸ’αžŸαžΆαž…αŸ‹αžαŸ’αžšαž‡αžΆαž€αŸ‹") αž˜αžΆαž“αž›αž€αŸ’αžαžŽαŸˆαžŸαŸ’αžšαžŠαŸ€αž„αž‚αŸ’αž“αžΆαž›αžΎαž€αž›αŸ‚αž„αžαŸ‚αž—αŸ’αž›αŸ€αž„αž’αŸ’αž›αžΆαž€αŸ‹αž‡αžΆαž‡αžΆαž„αž—αŸ’αž›αŸ€αž„αŸ”
T-2198 Polar Deserts (also seen as "cold deserts") have similar features, except the main form of precipitation is snow rather than rain.
H-2198 -0.1692517250776291 " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " """""""""""""""""""""""""""""."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."
D-2198 -0.1692517250776291 " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " """""""""""""""""""""""""""""."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."
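Not part of the original report, but a quick way to quantify this kind of collapsed output is the distinct-token ratio of each hypothesis. The helper below is a sketch with made-up names, not a fairseq API:

```python
def distinct_token_ratio(hypothesis: str) -> float:
    """Fraction of unique tokens in a whitespace-tokenized hypothesis.

    Values near 0 indicate repetitive, degenerate output like the H-/D-
    lines above; a healthy translation is typically well above 0.5.
    """
    tokens = hypothesis.split()
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# A hypothesis made of one token repeated many times scores close to 0
degenerate = '" ' * 100 + '"."'
print(distinct_token_ratio(degenerate))  # close to 0

# A normal sentence with no repeats scores 1.0
print(distinct_token_ratio("Polar deserts have similar features"))
```

Running this over all H- lines of a decode would make it easy to confirm that every TPU hypothesis is degenerate rather than just the sampled ones.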

The same thing happens after 2000 updates and after 20,000 updates, even though loss and perplexity seem to be going down. After 20,000 updates loss is down to 6.649 and ppl to 21.96 for the validation set, so the model seems to be training:

2022-03-31 03:57:21 | INFO | valid | epoch 034 | valid on 'valid' subset | loss 6.649 | nll_loss 4.457 | ppl 21.96 | wps 170916 | wpb 17521.6 | bsz 475.6 | num_updates 20000 | best_loss 6.649
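As a sanity check on that log line (assuming fairseq's convention of reporting perplexity as 2 raised to the base-2 nll_loss), the reported numbers are internally consistent, so the model really is fitting the data even though decoding is degenerate:

```python
def ppl_from_nll(nll_loss: float) -> float:
    # fairseq reports ppl as 2 ** nll_loss (losses are in base-2 log space)
    return 2.0 ** nll_loss

# nll_loss 4.457 from the validation log line above
print(round(ppl_from_nll(4.457), 2))  # ~21.96, matching the reported ppl
```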

When I train on a GPU (an RTX 3060 Ti) with the same hyperparameters, on the other hand, the results are good (or at least as expected), so this seems to have something to do with the TPU settings (or XLA?).

I have also tried this on another dataset, with similar results, so it's not the dataset. I also tried with and without sentencepiece tokenization, again with similar results.

To Reproduce

Steps to reproduce the behavior (always include the command you ran):

Training:

fairseq-train data-bin/ \
    --source-lang km --target-lang en \
    --arch transformer \
    --encoder-layers 5 --decoder-layers 5 \
    --encoder-embed-dim 512 --decoder-embed-dim 512 \
    --encoder-ffn-embed-dim 2048 --decoder-ffn-embed-dim 2048 \
    --encoder-attention-heads 2 --decoder-attention-heads 2 \
    --encoder-normalize-before --decoder-normalize-before \
    --dropout 0.4 --attention-dropout 0.2 --relu-dropout 0.2 \
    --weight-decay 0.0001 \
    --label-smoothing 0.2 --criterion label_smoothed_cross_entropy \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0 \
    --lr-scheduler inverse_sqrt --warmup-updates 4000 --warmup-init-lr 1e-7 \
    --lr 1e-3 --stop-min-lr 1e-9 \
    --max-tokens 4000 \
    --update-freq 4 \
    --max-update 20000 \
    --tpu --distributed-world-size 8 --num-batch-buckets 8 \
    --task translation --save-interval-updates 4000 --patience 10
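One difference between the two setups that these flags imply (my own observation, not from the report): the effective batch size per optimizer step is max-tokens × update-freq × world size, so the 8-core TPU run takes far larger steps than the single-GPU run unless --update-freq is scaled down. Just arithmetic on the flags above:

```python
max_tokens = 4000    # --max-tokens
update_freq = 4      # --update-freq
tpu_world_size = 8   # --distributed-world-size on TPU
gpu_world_size = 1   # single RTX 3060 Ti

tokens_per_update_tpu = max_tokens * update_freq * tpu_world_size
tokens_per_update_gpu = max_tokens * update_freq * gpu_world_size

print(tokens_per_update_tpu)  # 128000 tokens per step on TPU
print(tokens_per_update_gpu)  # 16000 tokens per step on GPU
```

An 8x larger effective batch with the same warmup and peak learning rate can change training dynamics, though it would not obviously explain fully degenerate output on its own.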

Translate:

fairseq-generate data-bin/ \
    --source-lang km --target-lang en \
    --gen-subset test \
    --path checkpoints/checkpoint_best.pt \
    --beam 5 --lenpen 1.2 \
    --remove-bpe=sentencepiece \
    --tpu --distributed-world-size 8 \
    --sacrebleu
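For anyone scripting over decodes like this: fairseq-generate tags each output line with S- (source), T- (target), H- (scored hypothesis), D- (detokenized hypothesis), or P- (per-token scores), with tab-separated fields. A minimal parser sketch for the S/T/H/D lines (my own helper, not a fairseq API; assumes tab-separated fields as in the output below):

```python
def parse_generate_line(line: str):
    """Split a fairseq-generate output line into its fields.

    S-/T- lines carry just text; H-/D- lines carry a score then the text.
    Returns (tag, sentence_id, text) or (tag, sentence_id, score, text).
    """
    head, *fields = line.rstrip("\n").split("\t")
    tag, sent_id = head.split("-", 1)
    if tag in ("H", "D"):
        return tag, int(sent_id), float(fields[0]), fields[1]
    return tag, int(sent_id), fields[0]

print(parse_generate_line("H-1219\t-0.1666\tsome hypothesis text"))
# ('H', 1219, -0.1666, 'some hypothesis text')
```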

Translation output:

S-1219 αž€αŸ’αžšαŸ…αž–αžΈαž€αžšαžŽαžΈαž‡αžΆαž€αŸ‹αžŸαŸ’αžŠαŸ‚αž„αž“αŸƒαž€αžΆαžšαžŠαžΆαž€αŸ‹αž€αŸ†αžŽαžαŸ‹αžαŸ’αžšαžΆ αž–αž·αž“αŸ’αž‘αž»αžŠαžΆαž…αŸ‹αžαžΆαžαž€αŸαžαŸ’αžšαžΌαžœαž”αžΆαž“αž”αŸ’αžšαžΎαžŸαž˜αŸ’αžšαžΆαž”αŸ‹αž…αŸ†αžŽαžΆαžαŸ‹αžαŸ’αž“αžΆαž€αŸ‹αž“αž·αž„αž‚αž»αžŽαžœαž»αžŒαŸ’αžαž·αžŸαž˜αŸ’αžšαžΆαž”αŸ‹αž€αž˜αŸ’αžšαž·αžαžαŸ’αž–αžŸαŸ‹αž‡αžΆαž„αž€αž˜αŸ’αžšαž·αžαžŠαŸ‚αž›αž”αžΆαž“αž‡αž½αž”αŸ”
T-1219 Besides the obvious instances of setting records, absolute scores are also used for rankings and qualifications for higher level meets.
H-1219 -0.16661687195301056 " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " """"""""""""""""""""".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".
D-1219 -0.16661687195301056 " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " """"""""""""""""""""".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".
P-1219 -5.6896 -0.3897 -0.4505 -0.4703 -0.4755 -0.4734 -0.4636 -0.4534 -0.4559 -0.4770 -0.5122 -0.5480 -0.5656 -0.5548 -0.5311 -0.5176 -0.5248 -0.5521 -0.5879 -0.6155 -0.6275 -0.6268 -0.6145 -0.6044 -0.6063 -0.6180 -0.6381 -0.6599 -0.6741 -0.6730 -0.6504 -0.6151 -0.5775 -0.5471 -0.5252 -0.5156 -0.5183 -0.5269 -0.5375 -0.5524 -0.5667 -0.5753 -0.5676 -0.5514 -0.5416 -0.5457 -0.5598 -0.5686 -0.5647 -0.5443 -0.5199 -0.5021 -0.5059 -0.5323 -0.5649 -0.5852 -0.5915 -0.5899 -0.5889 -0.5860 -0.5728 -0.5497 -0.5285 -0.5176 -0.5272 -0.5605 -0.6027 -0.6264 -0.6164 -0.5932 -0.5724 -0.5685 -0.5866 -0.6122 -0.6207 -0.6124 -0.6014 -0.6165 -0.6604 -0.7131 -0.7463 -0.7279 -0.6754 -0.6261 -0.6049 -0.6134 -0.6357 -0.6447 -0.6234 -0.5825 -0.5459 -0.5317 -0.5450 -0.5694 -0.5911 -0.5971 -0.5899 -0.5766 -0.5685 -0.5636 -0.5652 -0.5773 -0.6008 -0.6287 -0.6571 -0.6723 -0.6623 -0.6404 -3.0687 -0.1793 -0.1941 -0.2095 -0.2273 -0.2465 -0.2647 -0.2795 -0.2911 -0.3019 -0.3156 -0.3305 -0.3463 -0.3602 -0.3717 -0.3855 -0.4036 -0.4217 -0.4304 -0.4263 -4.2744 -0.0810 -0.0823 -0.0816 -0.0790 -0.0764 -0.0761 -0.0800 -0.0874 -0.0950 -0.0992 -0.0987 -0.0952 -0.0917 -0.0902 -0.0890 -0.0863 -0.0820 -0.0792 -0.0811 -0.0873 -0.0934 -0.0941 -0.0893 -0.0833 -0.0786 -0.0780 -0.0813 -0.0862 -0.0900 -0.0910 -0.0886 -0.0856 -0.0858 -0.0897 -0.0956 -0.1003 -0.1031 -0.1043 -0.1076 -0.1130 -0.1207 -0.1316 -0.1440 -0.1520 -0.1496 -0.1364 -0.1193 -0.1070 -0.1043 -0.1127 -0.1287 -0.1462 -0.1559 -0.1533 -0.1407 -0.1264 -0.1156 -0.1109 -0.1107 -0.1099 -0.1086 -0.1108 -0.1190 -0.1266 -0.1278 -0.1228 -0.1148 -0.1085 -0.1071 -0.1129 -7.3513 -1.3642
S-608 αž…αžΌαž›αž‡αž·αžαž–αŸ’αžšαŸ‡αž’αžΆαž‘αž·αžαŸ’αž™αžŠαŸ„αž™αžαŸ’αž“αž„ αž…αžΌαž›αž‡αž·αžαž€αž„αž’αž‚αŸ’αž‚αžΈαžŠαŸ„αž™αžαžΆαž„αž–αŸ„αŸ‡ αž…αžΌαž›αž‡αž·αžαž˜αŸ’αž…αžΆαžŸαŸ‹αžŠαŸ„αž™αž—αŸ„αž‚αž—αžΆαž‚αž…αž“αŸ’αž›αŸ„αŸ‡ αž…αžΌαž›αž‡αž·αžαž†αŸ’αž–αŸ„αŸ‡αž”αžšαž›αŸ„αž€αžŠαŸ„αž™αž₯αžαž˜αŸ„αž αŸ αŸ”
T-608 Approaching the sun with your back, approaching the fire with your belly, approaching the owner with wealth, approaching Nirvana with clarity.
H-608 -0.16297262907028198 " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " """""""""""""""""""""""".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".
D-608 -0.16297262907028198 " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " """""""""""""""""""""""".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".

Expected behavior

I would expect translations that make some sense, like this output from a model trained on a GPU:

S-214252 αž€αŸ’αž”αžΆαž›αž€αŸ’αž”αžΆαž›αž”αŸ†αž–αŸαž‰: αž–αžΈ 2 αžŠαž›αŸ‹ 16 αžŸαž˜αŸ’αžšαžΆαž”αŸ‹αž‡αž˜αŸ’αžšαžΎαžŸαŸ”
T-214252 Filling head count: 2-16nozzles
H-214252 -0.27322813868522644 Filling head: 2 to 16 heads for filling nozles
D-214252 -0.27322813868522644 Filling head: 2 to 16 heads for filling nozles

Environment

Additional context

After trying many different things, it seems like there is a bug somewhere. But then again, you would expect others to be running into the same thing. So if someone else has seen this or knows what may be causing it, any help would be very much appreciated.

gmryu commented 2 years ago

Sorry, I won't be helpful. Just curious: how large is the gap between this "bugged" TPU-trained loss and the GPU-trained loss? If they are actually the same, then what do you get from GPU generation using the TPU-trained model? Also, have you tried the official TPU example? (Well, I don't know if there is one.)