steinst opened this issue 2 years ago
Sorry, I won't be helpful. Just curious: how large is the gap between the bugged TPU-trained loss and the GPU-trained loss? If they are actually the same, then what do you get from GPU generation using the TPU-trained model? Also, have you tried the official TPU example? (Well, I don't know if there is one.)
🐛 Bug
I'm training on TPU using the km-en and ps-en datasets from the WMT 2020 shared task on parallel corpus filtering. I am using the hyperparameters from the paper, but the models don't seem to train properly on the TPUs. The translation output is just a sequence of the same symbols repeated multiple times.
Example:

```
S-2198	[Khmer source sentence; mis-encoded in this paste]
T-2198	Polar Deserts (also seen as "cold deserts") have similar features, except the main form of precipitation is snow rather than rain.
H-2198	-0.1692517250776291	" " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " """""""""""""""""""""""""""""."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."
D-2198	-0.1692517250776291	" " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " """""""""""""""""""""""""""""."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."
```
The same thing happens after 2,000 updates and after 20,000 updates, even though loss and perplexity seem to be going down. After 20,000 updates, loss is down to 6.649 and ppl to 21.96 on the validation set, so the model does seem to be training:
```
2022-03-31 03:57:21 | INFO | valid | epoch 034 | valid on 'valid' subset | loss 6.649 | nll_loss 4.457 | ppl 21.96 | wps 170916 | wpb 17521.6 | bsz 475.6 | num_updates 20000 | best_loss 6.649
```
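As a sanity check on the log above: fairseq reports `nll_loss` in base 2, so the printed perplexity should equal `2 ** nll_loss`. A minimal sketch (not part of the original report) confirming the logged numbers are internally consistent, i.e. the model really is optimizing the objective and the failure is specific to decoding:

```python
import math

# Values taken from the validation log line above.
nll_loss = 4.457        # per-token negative log-likelihood, in bits
reported_ppl = 21.96

# fairseq computes ppl as 2 ** nll_loss.
ppl = 2 ** nll_loss
print(round(ppl, 2))    # ≈ 21.96, matching the log
assert math.isclose(ppl, reported_ppl, rel_tol=1e-3)
```

So the loss curve is plausible; the degenerate output is not explained by the training metrics themselves.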
When I train on a GPU (RTX 3060 Ti) using the same hyperparameters, on the other hand, the results are good (or at least what I would expect). So this seems to be related to the TPU settings (or XLA?).
I have also tried this on another dataset, with similar results. So it's not the dataset. I also tried it with and without sentencepiece tokenization, also with similar results.
To Reproduce
Steps to reproduce the behavior (always include the command you ran):
Training (note: the original command passed `--max-update 20000` twice; one copy is kept here):

```shell
fairseq-train \
    data-bin/ \
    --source-lang km --target-lang en \
    --arch transformer \
    --encoder-layers 5 --decoder-layers 5 \
    --encoder-embed-dim 512 --decoder-embed-dim 512 \
    --encoder-ffn-embed-dim 2048 --decoder-ffn-embed-dim 2048 \
    --encoder-attention-heads 2 --decoder-attention-heads 2 \
    --encoder-normalize-before --decoder-normalize-before \
    --dropout 0.4 --attention-dropout 0.2 --relu-dropout 0.2 \
    --weight-decay 0.0001 \
    --label-smoothing 0.2 --criterion label_smoothed_cross_entropy \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0 \
    --lr-scheduler inverse_sqrt --warmup-updates 4000 --warmup-init-lr 1e-7 \
    --lr 1e-3 --stop-min-lr 1e-9 \
    --max-tokens 4000 \
    --update-freq 4 \
    --tpu --distributed-world-size 8 --num-batch-buckets 8 \
    --task translation \
    --max-update 20000 --save-interval-updates 4000 --patience 10
```
Translate:

```shell
fairseq-generate \
    data-bin/ \
    --source-lang km --target-lang en \
    --gen-subset test \
    --path checkpoints/checkpoint_best.pt \
    --beam 5 --lenpen 1.2 \
    --remove-bpe=sentencepiece \
    --tpu --distributed-world-size 8 \
    --sacrebleu
```
Translation output:

```
S-1219	[Khmer source sentence; mis-encoded in this paste]
T-1219	Besides the obvious instances of setting records, absolute scores are also used for rankings and qualifications for higher level meets.
H-1219	-0.16661687195301056	" " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " """"""""""""""""""""".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".
D-1219	-0.16661687195301056	" " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " """"""""""""""""""""".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".
P-1219	-5.6896 -0.3897 -0.4505 -0.4703 -0.4755 -0.4734 -0.4636 -0.4534 -0.4559 -0.4770 -0.5122 -0.5480 -0.5656 -0.5548 -0.5311 -0.5176 -0.5248 -0.5521 -0.5879 -0.6155 -0.6275 -0.6268 -0.6145 -0.6044 -0.6063 -0.6180 -0.6381 -0.6599 -0.6741 -0.6730 -0.6504 -0.6151 -0.5775 -0.5471 -0.5252 -0.5156 -0.5183 -0.5269 -0.5375 -0.5524 -0.5667 -0.5753 -0.5676 -0.5514 -0.5416 -0.5457 -0.5598 -0.5686 -0.5647 -0.5443 -0.5199 -0.5021 -0.5059 -0.5323 -0.5649 -0.5852 -0.5915 -0.5899 -0.5889 -0.5860 -0.5728 -0.5497 -0.5285 -0.5176 -0.5272 -0.5605 -0.6027 -0.6264 -0.6164 -0.5932 -0.5724 -0.5685 -0.5866 -0.6122 -0.6207 -0.6124 -0.6014 -0.6165 -0.6604 -0.7131 -0.7463 -0.7279 -0.6754 -0.6261 -0.6049 -0.6134 -0.6357 -0.6447 -0.6234 -0.5825 -0.5459 -0.5317 -0.5450 -0.5694 -0.5911 -0.5971 -0.5899 -0.5766 -0.5685 -0.5636 -0.5652 -0.5773 -0.6008 -0.6287 -0.6571 -0.6723 -0.6623 -0.6404 -3.0687 -0.1793 -0.1941 -0.2095 -0.2273 -0.2465 -0.2647 -0.2795 -0.2911 -0.3019 -0.3156 -0.3305 -0.3463 -0.3602 -0.3717 -0.3855 -0.4036 -0.4217 -0.4304 -0.4263 -4.2744 -0.0810 -0.0823 -0.0816 -0.0790 -0.0764 -0.0761 -0.0800 -0.0874 -0.0950 -0.0992 -0.0987 -0.0952 -0.0917 -0.0902 -0.0890 -0.0863 -0.0820 -0.0792 -0.0811 -0.0873 -0.0934 -0.0941 -0.0893 -0.0833 -0.0786 -0.0780 -0.0813 -0.0862 -0.0900 -0.0910 -0.0886 -0.0856 -0.0858 -0.0897 -0.0956 -0.1003 -0.1031 -0.1043 -0.1076 -0.1130 -0.1207 -0.1316 -0.1440 -0.1520 -0.1496 -0.1364 -0.1193 -0.1070 -0.1043 -0.1127 -0.1287 -0.1462 -0.1559 -0.1533 -0.1407 -0.1264 -0.1156 -0.1109 -0.1107 -0.1099 -0.1086 -0.1108 -0.1190 -0.1266 -0.1278 -0.1228 -0.1148 -0.1085 -0.1071 -0.1129 -7.3513 -1.3642
S-608	[Khmer source sentence; mis-encoded in this paste]
T-608	Approaching the sun with your back, approaching the fire with your belly, approaching the owner with wealth, approaching Nirvana with clarity.
H-608	-0.16297262907028198	" " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " """""""""""""""""""""""."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."
D-608	-0.16297262907028198	" " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " """""""""""""""""""""""."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."
```
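To quantify how widespread the collapse is, one can scan the generate log and flag hypotheses dominated by a single repeated token. A minimal sketch (the function names and the 80% threshold are my own choices, not from the report; `fairseq-generate` separates the `H-*` fields with tabs, although the paste above renders them with spaces):

```python
from collections import Counter

def is_degenerate(hypothesis: str, threshold: float = 0.8) -> bool:
    """Flag a hypothesis whose single most frequent token dominates it,
    as in the repeated-quote translations above."""
    tokens = hypothesis.split()
    if not tokens:
        return True
    top_count = Counter(tokens).most_common(1)[0][1]
    return top_count / len(tokens) >= threshold

def degenerate_fraction(generate_log_lines) -> float:
    """Fraction of H-* records in a fairseq-generate log that look degenerate."""
    hyps = []
    for line in generate_log_lines:
        if line.startswith("H-"):
            # Record layout: H-<id> \t <score> \t <detokenized hypothesis>
            parts = line.split("\t", 2)
            hyps.append(parts[2] if len(parts) == 3 else "")
    if not hyps:
        return 0.0
    return sum(is_degenerate(h) for h in hyps) / len(hyps)
```

Run over a log like the one above, this would flag essentially every hypothesis from the TPU-trained model, while the GPU-trained model's output should come out clean.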
Expected behavior
I would expect translations that make some sense, like this output from a model trained on a GPU:

```
S-214252	[Khmer source sentence; mis-encoded in this paste]
T-214252	Filling head count: 2-16nozzles
H-214252	-0.27322813868522644	Filling head: 2 to 16 heads for filling nozles
D-214252	-0.27322813868522644	Filling head: 2 to 16 heads for filling nozles
```
Environment
Additional context
After trying many different things, it seems like there is a bug somewhere. But then again, you would expect others to be running into the same thing. So if someone else has seen this, or knows what may be causing it, any help would be very much appreciated.