Strange behaviour in translation task on TPU

🐛 Bug

I'm training on TPU using the km-en and ps-en datasets from the WMT 2020 shared task on parallel corpus filtering. I am using the hyperparameters from the paper, but the models don't seem to train properly on the TPUs. The translation output is just a sequence of the same symbols repeated multiple times.

Example: S-2198 វាលខ្សាច់នៅតំបន់ប៉ូល (ត្រូវបានគេមើលឃើញថាជា "វាលខ្សាច់ត្រជាក់") មានលក្ខណៈស្រដៀងគ្នាលើកលែងតែភ្លៀងធ្លាក់ជាជាងភ្លៀង។ T-2198 Polar Deserts (also seen as "cold deserts") have similar features, except the main form of precipitation is snow rather than rain. H-2198 -0.1692517250776291 " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " """""""""""""""""""""""""""""."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."." D-2198 -0.1692517250776291 " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " """""""""""""""""""""""""""""."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."."

The same thing happens after 2000 updates and after 20,000 updates, even though loss and perplexity seem to be going down. After 20,000 updates loss is down to 6.649 and ppl to 21.96 for the validation set, so the model seems to be training:

When I train on a GPU (RTX 3060 Ti) using the same hyperparameters, on the other hand, the results are good (or at least what is expected). So this seems to have something to do with the TPU settings (or XLA?)

I have also tried this on another dataset, with similar results. So it's not the dataset. I also tried it with and without sentencepiece tokenization, also with similar results.

To Reproduce

Steps to reproduce the behavior (always include the command you ran):

Training: fairseq-train \ data-bin/ \ --source-lang km --target-lang en \ --arch transformer \ --encoder-layers 5 --decoder-layers 5 \ --encoder-embed-dim 512 --decoder-embed-dim 512 \ --encoder-ffn-embed-dim 2048 --decoder-ffn-embed-dim 2048 \ --encoder-attention-heads 2 --decoder-attention-heads 2 \ --encoder-normalize-before --decoder-normalize-before \ --dropout 0.4 --attention-dropout 0.2 --relu-dropout 0.2 \ --weight-decay 0.0001 \ --label-smoothing 0.2 --criterion label_smoothed_cross_entropy \ --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0 \ --lr-scheduler inverse_sqrt --warmup-updates 4000 --warmup-init-lr 1e-7 \ --lr 1e-3 --stop-min-lr 1e-9 \ --max-tokens 4000 \ --update-freq 4 \ --max-update 20000 --tpu --distributed-world-size 8 --num-batch-buckets 8 \ --task translation --max-update 20000 --save-interval-updates 4000 --patience 10

Translate: fairseq-generate \ data-bin/ \ --source-lang km --target-lang en \ --gen-subset test \ --path checkpoints/checkpoint_best.pt \ --beam 5 --lenpen 1.2 \ --remove-bpe=sentencepiece \ --tpu --distributed-world-size 8 \ --sacrebleu

Translation output: S-1219 ក្រៅពីករណីជាក់ស្ដែងនៃការដាក់កំណត់ត្រា ពិន្ទុដាច់ខាតក៏ត្រូវបានប្រើសម្រាប់ចំណាត់ថ្នាក់និងគុណវុឌ្ឍិសម្រាប់កម្រិតខ្ពស់ជាងកម្រិតដែលបានជួប។ T-1219 Besides the obvious instances of setting records, absolute scores are also used for rankings and qualifications for higher level meets. H-1219 -0.16661687195301056 " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " """"""""""""""""""""".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".". D-1219 -0.16661687195301056 " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " """"""""""""""""""""".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".". P-1219 -5.6896 -0.3897 -0.4505 -0.4703 -0.4755 -0.4734 -0.4636 -0.4534 -0.4559 -0.4770 -0.5122 -0.5480 -0.5656 -0.5548 -0.5311 -0.5176 -0.5248 -0.5521 -0.5879 -0.6155 -0.6275 -0.6268 -0.6145 -0.6044 -0.6063 -0.6180 -0.6381 -0.6599 -0.6741 -0.6730 -0.6504 -0.6151 -0.5775 -0.5471 -0.5252 -0.5156 -0.5183 -0.5269 -0.5375 -0.5524 -0.5667 -0.5753 -0.5676 -0.5514 -0.5416 -0.5457 -0.5598 -0.5686 -0.5647 -0.5443 -0.5199 -0.5021 -0.5059 -0.5323 -0.5649 -0.5852 -0.5915 -0.5899 -0.5889 -0.5860 -0.5728 -0.5497 -0.5285 -0.5176 -0.5272 -0.5605 -0.6027 -0.6264 -0.6164 -0.5932 -0.5724 -0.5685 -0.5866 -0.6122 -0.6207 -0.6124 -0.6014 -0.6165 -0.6604 -0.7131 -0.7463 -0.7279 -0.6754 -0.6261 -0.6049 -0.6134 -0.6357 -0.6447 -0.6234 -0.5825 -0.5459 -0.5317 -0.5450 -0.5694 -0.5911 -0.5971 -0.5899 -0.5766 -0.5685 -0.5636 -0.5652 -0.5773 -0.6008 -0.6287 -0.6571 -0.6723 -0.6623 -0.6404 -3.0687 -0.1793 -0.1941 -0.2095 -0.2273 -0.2465 -0.2647 -0.2795 -0.2911 -0.3019 -0.3156 -0.3305 -0.3463 -0.3602 -0.3717 -0.3855 -0.4036 -0.4217 -0.4304 -0.4263 -4.2744 -0.0810 -0.0823 -0.0816 -0.0790 -0.0764 -0.0761 -0.0800 -0.0874 -0.0950 -0.0992 -0.0987 -0.0952 -0.0917 -0.0902 -0.0890 -0.0863 -0.0820 -0.0792 -0.0811 -0.0873 -0.0934 -0.0941 -0.0893 -0.0833 -0.0786 -0.0780 -0.0813 -0.0862 -0.0900 -0.0910 -0.0886 -0.0856 -0.0858 -0.0897 -0.0956 -0.1003 -0.1031 -0.1043 -0.1076 -0.1130 -0.1207 -0.1316 -0.1440 -0.1520 -0.1496 -0.1364 -0.1193 -0.1070 -0.1043 -0.1127 -0.1287 -0.1462 -0.1559 -0.1533 -0.1407 -0.1264 -0.1156 -0.1109 -0.1107 -0.1099 -0.1086 -0.1108 -0.1190 -0.1266 -0.1278 -0.1228 -0.1148 -0.1085 -0.1071 -0.1129 -7.3513 -1.3642 S-608 ចូលជិតព្រះអាទិត្យដោយខ្នង ចូលជិតកងអគ្គីដោយខាងពោះ ចូលជិតម្ចាស់ដោយភោគភាគចន្លោះ ចូលជិតឆ្ពោះបរលោកដោយឥតមោហ៍ ។ T-608 Approaching the sun with your back, approaching the fire with your belly, approaching the owner with wealth, approaching Nirvana with clarity. H-608 -0.16297262907028198 " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " """"""""""""""""""""""".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".". D-608 -0.16297262907028198 " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " """"""""""""""""""""""".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".".

Expected behavior

I would be expecting to get translations which make some sense, like this trained on a GPU card:

S-214252 ក្បាលក្បាលបំពេញ: ពី 2 ដល់ 16 សម្រាប់ជម្រើស។ T-214252 Filling head count: 2-16nozzles H-214252 -0.27322813868522644 Filling head: 2 to 16 heads for filling nozles D-214252 -0.27322813868522644 Filling head: 2 to 16 heads for filling nozles

Environment

fairseq Version: 1.0.0a0+06c65c8
PyTorch Version: 1.11.0 (and torch-xla 1.11)
OS (e.g., Linux): linux
How you installed fairseq: pip install --editable ./
Build command you used (if compiling from source):
Python version: 3.7.10
GPU models and configuration: TPU v3-8
Any other relevant information: I have tried various things, always with the same results: Running this both on TPU-VM and a TPU node used in a Google Compute Engine VM Instance. I have also tried with and without setting MKL_THREADING_LAYER=GNU and MKL_SERVICE_FORCE_INTEL=1. I have tried updating mkl-service to 2.4.0, and tried runnin this on PyTorch version 1.9.

Additional context

After trying many different things, this seems to like there is a bug somewhere. But then again you would expect others to be running into the same thing. So if this is something someone else has seen or knows what may be causing it, all help would be very much appreciated.

facebookresearch / fairseq