I have tried to reproduce the FLORES v1 BLEU scores using the `reproduce.sh` script, and I am off by a significant amount.
Table 3 of "The FLORES Evaluation Datasets for Low-Resource Machine Translation: Nepali–English and Sinhala–English" (Guzmán, Chen, et al., 2019) reports the BLEU scores I am comparing against.
After fixing the data and fairseq issues in #40, I ran `floresv1/reproduce.sh` and got the following scores:

- Supervised NE-EN: 5.69
- BT-1 EN-NE: 6.65
- BT-2 NE-EN: 12.83

English-Nepali is within range, but Nepali-English is quite far from the scores presented (2+ BLEU off).
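To make sure the gap is not just a scoring artifact on my side, this is the kind of standalone check I can run on the decoded NE-EN output. It is a generic sacrebleu sketch with placeholder file names (`hyp.detok.txt`, `ref.txt`), not the scoring path used by `reproduce.sh`, which may apply its own tokenization:

```python
# Generic sanity check with sacrebleu; file names are placeholders and the
# repo's own evaluation may tokenize differently, so treat this as a rough
# cross-check rather than the official score.
import sacrebleu

with open("hyp.detok.txt", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]
with open("ref.txt", encoding="utf-8") as f:
    references = [line.strip() for line in f]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(bleu.score)
```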
I ran on a single RTX8000 and only changed `max_tokens` from 4000 to 16000. I did this because `train.py` compensates for having 1 GPU instead of 4 by setting `update_freq=4`, and the RTX8000 has enough memory to accommodate a 16000-token batch directly (see the sketch below).
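For concreteness, here is a minimal sketch (not code from the repo) of the effective-tokens-per-update arithmetic behind that change. It assumes the stock setup is 4 GPUs with `max_tokens=4000` and `update_freq=1`, and that my run ended up with `update_freq=1`; both are assumptions about how `train.py` configures things.

```python
# Minimal sketch of fairseq-style effective batch accounting; the concrete
# numbers are the ones discussed above, and update_freq=1 for my run is an
# assumption about the configuration train.py produced.

def effective_tokens_per_update(max_tokens: int, num_gpus: int, update_freq: int) -> int:
    """Nominal number of tokens contributing to one optimizer step."""
    return max_tokens * num_gpus * update_freq

baseline = effective_tokens_per_update(max_tokens=4000, num_gpus=4, update_freq=1)    # assumed multi-GPU setup
stock_1gpu = effective_tokens_per_update(max_tokens=4000, num_gpus=1, update_freq=4)  # train.py single-GPU fallback
my_run = effective_tokens_per_update(max_tokens=16000, num_gpus=1, update_freq=1)     # my change

# All three nominally see 16000 tokens per optimizer step; in practice padding
# and sentence bucketing mean the real counts differ a little between configs.
assert baseline == stock_1gpu == my_run == 16000
```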
Do you have any advice on possible hyperparameter tuning that might reproduce the initial numbers?