I have tried to reproduce the FLORES v1 BLEU scores using the `reproduce.sh` script, and I am off by a significant amount.
Table 3 of "The FLORES Evaluation Datasets for Low-Resource Machine Translation: Nepali–English and Sinhala–English" (Guzmán, Chen, et al., 2019) reports the BLEU scores I am comparing against.
After fixing the data and fairseq issues in #40, I ran `floresv1/reproduce.sh` and got the following scores:

- Supervised NE-EN: 5.69
- BT-1 EN-NE: 6.65
- BT-2 NE-EN: 12.83

English-Nepali is within range, but Nepali-English is quite far from the scores presented (2+ BLEU off).
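To make sure the gap is not just a scoring artifact on my side, this is the kind of standalone check I can run on the decoded NE-EN output. It is a generic sacrebleu sketch with placeholder file names (`hyp.detok.txt`, `ref.txt`), not the scoring path used by `reproduce.sh`, which may apply its own tokenization:

```python
# Generic sanity check with sacrebleu; file names are placeholders and the
# repo's own evaluation may tokenize differently, so treat this as a rough
# cross-check rather than the official score.
import sacrebleu

with open("hyp.detok.txt", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]
with open("ref.txt", encoding="utf-8") as f:
    references = [line.strip() for line in f]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(bleu.score)
```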
I ran on a single RTX8000 and only changed `max_tokens` from 4000 to 16000. I did this because `train.py` compensates for having 1 GPU instead of 4 by setting `update_freq=4`, and the RTX8000 has enough memory to accommodate a 16000-token batch directly (see the sketch below).
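For concreteness, here is a minimal sketch (not code from the repo) of the effective-tokens-per-update arithmetic behind that change. It assumes the stock setup is 4 GPUs with `max_tokens=4000` and `update_freq=1`, and that my run ended up with `update_freq=1`; both are assumptions about how `train.py` configures things.

```python
# Minimal sketch of fairseq-style effective batch accounting; the concrete
# numbers are the ones discussed above, and update_freq=1 for my run is an
# assumption about the configuration train.py produced.

def effective_tokens_per_update(max_tokens: int, num_gpus: int, update_freq: int) -> int:
    """Nominal number of tokens contributing to one optimizer step."""
    return max_tokens * num_gpus * update_freq

baseline = effective_tokens_per_update(max_tokens=4000, num_gpus=4, update_freq=1)    # assumed multi-GPU setup
stock_1gpu = effective_tokens_per_update(max_tokens=4000, num_gpus=1, update_freq=4)  # train.py single-GPU fallback
my_run = effective_tokens_per_update(max_tokens=16000, num_gpus=1, update_freq=1)     # my change

# All three nominally see 16000 tokens per optimizer step; in practice padding
# and sentence bucketing mean the real counts differ a little between configs.
assert baseline == stock_1gpu == my_run == 16000
```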
Do you have any advice on possible hyperparameter tuning that might reproduce the initial numbers?