Having troubles reproducing results for m2m100 1.2b

dchaplinsky commented 11 months ago

Hello @jorgtied!

I'm trying to reproduce the reported results for eng-ukr language pair for m2m100 on flores200 dataset but the score I get is much lower (26.8->21.0).

My setup is: cTranslate2, this model and HF's evaluate (the code is available here. The dataset is the same (Flores200, devtest).

My main suspects are:

Lower quality of the quantised m2m100 model
Different settings for the text generation (I'm using beams=5)
Different settings for BLEU scorer (ngrams, etc).

I've browsed the repos I found on opus-mt leaderboard and other seemingly relevant repos from Helsinki-NLP account. I also glimpsed through the main paper.

Could you please advise on the following things?

Where I can find generation/evaluation settings/code for the leaderboard?
Is there a file with the individual metrics per sentence pair?
Anything else you might remember or find relevant.

Thanks in advance!

jorgtied commented 11 months ago

I used the native transformers library for decoding the testsets and beam size 1 (if I remember correctly). BLEU scores are computed with sacrebleu and default settings. There are no individual scores per sentence pair.

dchaplinsky commented 10 months ago

Thanks. No source code left for the eval, so I can dig it myself rather than bothering you?

Helsinki-NLP / OPUS-MT-leaderboard

Having troubles reproducing results for m2m100 1.2b #3