Closed rbawden closed 4 years ago
Yes, this score is too high. The problem is that I used some localisation data for testing in early models: the name "Tatoeba" on those test sets is misleading and a mistake. I am fairly sure those test sets were taken from GNOME, which overlaps with the Ubuntu localisation files included in the training data, hence the inflated scores. There will be similar cases for other language pairs, sorry. The scores are definitely computed after merging subword units, but in this case they may still come from tokenised text scored with multi-bleu. I now always evaluate on detokenised output with sacreBLEU. Or better, the new models use SentencePiece and no tokenisation at all. Hope this explains the situation.
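To see why tokenised multi-bleu scores and detokenised sacreBLEU scores are not directly comparable, here is a minimal toy illustration. This is a bare-bones, unsmoothed BLEU reimplementation written just for this example (it is not the actual multi-bleu or sacreBLEU code, and the sentence pair is invented); the point is only that the same hypothesis/reference pair yields different BLEU depending on how the text was tokenised before scoring:

```python
import math
import re
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(hyp_tokens, ref_tokens, max_n=4):
    """Toy sentence-level BLEU (no smoothing), for illustration only."""
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hyp_tokens, n))
        ref_counts = Counter(ngrams(ref_tokens, n))
        # clipped n-gram matches, as in standard BLEU
        clipped = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = max(len(hyp_tokens) - n + 1, 0)
        if clipped == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    # brevity penalty
    bp = 1.0 if len(hyp_tokens) >= len(ref_tokens) else math.exp(
        1 - len(ref_tokens) / len(hyp_tokens))
    return 100.0 * bp * math.exp(sum(log_precisions) / max_n)

ref = 'The cat sat on the mat.'
hyp = 'The cat sat on a mat.'

# "Detokenised" view: plain whitespace split, punctuation stays attached.
detok_score = bleu(hyp.split(), ref.split())

# "Tokenised" view: punctuation split off, as an aggressive tokeniser would do.
tok = lambda s: re.findall(r"\w+|[^\w\s]", s)
tokenised_score = bleu(tok(hyp), tok(ref))

# The two views give different BLEU for the very same sentence pair.
print(detok_score, tokenised_score)
```

Because the n-gram inventories differ between the two views, the resulting scores differ too, which is why sacreBLEU standardises on detokenised input and reports its own tokenisation in the score signature.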
Ok, thank you for the explanation!
Hello! I noticed that the BLEU score for ta-en is 89.1, which seems a little too high. Could this be a bug? Also, were the BLEU scores calculated on the de-tokenised outputs or the BPE-ed ones in opus-2019-12-05.test.txt?
Thank you in advance!