YUCHEN005 / GenTranslate

Code for paper "GenTranslate: Large Language Models are Generative Multilingual Speech and Machine Translators"
Apache License 2.0
225 stars · 6 forks

About ALMA's very bad benchmark results #1

Closed qwopqwop200 closed 9 months ago

qwopqwop200 commented 9 months ago

Thank you for your amazing work.

I have a question because ALMA's results in the paper are very poor. Looking at WMT 20 en -> zh in Table 5, ALMA shows a very low score (11.3), which is strange: ALMA gets a high score (39.3) on WMT 22, and the WMT 20 test set used in this benchmark was even part of ALMA's training data.

This is my personal guess, but it seems that the prompt template intended for ALMA's inference was not used.

qwopqwop200 commented 9 months ago

prompt example: "Translate this from Chinese to English:\nChinese: 我爱机器翻译。\nEnglish:"
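A minimal sketch of how that template can be assembled, following the format in the example above; the helper name `build_alma_prompt` is hypothetical and not part of the ALMA repo:

```python
def build_alma_prompt(src_lang: str, tgt_lang: str, text: str) -> str:
    """Assemble the zero-shot translation prompt in the format shown above.

    Hypothetical helper for illustration; language names are written out
    in English (e.g. "Chinese", "English"), matching the example prompt.
    """
    return (
        f"Translate this from {src_lang} to {tgt_lang}:\n"
        f"{src_lang}: {text}\n"
        f"{tgt_lang}:"
    )

# Reproduces the example prompt from the comment above.
prompt = build_alma_prompt("Chinese", "English", "我爱机器翻译。")
print(prompt)
```

The model is then expected to continue the text after the final `English:` line with the translation.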

YUCHEN005 commented 9 months ago

Hi, I also found it strange when I reproduced the results, but I directly followed their official inference commands and prompt template (the same as in your comment above). I also tried different `max_new_tokens` values, from the default 20 up to 150.

So I am also confused about what is going on here, and I would be glad to discuss any new findings. Thank you.