AIPHES / emnlp19-moverscore

MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance
MIT License

Cannot reproduce your results #1

Closed: george-philipp closed this issue 4 years ago

george-philipp commented 4 years ago

Hey there,

Thank you for putting up this repo. I quickly ran your method, word mover's distance with unigrams, on the WMT17 de-en language pair, and the Pearson correlation is only 0.645, considerably worse than what you report in the paper. Could you double-check the code release?

Also, it takes me 8 minutes to run on these 560 sentences. Is this expected, or am I doing something wrong?

andyweizhao commented 4 years ago

Thank you for your interest. This is a preliminary web service that includes the main implementation. To reproduce the numbers in the paper, additional steps are required (they need slight code changes, which I will add soon):

  1. use the BERT model fine-tuned on MNLI instead of the original version.
  2. simply remove the subword tokens that contain "##" in the unigram setting, because suffix pieces like "ing" and "ed" carry little of the core meaning of words like "watching" and "watched" (see the sketch after this list).
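
A minimal sketch of step 2, assuming WordPiece-style tokens; the helper name and data layout here are illustrative, not the repo's actual code:

```python
def drop_continuation_pieces(tokens, embeddings):
    """Keep only word-initial pieces; "##"-prefixed continuations are dropped."""
    kept = [(t, e) for t, e in zip(tokens, embeddings) if not t.startswith("##")]
    return [t for t, _ in kept], [e for _, e in kept]

# BERT tokenizes "watching" as ["watch", "##ing"]; after filtering,
# only "watch" (and its embedding) contributes to the unigram WMD.
tokens, embeddings = drop_continuation_pieces(["watch", "##ing", "tv"],
                                              [[0.1], [0.2], [0.3]])
```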

Due to time constraints, the current version of the web service supports CPU environments only, but more features will be released in the next update.

george-philipp commented 4 years ago

Hi @andyweizhao, thank you for your swift reply.

I understand that the code base is not using the MNLI model. However, the correlation I computed is still worse than the one shown in the BERT+PMEANS row.

By the way, do you apply this trick (removing subwords) in all of the studies in your paper? For example, do you also use it for HMD + BERT in Table 5?

andyweizhao commented 4 years ago

Hi George,

I forgot one additional step: TF-IDF weights are required. I will try to fix these issues this week.

When combining BERT-MNLI, TF-IDF weighting, and subword removal, you will see numbers similar to the ones below from my server (wmd-unigram): de-en {'pearson': 0.7082533292728657}
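
For the TF-IDF part, here is a minimal sketch of corpus-level IDF weighting, assuming a generic tokenizer; the smoothing formula and helper name are assumptions, not necessarily the repo's exact implementation:

```python
from collections import defaultdict
from math import log

def build_idf_dict(corpus, tokenize):
    """Map each token to a smoothed IDF: log((N + 1) / (df + 1))."""
    df = defaultdict(int)
    for sent in corpus:
        for tok in set(tokenize(sent)):  # count each token once per sentence
            df[tok] += 1
    n = len(corpus)
    return defaultdict(lambda: log(n + 1),  # unseen tokens get the max weight
                       {tok: log((n + 1) / (c + 1)) for tok, c in df.items()})

idf_ref = build_idf_dict(["the cat sat", "the dog ran"], str.split)
```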

I used this trick in all tasks and for most language pairs, except "fi-en" and "lv-en".

andyweizhao commented 4 years ago

I just updated the repo to support reproducibility on MT. I will close this issue; please create new ones if you have additional questions.
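
For reference, a minimal usage sketch against the updated repo, following the README-style API (get_idf_dict, word_mover_score); the exact argument names and defaults are assumptions and may differ slightly between releases:

```python
from moverscore import get_idf_dict, word_mover_score

references = ["The cat sat on the mat ."]
translations = ["A cat was sitting on the mat ."]

# IDF weights are computed separately over references and hypotheses.
idf_dict_ref = get_idf_dict(references)
idf_dict_hyp = get_idf_dict(translations)

# Unigram MoverScore (WMD-1) with subword removal, as discussed above.
scores = word_mover_score(references, translations, idf_dict_ref, idf_dict_hyp,
                          stop_words=[], n_gram=1, remove_subwords=True)
print(scores)  # one score per reference/hypothesis pair
```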

george-philipp commented 4 years ago

Wow, thank you for making this happen. This is very helpful.

I tried to run the code, but it seems some of the files are missing, namely the translation data. Could you be so kind as to upload them as well?

andyweizhao commented 4 years ago

Sure thing. I just uploaded them.

Alex-Fabbri commented 4 years ago

Hi, thanks for the great work! Following up on reproducing results: when I run examples/run_MT.py with v1 of moverscore, I am able to reproduce the "WMD-1+BERTMNLI+PMeans" results from the README, but when I run v2 I get results different from "WMD-2+BERTMNLI+PMeans":

cs-en pearson: 0.67
de-en pearson: 0.66
ru-en pearson: 0.71
tr-en pearson: 0.73
zh-en pearson: 0.70

I'm attaching the result of running "pip freeze > requirements.txt": requirements.txt

Do you have any ideas on the cause of the difference?

Thank you!

andyweizhao commented 4 years ago

Hi Alex,

For reproducing the results, moverscore_v1 is all you need: set the parameter "n_gram" to 1 for WMD-1 and to 2 for WMD-2. However, that version runs very slowly. I made a lighter version, moverscore_v2, to speed things up: it uses DistilBERT instead of BERT, makes the code more efficient, and removes WMD-2. Performance drops a little as a result, but it still correlates well with human judgments. Choose between the two versions according to your needs :)
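
A compact sketch of that choice; the module names follow the repo (moverscore for v1, moverscore_v2 for the DistilBERT version), but signatures may vary across releases:

```python
from moverscore import get_idf_dict, word_mover_score as wms_v1
from moverscore_v2 import word_mover_score as wms_v2

references = ["The cat sat on the mat ."]
translations = ["A cat was sitting on the mat ."]
idf_ref, idf_hyp = get_idf_dict(references), get_idf_dict(translations)

# Paper-faithful but slow: BERT-MNLI; n_gram=2 gives WMD-2.
scores_wmd2 = wms_v1(references, translations, idf_ref, idf_hyp,
                     stop_words=[], n_gram=2, remove_subwords=True)

# Faster DistilBERT variant; WMD-2 was removed, so use n_gram=1.
scores_fast = wms_v2(references, translations, idf_ref, idf_hyp,
                     stop_words=[], n_gram=1, remove_subwords=True)
```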

Alex-Fabbri commented 4 years ago

That makes sense. Thanks a lot for the clarification!