Unbabel / COMET

A Neural Framework for MT Evaluation
https://unbabel.github.io/COMET/html/index.html
Apache License 2.0

[QUESTION] Can Different COMET Metrics Give Opposing Results for Same MT System #14

Closed george2seven closed 3 years ago

george2seven commented 3 years ago

Hello,

We are validating our system against both the "wmt-large-da-estimator-1719" and "wmt-large-hter-estimator" estimators, using the same translation dataset in both cases (70k+ translations).

The two estimators give completely opposite results: the "da" estimator places our MT system in "...the bottom 25%", while the "HTER" estimator returns a "top 25%" score.

I know this is not a technical issue, but could you please provide some additional information on how we might interpret these results?

Thank you very much

ricardorei commented 3 years ago

This issue label is exactly for this type of question! I'm happy to help.

What are the scores exactly?

Sometimes, when comparing two systems of similar quality, these two models (wmt-large-da-estimator-1719 and wmt-large-hter-estimator) can disagree about which system is better. Yet, when scoring a single MT system, the scores should point in the same direction...

ricardorei commented 3 years ago

You are testing the model with 70k translations? Can you compute a Pearson correlation between the wmt-large-da-estimator-1719 and wmt-large-hter-estimator scores?
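
For reference, a minimal sketch of that computation, assuming the segment-level scores from each estimator are saved one per line and in the same segment order (the file names below are placeholders):

```python
from scipy.stats import pearsonr  # numpy.corrcoef would also work

# Placeholder files: one segment-level score per line, same segment order.
with open("da_scores.txt") as f:
    da_scores = [float(line) for line in f]
with open("hter_scores.txt") as f:
    hter_scores = [float(line) for line in f]

r, p_value = pearsonr(da_scores, hter_scores)
print(f"Pearson r = {r:.3f} (p = {p_value:.3g})")
# HTER should be minimized and DA maximized, so a strongly negative r
# means the two estimators broadly agree at the segment level.
```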

george2seven commented 3 years ago

> This issue label is exactly for this type of question! I'm happy to help.
>
> What are the scores exactly?
>
> Sometimes, when comparing two systems of similar quality, these two models (wmt-large-da-estimator-1719 and wmt-large-hter-estimator) can disagree about which system is better. Yet, when scoring a single MT system, the scores should point in the same direction...

Please find below the results:

| | wmt-large-da-estimator-1719 | wmt-large-hter-estimator | emnlp-base-da-ranker |
| --- | --- | --- | --- |
| Score | -0.21418807 | 0.212977027 | 0.145221945 |
| Translations count (same MT) | 70544 | 70544 | 70544 |

Thanks for the support!

george2seven commented 3 years ago

> You are testing the model with 70k translations? Can you compute a Pearson correlation between the wmt-large-da-estimator-1719 and wmt-large-hter-estimator scores?

Unfortunately, we don't have experience with this type of computation within our team, but I will ask our engineers to have a look.

ricardorei commented 3 years ago

OK, the scores make sense! HTER and DA scores are on different scales. HTER is a measure that you want to minimize: it reflects the editing effort required to "correct" the translation output so that it is semantically equivalent to the reference (a higher HTER means more effort).

DA is a continuous scale of "how good is a translation" (a high DA score means the translation is good).

Both models are telling you that your MT system is not good. For a SOTA MT system, you should expect the HTER score to be close to 0, while the DA score should be between 0.6 and 1.

It's all here: https://unbabel.github.io/COMET/html/models.html
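
As a quick illustrative sanity check (the exact cut-offs below are assumptions based on the rough ranges above, not official thresholds), a small Python helper could flag whether system-level scores fall into those expected ranges:

```python
def interpret_system_scores(da_score: float, hter_score: float) -> str:
    """Rough interpretation of system-level COMET estimator scores.

    DA estimators: higher is better (strong systems roughly 0.6-1.0).
    HTER estimators: lower is better (strong systems close to 0).
    """
    da_good = da_score >= 0.6         # assumed cut-off from the ranges above
    hter_good = hter_score <= 0.1     # illustrative threshold, not official
    if da_good and hter_good:
        return "both metrics agree: quality looks strong"
    if not da_good and not hter_good:
        return "both metrics agree: quality looks weak"
    return "metrics disagree: inspect segment-level scores"

# Scores reported earlier in this thread:
print(interpret_system_scores(-0.21418807, 0.212977027))
# -> both metrics agree: quality looks weak
```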

ricardorei commented 3 years ago

If you want to read more about HTER: Snover et al., 2006

and about DAs: Graham et al., 2013

george2seven commented 3 years ago

Thank you very much Ricardo! Makes sense now.