Closed george2seven closed 3 years ago
Hello,
Our system is being validated against both the "wmt-large-da-estimator-1719" and "wmt-large-hter-estimator" estimators, using the same translation dataset for both (70k+ translations).
The two estimators give completely opposite results: the "da" estimator places our MT system in "...the bottom 25%", while the "HTER" estimator returns a "top 25%" score.
I know this is not a technical issue, but could you please provide some additional information on how we might interpret these results?
Thank you very much
This issue label is exactly for this type of question! I am happy to help.
What are the scores exactly?
Sometimes, when comparing two systems of similar quality, these two models (wmt-large-da-estimator-1719 and wmt-large-hter-estimator) can disagree about which system is better. Yet, when scoring a single MT system, the scores should point in the same direction...
Please find below the results:
| | wmt-large-da-estimator-1719 | wmt-large-hter-estimator | emnlp-base-da-ranker |
|---|---|---|---|
| Score | -0.21418807 | 0.212977027 | 0.145221945 |
| Translations Count (same MT) | 70544 | 70544 | 70544 |
Thanks for the support!
You are testing the model with 70k translations? Can you compute a Pearson correlation between the wmt-large-da-estimator-1719 and wmt-large-hter-estimator scores?
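In case it helps, a minimal sketch of that computation (assuming the per-segment scores of each model are saved to plain-text files, one score per line and in the same segment order; the file names below are just placeholders):

```python
# Pearson correlation between the per-segment scores of the two estimators.
from scipy.stats import pearsonr

def load_scores(path):
    # One floating-point score per line, same segment order in both files.
    with open(path) as f:
        return [float(line) for line in f if line.strip()]

da_scores = load_scores("da_scores.txt")      # wmt-large-da-estimator-1719 outputs
hter_scores = load_scores("hter_scores.txt")  # wmt-large-hter-estimator outputs

r, p_value = pearsonr(da_scores, hter_scores)
print(f"Pearson r = {r:.4f} (p = {p_value:.3g})")
# Since HTER should go down as DA goes up, a strong *negative* r is what you
# would hope to see here.
```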
Unfortunately, our team doesn't have experience with this type of computation, but I will ask our engineers to have a look.
Ok, the scores make sense! HTER and DA have different scales. HTER is a measure that you want to minimize. It reflects the effort required to "correct" the translation output so that it is semantically equivalent to the reference (a higher HTER reflects more effort).
DA is a continuous scale of "how good is a translation" (a high DA score means that the translation is good).
Both models are telling you that your MT is not good. For a SOTA MT system, you should expect your HTER score to be close to 0, while the DA score should be between 0.6 and 1.
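To make the HTER scale concrete, here is a toy sketch of the idea (word-level edit distance normalized by the reference length; real TER/HTER tooling such as tercom also counts block shifts, which this sketch ignores):

```python
# Toy illustration of HTER: edits needed to turn the MT output into the
# (post-edited) reference, divided by the reference length in words.

def word_edit_distance(hyp, ref):
    h, r = hyp.split(), ref.split()
    # Standard Levenshtein distance over words.
    dist = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        dist[i][0] = i
    for j in range(len(r) + 1):
        dist[0][j] = j
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            cost = 0 if h[i - 1] == r[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + cost) # substitution
    return dist[len(h)][len(r)]

def toy_hter(hyp, ref):
    return word_edit_distance(hyp, ref) / max(len(ref.split()), 1)

print(toy_hter("the cat sat on mat", "the cat sat on the mat"))  # ~0.17 -> little effort
print(toy_hter("cat mat on", "the cat sat on the mat"))          # ~0.67 -> a lot of effort
```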
It's all here: https://unbabel.github.io/COMET/html/models.html
If you want to read more about HTER: Snover et al., 2006
and about DAs: Graham et al., 2013
Thank you very much Ricardo! Makes sense now.