Unbabel / COMET

A Neural Framework for MT Evaluation
https://unbabel.github.io/COMET/html/index.html
Apache License 2.0

Add a check if language is supported by model #18

Open kocmitom opened 3 years ago

kocmitom commented 3 years ago

Hello. In my ongoing evaluation of metrics, I have found that COMET (especially the source-based variant) is very unpredictable when it evaluates a language that is not supported by the XLM model. This can easily happen because there is no list of supported languages for COMET on your git page, so users would need to dig into the XLM paper/repository to find out.

Would it be possible to add a check that the language is supported (which would mean asking for the language code when evaluating)?
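
(A check along those lines could be as simple as validating user-supplied language codes against the encoder's coverage before scoring. A minimal sketch, where the language set is only an illustrative subset of XLM-R's coverage and the helper name is hypothetical:)

import warnings

# Illustrative subset of XLM-R's pretraining languages, not the full list.
XLMR_LANGUAGES = {"en", "fr", "de", "cs", "ru", "zh", "tr", "ro", "kk", "fi"}

def check_language_support(lang_code: str) -> bool:
    """Warn when the encoder was not pretrained on the given language."""
    supported = lang_code.lower() in XLMR_LANGUAGES
    if not supported:
        warnings.warn(
            f"Language '{lang_code}' is not covered by the XLM-R encoder; "
            "COMET scores for it may be unreliable."
        )
    return supported

check_language_support("mi")  # Maori -> warns instead of silently scoring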

Here I share my findings with you. Below is a graph for source-based COMET (QE): the Y-axis shows human deltas and the X-axis shows COMET deltas. The green dots around COMET delta 0 (it isn't exactly 0) are language pairs where one of the languages is not supported by XLM; you can see that humans did find a difference for those languages while COMET was chaotic (other metrics don't have this problem).

[image: human deltas (Y-axis) vs. COMET-QE deltas (X-axis) per language pair]

ricardorei commented 3 years ago

Hey Tom, OK, that seems like a good idea. I am curious about the languages you tested. For the shared task we only tested one language that was not covered by XLM-R, Inuktitut, so I am curious about other languages you might have tested.

kocmitom commented 3 years ago

I am not sure if I can share this (there are languages that we are working on but not supporting at this moment). However, the mix includes all languages supported by bing.com/translator ... for example, the bottommost green point in the plot is Maori into English.

Also, when I removed the languages not covered by XLM-R, this vertical line disappeared and COMET behaved as expected.

jlmeunier commented 2 years ago

I have the same issue as kocmitom. I wanted to use COMET on the en-fr pair and got scores all around 0.05, if I remember correctly. Since en-fr has not been in scope at WMT for some time, my guess is that it is not covered. Is this correct? And thanks for contributing COMET! JL

ricardorei commented 2 years ago

Hi @jlmeunier, en-fr has not been used in WMT for a while, but a lot of "out-of-English" pairs are, and French as a target is also extensively evaluated through the de-fr language pair.

That is definitely a language pair I would expect COMET to be good at.

In Tom's paper it seems like COMET has an accuracy of 98.4 for en-fr (Table 9).

Can you tell me more about which model you are using and what you are trying to evaluate? Maybe I can help you figure out the scores...

(Meanwhile, this feature request has been sitting here for a while... I'll try to add a language flag for the next release that at least checks whether the encoder model supports the language.)

jlmeunier commented 2 years ago

Oh, good news then! I was using wmt21-comet-mqm. It gives 0.0464, while BLEU=36.4 and ChrF=62.2; the default comet model, however, gives a score that makes sense: 0.6559. I can share my source/ref/hypothesis if you want (2000 lines each).

kocmitom commented 2 years ago

Hi @jlmeunier, in principle 0.0464 doesn't have to be a bad score. Absolute scores from automatic metrics are meaningless on their own (what does 31 BLEU mean without context? It can be an awesome score for news EN-Finnish or a really bad score for EN-French), and pretrained metrics amplify this by using different scales for different languages and especially different domains. With COMET, for example, we see scores over 100, and for some specific domains and languages we even get negative numbers, but when used in a pairwise setting these numbers can still correctly pinpoint the better system, as confirmed later when we evaluate with humans.
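
(The pairwise use described above amounts to scoring two systems on the same test set and reading only the sign of the delta, never the absolute values. A minimal sketch with made-up segment-level scores:)

# Made-up segment-level scores for two systems on the same test set.
system_a = [0.041, 0.048, 0.052, 0.039, 0.047]
system_b = [0.036, 0.044, 0.049, 0.035, 0.046]

# System-level delta: only its sign is meaningful, not its magnitude.
delta = sum(system_a) / len(system_a) - sum(system_b) / len(system_b)
print(f"system-level delta: {delta:+.4f}")

# Segment-level win rate is another scale-free view of the same comparison.
wins = sum(a > b for a, b in zip(system_a, system_b))
print(f"system A wins {wins}/{len(system_a)} segments")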

jlmeunier commented 2 years ago

Hi @kocmitom, thanks for the comment. It's a good point, but I have trouble understanding why it is so. Especially for DA: it is an absolute score for a given annotator, correct? I would expect the normalised DA scores of very good translations to be near 1.0, so the regression model should also end up at a score near 1.0. Hence my intuition about the scores, at least for the DA-based ones. But I'm not familiar with MQM (nor comfortable with the computation of the "z-score").

ricardorei commented 2 years ago

@jlmeunier Let me try to give an example.

Let's think of annotators A and B for fr-en who are going to annotate the same test set using the same guidelines, but annotator B is much more "demanding".

When both see the sentence "Bonjour le monde!" translated as "Hello world", annotator A assigns a high score (let's say 90) while annotator B assigns a score of 50 (because of the missing punctuation).

After annotating the entire dataset, annotator A has an average score of 70 while annotator B has an average of 50. The data is the same and the scores correlate linearly, but the scales are different because they reflect different opinions on "what is a good translation" (even with the same guidelines!). What the z-score does is normalize the data so that the distribution of scores for each annotator is centered at 0. This way the scores become more comparable.
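
(A minimal sketch of that per-annotator normalization, using toy numbers in the spirit of the example above, with annotator B systematically harsher than annotator A:)

from statistics import mean, stdev

# Toy raw DA scores: annotator B is systematically harsher than annotator A.
raw = {
    "annotator_A": [90, 75, 60, 70, 55],
    "annotator_B": [50, 55, 40, 60, 45],
}

z_scores = {}
for annotator, scores in raw.items():
    mu, sigma = mean(scores), stdev(scores)
    # Center this annotator's distribution at 0 and rescale by its spread.
    z_scores[annotator] = [(s - mu) / sigma for s in scores]

print(z_scores)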

Nonetheless, this is a theoretical example... what happens is that in WMT most annotators do not annotate the same segments, and depending on the language pair and culture annotators have different "demand levels". Even after normalizing the data, the distributions differ!

Because of that, neural metrics learn different scales for different language pairs and the scales are sometimes "unpredictable".

The logic for MQM data is similar, but what we observed is that there are fewer "outliers" and less variation in the MQM data, so most scores are close to 0. As a result, models trained on MQM data do not risk predictions that are far from 0: the range of scores for these models is very narrow... but the rankings are still correct, like Tom was saying.

I hope that this helps!

Cheers, Ricardo

BramVanroy commented 1 year ago

I am stumbling over the same issue. I had naively assumed good performance for all languages that XLM-R was trained on (through transfer learning), but that does not hold.

The following English-Turkish example, for instance, returns a negative score even though there are definitely some similarities between the MT and the reference (so I am told; I am not a Turkish speaker).

from comet import download_model, load_from_checkpoint

def main():
    # Download and load the reference-based WMT21 DA model.
    model_path = download_model("wmt21-comet-da")
    model = load_from_checkpoint(model_path)

    # A single English-Turkish triplet: source, machine translation, reference.
    data = [{
        "src": "It is when they fit in with general assumptions and trends that they take on meaning.",
        "mt": "Genel varsayımlara ve eğilimlere uyduğunda anlama gelirler.",
        "ref": "Olaylar ancak genel varsayımlar veya eğilimler içerisinde bir yere sahip olduklarında bir anlam kazanır."
    }]

    print(model.predict(data, gpus=1))

if __name__ == '__main__':
    main()

Perhaps this page could include more information about the supported languages, i.e. the languages that COMET was actually fine-tuned on?

If this just means going through the results papers and checking which datasets the shared tasks were about, I can try to add that information and put it up as a PR. Would that help?

ricardorei commented 1 year ago

@BramVanroy I am not sure what the score ranges are for Turkish. A negative score does not necessarily mean that the translation is "bad": some languages and some domains seem to have a slightly different score scale than others. Nonetheless, the resulting rankings should be accurate.

Have you looked at the distribution of scores for Turkish?
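
(One way to look at that distribution, reusing the API from the snippet above, is sketched below; the test set here is a placeholder, and the tuple return of predict() assumes the 1.x API used earlier in this thread:)

import numpy as np
from comet import download_model, load_from_checkpoint

model = load_from_checkpoint(download_model("wmt21-comet-da"))

# Placeholder test set; in practice, load the full src/mt/ref lists from files.
data = [
    {"src": "Bonjour le monde !", "mt": "Hello world", "ref": "Hello, world!"},
    {"src": "Merci beaucoup.", "mt": "Thanks a lot.", "ref": "Thank you very much."},
]

# COMET 1.x predict() returns per-segment scores plus a system-level score.
seg_scores, sys_score = model.predict(data, batch_size=8, gpus=0)

print("system score:", sys_score)
print("segment score percentiles:", np.percentile(seg_scores, [5, 25, 50, 75, 95]))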

I am actually working on a new release, v2.0, which will include the latest models from WMT22 (they seem to be better across several languages and domains). I'll definitely add that information to the new documentation.

BramVanroy commented 1 year ago

I've tasked students with using COMET alongside other metrics (BLEU, ChrF, BERTScore, BLEURT) and then discussing the results against their own evaluation for a few given MT outputs, sources, and references. I've had a couple of reports that, for some students, COMET was returning a lot of very high or very low scores (many 0's). I have not had a chance to look into their reports in detail, but the issue is often the same (scores of 100 or 0, or very low/very high). They work with languages such as Turkish, Kazakh, Romanian, and Russian.

It's reassuring to hear that it is normal for the score scales to differ between languages, which may already explain something. I will investigate their final assignments in more detail and get back to you once I know more and whether the problem is widespread.

But for now, am I correct in assuming that, in theory, COMET is intended to be used for all languages in XLM-R (if using an XLM-R-based model)? Or is it only recommended for the languages it was fine-tuned on (from WMT)?

ricardorei commented 1 year ago

@BramVanroy in theory COMET is intended to be used for all languages in XLM-R, and in practice, looking at correlations, it correlates well for multiple languages (even unseen ones).

Let's look at these examples I produced for the upcoming documentation:

[chart: Kendall tau correlation with WMT 2022 SQM annotations for BLEU, COMET-QE-20, COMET20, CometKiwi, and COMET22, per language pair]

Here I am comparing BLEU, COMET-QE-20 (wmt20-comet-qe-da), COMET20 (wmt20-comet-da), CometKiwi (wmt22-cometkiwi-da, a new model for reference-free evaluation resulting from our participation in the QE shared task), and COMET22 (wmt22-comet-da, a new model that will replace the previous default, wmt20-comet-da). I am using the SQM annotations used to evaluate the systems participating in the WMT 2022 shared task.

In the above plot you can observe that for languages such as Ukrainian and Croatian (languages that COMET was never trained on) the Kendall tau correlation is quite good and on par with en-cs, for which we had training data.

As a sanity check you can also look at annotations performed in the WMT 2021 shared task for low-resource language pairs (Bengali-Hindi, Hindi-Bengali, Xhosa-Zulu, Zulu-Xhosa):

[chart: Kendall tau correlation for the WMT 2021 low-resource language pairs (Bengali-Hindi, Hindi-Bengali, Xhosa-Zulu, Zulu-Xhosa)]

Correlations for these languages are lower but generally positive.

Nonetheless, the score ranges, like I was saying, are hard to interpret, and that's why looking at absolute scores can be misleading. We can definitely tell that one system is performing better than another, but it's hard to define buckets of quality.

Side note: all models that we will release for the new version will output scores between 0 and 1. This will not solve all issues, but it will help interpretability from a user's perspective. In general, the interpretability of COMET scores is something we are actively working on and want to improve.

BramVanroy commented 1 year ago

Hello Ricardo, apologies for getting back to you so late! Thanks a lot for the information. I assume the y-axis is the Kendall tau correlation between the metrics and the human annotations? I'm a bit confused about why the correlations are so low for the first language pair (English-German; en-de); that seems like one of the "easiest" pairs, or am I mistaken? Also, if these are correlations on the y-axis, some are indeed quite low (<0.2), in which case significance levels may also be helpful.

We're looking forward to using the new models!

ricardorei commented 1 year ago

Hi @BramVanroy, yes, exactly: the y-axis is the Kendall tau between metrics and human annotations. The correlations for some high-resource language pairs can be a bit misleading, because you will have a lot of high-quality translations, evaluation becomes much more nuanced, and even humans tend to have low levels of agreement. You can also see that for en-zh.
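
(For reference, that segment-level correlation can be computed directly; a sketch with made-up metric and human scores. The WMT Metrics task uses its own tau variants, so this is only illustrative:)

from scipy.stats import kendalltau

# Made-up segment-level scores from a metric and from human annotators.
metric_scores = [0.71, 0.64, 0.80, 0.55, 0.77, 0.60]
human_scores = [82, 75, 90, 60, 88, 58]

tau, p_value = kendalltau(metric_scores, human_scores)
print(f"Kendall tau = {tau:.3f} (p = {p_value:.3f})")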

You can look at the full results here. The first tab has correlations with MQM annotations (performed by Google for zh-en and en-de, and by Unbabel for en-ru), the second has SQM correlations (annotations performed by Microsoft for the 2022 General MT task), and the third has DAs (a mix of data from 2021 and 2022).

The tabs are ordered according to "annotation quality". For MQM and SQM I used statistical testing for all results, and everything in bold is statistically significant. The statistical tests are the same ones used in the 2022 WMT Metrics task.

BramVanroy commented 1 year ago

Thanks for sharing that data sheet @ricardorei. Very interesting and useful!

The correlations for some high-resource language pairs can be a bit misleading, because you will have a lot of high-quality translations, evaluation becomes much more nuanced, and even humans tend to have low levels of agreement.

While this "makes sense", it is quite fascinating and leads to almost philosophical questions about how convincing MT evaluation can be in a few years time for high-resource languages, if even humans cannot agree well in such scenarios.