Unbabel / COMET

A Neural Framework for MT Evaluation
https://unbabel.github.io/COMET/html/index.html
Apache License 2.0

Which scale does COMET use? [0,1]? #38

Closed by i55code 2 years ago

i55code commented 2 years ago

Is COMET on a 0-1 scale? Is it normalized to [0,1]? In the uncertainty-aware COMET paper I saw a COMET score greater than 1; how is that possible? Is it normal/frequent?

ricardorei commented 2 years ago

Hi @i55code, most COMET models (the only exception is our ranking model) are trained on z-scores from the WMT shared tasks. Z-scores are unbounded, so COMET predictions can go above 1 and below 0.

Negative scores are actually more common than scores above 1, but the latter can happen when a translation is really good.
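
For intuition, here is a minimal sketch (made-up numbers, not the exact WMT pipeline) of the per-annotator z-normalization that makes the training targets unbounded:

```python
# Hypothetical raw DA scores (0-100) from a single annotator.
raw_scores = [70, 85, 90, 98, 40]

mean = sum(raw_scores) / len(raw_scores)
std = (sum((s - mean) ** 2 for s in raw_scores) / len(raw_scores)) ** 0.5

# Standardize: z = (score - mean) / std. The result is unbounded, so the very
# good translation (98) lands above 1 and the bad one (40) well below 0.
z_scores = [round((s - mean) / std, 2) for s in raw_scores]
print(z_scores)  # [-0.32, 0.41, 0.66, 1.05, -1.79]
```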

i55code commented 2 years ago

Thank you so much for explaining! This helps a lot :) @ricardorei

I also hope to ask two more questions:

  1. Is it possible to use data that has no "src" field? Does using a different "src" have a big impact on COMET? BLEU does not require "src".
  2. The score format is [seg_scores, sys_score]. When we talk about the COMET score, it is the sys_score, right? What do the seg_scores do? In what cases would knowing the seg_scores be helpful?

Thanks!

ricardorei commented 2 years ago
  1. COMET compares the MT with both the source and the reference in a shared embedding space. If the source is completely different, this should drop your score. COMET metrics are really designed with the intent of using the source, and some metrics (the reference-free ones) only require the source! We don't have any model that uses only the reference. Are you trying to use COMET in a particular use case where you don't have access to the source?

  2. The sys_score is the average of the seg_scores. The seg_scores are the quality assessments for each individual hypothesis (MT). They are useful for applications where we are actually interested in comparing two (or more) systems at the segment level (see the sketch below).
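
As a concrete illustration, a minimal sketch of how both outputs show up in the Python API (the checkpoint name and example triplet are placeholders, and the exact call signature may differ between COMET versions):

```python
from comet import download_model, load_from_checkpoint

# Placeholder checkpoint; any reference-based COMET model from the docs works here.
model_path = download_model("wmt20-comet-da")
model = load_from_checkpoint(model_path)

# Each item needs the source ("src"), the hypothesis ("mt"), and the reference ("ref").
data = [
    {
        "src": "Dem Feuer konnte Einhalt geboten werden",
        "mt": "The fire could be stopped",
        "ref": "They were able to stop the fire",
    },
]

# seg_scores holds one quality score per hypothesis; sys_score is their average.
seg_scores, sys_score = model.predict(data, batch_size=8, gpus=1)  # gpus=0 runs on CPU
print(seg_scores, sys_score)
```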

i55code commented 2 years ago

Thank you, @ricardorei, this helps a lot. I am trying to evaluate in a case where the source is not available. In my work, I am interested in knowing how good a translation is given some reference text. If there is future development on source-free metrics, feel free to let me know.

One more question I have is about the list of covered languages: do these refer to the source language? If, say, a language is not covered, how much can people trust the output? I read the "uncertainty-aware" paper; is there a simple way to print out the bounds from COMET's repo?

ricardorei commented 2 years ago

The language list covers both source and target.

Let's imagine you are evaluating translations from Maori into English (or vice versa). Maori is not covered by XLM-R, so the results might be unreliable. This is especially important if the target language is the uncovered one and/or if you are using a reference-free metric.

Regarding the question of how much people can trust the output: this is very hard to answer. In WMT20 we evaluated the model on Inuktitut and the results were good, yet Tom Kocmi from Microsoft tested several COMET models on several languages and reported that they were not stable for languages not covered by XLM-R (#18).

As for printing out the bounds from the "uncertainty-aware" paper: the code used in that paper is based on a previous COMET codebase, and you can find everything here: UA_COMET

In the current codebase, you can only replicate the Monte Carlo dropout results using the --mc_dropout flag. This should give you reasonable bounds without having to run several models.

i55code commented 2 years ago

Thank you so much, @ricardorei. I mainly work with low-resource languages, and COMET sometimes gives surprisingly good results when BLEU is low, so I would be interested in getting good bounds. Thank you, I will try UA_COMET.

Thank you for your explanation, I really appreciate it. Have a good weekend!

ricardorei commented 2 years ago

Thanks, @i55code have a nice weekend!

i55code commented 2 years ago

@ricardorei Hi Ricardo, I have one last question. For --mc_dropout, I saw the output on the command line, but is there a simple way to invoke it through the Python interface? I see that there are model.predict(...) and model.mc_dropout(); what is a simple command in Python to get the bounds? Thanks!

ricardorei commented 2 years ago

It's an argument of the predict function: mean_scores, std_scores, sys_score = model.predict(data, batch_size=8, gpus=1, mc_dropout=30)
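
For context, a sketch that expands this call into a runnable snippet and reports mean +/- one standard deviation per segment as a rough bound (the checkpoint name and data are placeholders, and treating one standard deviation as the bound is just a simple convention, not an official recommendation):

```python
from comet import download_model, load_from_checkpoint

# Placeholder checkpoint and data; reuse whatever model and triplets you already evaluate with.
model = load_from_checkpoint(download_model("wmt20-comet-da"))
data = [
    {
        "src": "Dem Feuer konnte Einhalt geboten werden",
        "mt": "The fire could be stopped",
        "ref": "They were able to stop the fire",
    },
]

# mc_dropout=30 runs 30 stochastic forward passes per segment and returns the
# mean and standard deviation of the sampled scores for each hypothesis.
mean_scores, std_scores, sys_score = model.predict(data, batch_size=8, gpus=1, mc_dropout=30)

for item, mean, std in zip(data, mean_scores, std_scores):
    # Report each segment score with a simple uncertainty band.
    print(f"{mean:.3f} +/- {std:.3f}  {item['mt']}")

print("sys_score:", sys_score)
```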

i55code commented 2 years ago

Thank you, @ricardorei! I really learned a lot today. I hope to use COMET more in my future research. Take good care and stay safe!

ricardorei commented 2 years ago

Thanks, @i55code, feel free to reach out with questions!