Is COMET on a 0-1 scale? Is it normalized to [0, 1]? In the uncertainty-aware COMET paper I saw a COMET score that is bigger than 1. How is that possible? Is it normal/frequent?
Hi @i55code, most COMET models (the only exception is our ranking model) are trained with z-scores from the WMT shared tasks. Z-scores are unbounded, so COMET predictions can go above 1 and below 0.
Negative scores are actually more common than scores above 1, but the latter does happen sometimes when a translation is really good.
Thank you so much for explaining! This helps a lot :) @ricardorei
I also hope to ask two more questions:
1. Does COMET need the source text, or can it be used with only a reference? What happens to the score if the source is completely different?
2. What is the difference between the sys_score and the seg_scores in the output?
Thanks!
COMET compares the MT with both the source and the reference in a shared embedding space. If the source is completely different, this should ideally lower your score. COMET metrics are really designed with the intent of using the source, and some metrics (the reference-free ones) only require the source! We don't have any model that uses only the reference. Are you trying to use COMET in a particular use case where you don't have access to the source?
The sys_score is the average of the seg_scores. The seg_scores are basically the quality assessments for each individual hypothesis (MT). They can be useful for applications where you are actually interested in comparing two (or more) systems at the segment level.
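For context, here is a minimal sketch of how this looks through the Python API discussed in this thread; the checkpoint name and the example segments are only illustrative assumptions:

```python
from comet import download_model, load_from_checkpoint

# "wmt20-comet-da" is used here as an example checkpoint name; reference-free
# (QE) models work the same way but only need "src" and "mt" in each sample.
model_path = download_model("wmt20-comet-da")
model = load_from_checkpoint(model_path)

data = [
    {
        "src": "Dem Feuer konnte Einhalt geboten werden",
        "mt": "The fire could be stopped",
        "ref": "They were able to control the fire.",
    },
    {
        "src": "Schulen und Kindergärten wurden eröffnet.",
        "mt": "Schools and kindergartens were open",
        "ref": "Schools and kindergartens opened.",
    },
]

# seg_scores holds one quality score per hypothesis;
# sys_score is simply their average.
seg_scores, sys_score = model.predict(data, batch_size=8, gpus=1)
print(seg_scores)  # unbounded: values below 0 or above 1 are possible
print(sys_score)
```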
Thank you, @ricardorei, this helps a lot. I am trying to evaluate in a case where the source is not available. In my work, I am interested in knowing how good a translation is given some reference text. If there is future development on source-free metrics, feel free to let me know.
One more question I have is about the list of covered languages: do these languages refer to the source language? If, say, a language is not covered, how much can people trust the output? I read the "uncertainty-aware" paper; is there a simple way to print out the bounds from COMET's repo?
The language list is basically source and target.
Let's imagine you are evaluating translations from Maori into English (or vice versa). Maori is not covered in XLM-R, and the results might be unreliable. This is especially important if the target language is the uncovered one and/or if you are using a reference-free metric.
Regarding the question "how much can people trust the output?": this is very hard to answer. In WMT20 we evaluated the model on Inuktitut and the results were good, yet Tom Kocmi from Microsoft tested several COMET models on several languages and reported that they were not stable for languages not covered in XLM-R (#18).
Regarding "I read the 'uncertainty-aware' paper, is there a simple way to print out the bounds from COMET's repo?": the code used in that paper is based on a previous COMET codebase, and you can find everything here: UA_COMET
In this current codebase, you can only replicate the Monte Carlo dropout results using the --mc_dropout flag. This should give you reasonable bounds without having to run several models.
Thank you so much, @ricardorei. I mainly work with low-resource languages, and COMET sometimes gives surprisingly good results when BLEU is low. So I would be interested in getting good bounds. Thank you, I will try UA_COMET.
Thank you for your explanation, I really appreciate it and have a good weekend!
Thanks, @i55code have a nice weekend!
@ricardorei Hi Ricardo, I have one last question. For --mc_dropout, I saw the output on the command line, but is there a simple way to invoke it through the Python interface? I see that there are model.predict(...) and model.mc_dropout(); what is a simple command in Python to get the bounds? Thanks!
It's an argument of the predict function:
mean_scores, std_scores, sys_score = model.predict(data, batch_size=8, gpus=1, mc_dropout=30)
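For anyone landing here later, a slightly fuller sketch built around that call; the checkpoint name and the placeholder data are assumptions on my side, only the predict signature comes from the line above:

```python
from comet import download_model, load_from_checkpoint

# "wmt20-comet-da" is only an example checkpoint name.
model = load_from_checkpoint(download_model("wmt20-comet-da"))

# Replace with your own segments.
data = [{"src": "...", "mt": "...", "ref": "..."}]

# mc_dropout=30 runs 30 stochastic forward passes per segment:
# mean_scores are the averaged predictions, std_scores the per-segment
# standard deviations that can serve as rough uncertainty bounds.
mean_scores, std_scores, sys_score = model.predict(
    data, batch_size=8, gpus=1, mc_dropout=30
)
for mean, std in zip(mean_scores, std_scores):
    print(f"{mean:.3f} +/- {std:.3f}")
```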
Thank you, @ricardorei! I really learned a lot today. I hope to use COMET more in my future research. Take good care and stay safe!
Thanks, @i55code, feel free to reach out with questions!