Unbabel / COMET

A Neural Framework for MT Evaluation
https://unbabel.github.io/COMET/html/index.html
Apache License 2.0

if tgt is same with src, the score is still high #192

Closed ZeroneBo closed 4 months ago

ZeroneBo commented 6 months ago

Question

My MT system mt_1 sometimes produces bad translations that are identical to the source text, yet it receives a higher score. Another system of mine, mt_2, translates the same source text only partially, yet it receives a lower score.

Is such a COMET score credible? Should I keep the COMET score as-is, or modify it manually (e.g., set mt_1's COMET score to 0)? Thanks for any helpful answers.

Here are two examples:

src: 没有可能
mt_1: 没有可能
mt_2: impossible
ref: not possible

mt_1 gets a COMET score of 92.40; mt_2 gets 90.15.

src: 张迪鸣介绍说,这套传感器与智能手机相连,呼出气体后,健康报告就会显示在手机上。
mt_1: 张迪鸣介绍说,这套传感器与智能手机相连,呼出气体后,健康报告就会显示在手机上。
mt_2: 张 Dubai introduced the introduction, this set of sensors and smart phones are connected, after the call out the gas, health reports will be displayed on the phone.
ref: Zhang Diming introduced, this set of sensor is connected to the smart phone; after you breathe out, the health report will be shown on your phone.

mt_1 gets a COMET score of 71.19; mt_2 gets 69.13.

Then I tried making every target sentence identical to its source (without changing the references) on 1875 sentences, i.e., an MT system that translates nothing at all, and it still gets an average COMET score of 65.72. That seems too high.

Code

I used the COMET model wmt22-comet-da; my script is:

comet-score -s /home/data/tmp.zh -t /home/data/tmp.en -r /home/data/tmp.ref --gpus 1 --quiet --model /home/models/hf/wmt22-comet-da/checkpoints/model.ckpt > /home/data/tmp.en.comet


PinzhenChen commented 4 months ago

I think the COMET model operates on source/output/reference embeddings, and there is no explicit penalty when a hypothesis is in the wrong language. Maybe you could run language identification on the target side and force-set the scores of sentences in the wrong language to a low value.
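A minimal sketch of that post-filter for a zh→en setup, using a crude CJK-character heuristic in place of a real language-identification model. The 0.5 threshold and the floor score are arbitrary choices for illustration, not anything COMET provides:

```python
def looks_untranslated(mt: str, cjk_threshold: float = 0.5) -> bool:
    """Flag a hypothesis that is still mostly CJK when translating zh -> en.

    A crude stand-in for real language identification (e.g. a langid model):
    if more than `cjk_threshold` of the non-space characters fall in the
    CJK Unified Ideographs block, treat the output as off-target.
    """
    chars = [c for c in mt if not c.isspace()]
    if not chars:
        return True
    cjk = sum(1 for c in chars if "\u4e00" <= c <= "\u9fff")
    return cjk / len(chars) > cjk_threshold


def floor_off_target(mts, scores, floor=0.0):
    """Force-set the score of each off-target hypothesis to `floor`."""
    return [
        floor if looks_untranslated(m) else sc
        for m, sc in zip(mts, scores)
    ]
```

For the first example above, mt_1's copied output "没有可能" would be floored while mt_2's "impossible" keeps its original score.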

bhaddow commented 4 months ago

Translations into the wrong language ('off-target translations') are a known problem with LLM-based translation. I would always recommend running a string-based metric (BLEU or chrF) alongside a neural metric (like COMET), since string-based metrics are more sensitive to off-target translations.
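The contrast is easy to see with a character n-gram metric. Below is a simplified chrF-style score (character n-grams up to order 6, combined with F-beta=2, no word n-grams or smoothing); it is only a sketch of the idea behind chrF, so use sacrebleu's CHRF implementation for real evaluation:

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Character n-grams of `text`, with spaces removed."""
    s = text.replace(" ", "")
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def simple_chrf(hyp: str, ref: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF: average char n-gram precision/recall as an F-beta score."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        h, r = char_ngrams(hyp, n), char_ngrams(ref, n)
        overlap = sum((h & r).values())  # clipped n-gram matches
        if h:
            precisions.append(overlap / sum(h.values()))
        if r:
            recalls.append(overlap / sum(r.values()))
    if not precisions or not recalls:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0.0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r) * 100
```

On the first example in the issue, the untranslated hypothesis "没有可能" shares no characters with the reference "not possible" and scores 0, while "impossible" scores well above it — the opposite of the COMET ranking reported above.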