Unbabel / COMET

A Neural Framework for MT Evaluation
https://unbabel.github.io/COMET/html/index.html
Apache License 2.0

Different scores from different COMET package versions 1.1.2 and 2.2.1 #203

Open · PinzhenChen opened this issue 4 months ago

PinzhenChen commented 4 months ago

🐛 Bug

When the same source, target, and reference files are evaluated with the same wmt22-comet-da checkpoint, unbabel-comet 2.2.1 under python3.9 and unbabel-comet 1.1.2 under python3.7 give me dramatically different numbers.

To Reproduce

In python3.7, pip install --upgrade unbabel-comet gives 1.1.2 as the latest version, while in python3.9 it gives 2.2.1.

Scoring the same source, target, and reference files under the above two environments gave different scores: unbabel-comet 1.1.2 produces 0.86 while 2.2.1 gives 0.79. I used the wmt22-comet-da checkpoint downloaded from Hugging Face: https://huggingface.co/Unbabel/wmt22-comet-da.

Attaching below the files that gave 0.79 and 0.86, but I think any file combination can be used to reproduce this behaviour since it is tied to the COMET package version rather than the data. target.en.txt source.mt.txt hypothesis.en.txt
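For the record, the scoring call sketched against the 2.x API from the README (under 1.1.2 the shape differs slightly: I point load_from_checkpoint at the manually downloaded checkpoint, and predict() returns a (seg_scores, sys_score) tuple instead of an object):

```python
from comet import download_model, load_from_checkpoint

# 2.x resolves Hugging Face model names; under 1.1.2, point
# load_from_checkpoint at the manually downloaded .ckpt instead.
model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))

# Build src/mt/ref triplets from the attached files.
with open("source.mt.txt") as src, open("hypothesis.en.txt") as hyp, open("target.en.txt") as ref:
    data = [
        {"src": s.strip(), "mt": h.strip(), "ref": r.strip()}
        for s, h, r in zip(src, hyp, ref)
    ]

# In 2.2.1, predict() returns an object with .scores and .system_score.
output = model.predict(data, batch_size=8, gpus=1)
print(output.system_score)  # 0.79 under 2.2.1 vs 0.86 under 1.1.2
```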

Expected behaviour

I would expect different COMET package versions to give the same score if the same checkpoint and files are given.

Environment

python3.7 and python3.9 environments managed with conda.

Additional context

If there is indeed some mismatch between unbabel-comet 1.1.2 and 2.2.1, it might be difficult to go back and fix the problem. Users are probably unaware of this and will not update; moreover, python3.7 only supports 1.1.2 as the latest release even if users upgrade COMET. Maybe this behaviour could be highlighted in the README to encourage users to pin specific Python and unbabel-comet versions. It also implies that research papers should report the COMET package version in addition to the COMET model version. Would it be possible to implement some kind of COMET signature, just like the one in sacrebleu?
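For illustration, such a signature could be assembled from installed package metadata with a few lines of standard library code (comet_signature is a hypothetical helper sketched here, not an existing COMET API):

```python
import platform
from importlib.metadata import version

def comet_signature(checkpoint: str) -> str:
    """Hypothetical sacrebleu-style signature for a COMET run."""
    fields = {
        "c": checkpoint,                     # checkpoint used for scoring
        "comet": version("unbabel-comet"),   # package version (1.1.2 vs 2.2.1 here)
        "python": platform.python_version(),
    }
    return "|".join(f"{k}:{v}" for k, v in fields.items())

# e.g. c:Unbabel/wmt22-comet-da|comet:2.2.1|python:3.9.16
print(comet_signature("Unbabel/wmt22-comet-da"))
```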

BramVanroy commented 4 months ago

This confirms what we learnt for BLEU, too: one should ALWAYS report version numbers (signatures), also for COMET!

Side note: in my MATEO tool, I added a custom signature for neural metrics like bertscore, bleurt, and comet, too. For COMET it looks like this (inspired by sacrebleu):

```
comet: nrefs:1|bs:1000|seed:12345|c:Unbabel/wmt22-comet-da|version:2.0.1|mateo:1.1.3
```

where c stands for the checkpoint used and version is self-explanatory. I wasn't sure how far one has to go with this, because differences in torch, CUDA, and transformers versions may or may not also lead to differences in results. Hell, even then, CUDA optimisations might lead to different results on different hardware.
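If one did want to go that far, torch already exposes most of it; a quick sketch of the extra fields such a signature could record:

```python
import torch
import transformers

# Library versions that can silently shift scores between environments.
print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("cuda:", torch.version.cuda)  # None on CPU-only builds

# Hardware matters too: the same code may score differently per GPU.
if torch.cuda.is_available():
    print("gpu:", torch.cuda.get_device_name(0))
    print("cudnn:", torch.backends.cudnn.version())
```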

PinzhenChen commented 4 months ago

Admittedly the README currently says it requires Python 3.8, so maybe I installed COMET in the stone age and pip install --upgrade unbabel-comet never warned me. Anyway, I think the score mismatch should not be expected.

Your signature is very thoughtful!