Unbabel / COMET

A Neural Framework for MT Evaluation
https://unbabel.github.io/COMET/html/index.html
Apache License 2.0

Different versions of the COMET code give different scores with the same model and data. #204

Closed · bhaddow closed this 4 months ago

bhaddow commented 4 months ago

🐛 Bug

Using COMET 2.2.1 (on Python 3.9) and the wmt22-comet-da model I get a score of 0.7982, but using COMET 1.1.2 (on Python 3.7) I get a score of 0.8618. This is with exactly the same source, target and reference files.

I appreciate that 1.1.2 is an old version and should not be used, but many people will still have old versions installed and be unaware that they should not be used with new models. The consequence of this bug is that research papers need to report not only the COMET model used, but also the version of the software.
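For instance, the installed software version can be captured alongside the model name when reporting scores (a minimal sketch; `unbabel-comet` is the PyPI distribution name, and `importlib.metadata` requires Python 3.8+):

```python
# Sketch: record both the COMET software version and the model checkpoint
# when reporting scores, since either one can change the numbers.
from importlib.metadata import version  # Python 3.8+; use the importlib_metadata backport on 3.7

comet_version = version("unbabel-comet")  # PyPI distribution name of the COMET package
model_name = "Unbabel/wmt22-comet-da"
print(f"Scored with COMET {comet_version}, model {model_name}")
```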

To Reproduce

1. Install COMET 2.2.1 on Python 3.9 and score the test files (see the scoring sketch after this list). I used an en->mt translation of NTREX produced with this model: https://huggingface.co/HPLT/mt-mt-en-v1.0-hplt_opus. The src, hypo and ref files are attached.

2. Install COMET 1.1.2 on Python 3.7 and score the same files.

3. Compare the scores.
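For reference, this is roughly how the files are scored under the 2.x Python API (a sketch, not the exact commands I ran; batch size and device settings are illustrative; under 1.1.2 the model is downloaded by the bare name `wmt22-comet-da` and `predict` returns a (segment scores, system score) tuple rather than an object):

```python
# Sketch of corpus-level scoring with the COMET 2.x Python API.
from comet import download_model, load_from_checkpoint

def read_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

src = read_lines("src.txt")
mt = read_lines("hypo.txt")
ref = read_lines("ref.txt")
data = [{"src": s, "mt": m, "ref": r} for s, m, r in zip(src, mt, ref)]

model_path = download_model("Unbabel/wmt22-comet-da")  # 2.x pulls from the Hugging Face hub
model = load_from_checkpoint(model_path)
output = model.predict(data, batch_size=8, gpus=0)  # set gpus=1 to score on a GPU
print(output.system_score)  # corpus-level score: 0.7982 under 2.2.1 vs 0.8618 under 1.1.2
```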

Expected behaviour

With the same model and data, COMET should give the same scores.


Environment

OS: Ubuntu 20.04.6 LTS
COMET versions: 2.2.1 and 1.1.2
Model: wmt22-comet-da

Attachments: hypo.txt, ref.txt, src.txt