Unbabel / COMET

A Neural Framework for MT Evaluation
https://unbabel.github.io/COMET/html/index.html
Apache License 2.0

v1.x and v2.x have different scores for wmt20-comet-qe-da model #143

Closed: thammegowda closed this issue 1 year ago

thammegowda commented 1 year ago

🐛 Bug

Scores do not match between v1.x and v2.x for wmt20-comet-qe-da model.

To Reproduce

Assuming you have (mini)conda on your system:

# create conda envs
conda create --yes -n cometv1 python=3.10
conda create --yes -n cometv2 python=3.10

# install unbabel-comet
conda run -n cometv1 --live-stream pip install unbabel-comet==1.1.3
conda run -n cometv2 --live-stream pip install unbabel-comet==2.0.1

# get some test data
sacrebleu -t wmt21/systems -l en-zh --echo src > src.txt
sacrebleu -t wmt21/systems -l en-zh --echo Online-A > mt.txt

# score with each version; tail -1 keeps the final system-level score
conda run -n cometv1 --live-stream comet-score --model wmt20-comet-qe-da -s src.txt -t mt.txt | tail -1
conda run -n cometv2 --live-stream comet-score --model Unbabel/wmt20-comet-qe-da -s src.txt -t mt.txt | tail -1

v1 gives 0.2254; v2 gives 0.2046.

Additional Info:

v2 gives negative segment-level scores, while v1 has none:

conda run -n cometv2 --live-stream comet-score --model Unbabel/wmt20-comet-qe-da -s src.txt -t mt.txt | awk '$NF < 0'
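
The same per-segment scores should also be reachable through the v2 Python API, which makes them easier to inspect programmatically. Below is a minimal sketch assuming the v2.0.x interface (download_model, load_from_checkpoint, predict) and the src.txt/mt.txt files from above; the attribute names on the returned prediction object are per the v2.0 docs and worth double-checking against your installed version.

# Minimal sketch: list negative segment scores via the v2 Python API.
# Assumes unbabel-comet 2.x and the src.txt / mt.txt files created above.
from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/wmt20-comet-qe-da")
model = load_from_checkpoint(model_path)

# QE models only need source and hypothesis, no reference
with open("src.txt") as f_src, open("mt.txt") as f_mt:
    data = [{"src": s.strip(), "mt": t.strip()} for s, t in zip(f_src, f_mt)]

out = model.predict(data, batch_size=8, gpus=0)  # gpus=0 runs on CPU
for i, score in enumerate(out.scores):
    if score < 0:
        print(i, score)
print("system score:", out.system_score)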

Expected behaviour

v2.x should produce the same scores as v1.x.

Environment

OS: Linux
Packaging: pip, conda
Version: v1.1.3, v2.0.1

ricardorei commented 1 year ago

Hi @thammegowda, yes: the wmt20-comet-qe-da from v2.0 is equivalent to the wmt20-comet-qe-da-v2 from v1.0.

This is because we found an error in the training of the initial wmt20-comet-qe-da: that model was trained with a sigmoid output while regressing on a score that was not normalized between 0 and 1. This did not affect correlations, but the scores were weird and often close to 0. We fixed it by retraining the model and releasing a replacement.
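
To make the mismatch concrete, here is a small illustrative sketch (the target values are made up, not from the model): a sigmoid head is bounded to (0, 1), so when the regression targets are z-normalized DA scores, every negative target can only be approached from above, which squeezes predictions toward 0.

# Illustrative only: why a sigmoid head cannot fit z-scored targets.
# Targets at or below 0 all collapse to outputs near 0 under MSE.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

targets = np.array([-1.3, -0.4, 0.0, 0.5, 1.2])  # made-up z-scored DA values
logits = np.linspace(-8.0, 8.0, 1601)
outputs = sigmoid(logits)

for t in targets:
    best = outputs[np.argmin((outputs - t) ** 2)]
    print(f"target {t:+.1f} -> closest sigmoid output {best:.4f}")
# Every target <= 0 maps to an output of ~0, matching the old model's
# scores bunching up near zero; targets above 1 are clipped toward 1.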

Cheers, Ricardo

thammegowda commented 1 year ago

Hi @ricardorei Thanks for the clarification.