Unbabel / COMET

A Neural Framework for MT Evaluation
https://unbabel.github.io/COMET/html/index.html
Apache License 2.0

Can't reproduce Cometinho model scores #127

Closed: Rexhaif closed this issue 1 year ago

Rexhaif commented 1 year ago

🐛 Bug

The Cometinho paper states that the pruned model achieves a 0.274 Kendall tau score on the news subsection of the WMT21 EN-RU MQM dataset, and the distilled model achieves 0.263. However, a simple reproduction script using the wmt21-cometinho-mqm model fails to show similar results on the same data.

To Reproduce

  1. Prerequisites - download the model and dataset:

```bash
# wmt2021-mqm link mentioned in https://github.com/Unbabel/COMET/tree/master/data
wget https://unbabel-experimental-data-sets.s3.eu-west-1.amazonaws.com/comet/data/2021-mqm.tar.gz
tar xzf 2021-mqm.tar.gz

# cometinho-mqm model from https://github.com/Unbabel/COMET/blob/master/MODELS.md
wget https://unbabel-experimental-models.s3.amazonaws.com/comet/wmt21/wmt21-cometinho-mqm.tar.gz
tar xzf wmt21-cometinho-mqm.tar.gz
```

2. Run the code:
```python
import pandas as pd
import torch
from scipy.stats import kendalltau

from comet import load_from_checkpoint

# "medium" trades float32 matmul precision for speed on the GPU;
# as noted in the Additional Comments, it does not change the scores.
torch.set_float32_matmul_precision("medium")

# Load the downloaded checkpoint.
model = load_from_checkpoint("./wmt21-cometinho-mqm/checkpoints/model.ckpt")

# Keep only the EN-RU news subset of the WMT21 MQM data.
dataset = pd.read_csv("./2021-mqm.csv")
eval_set = dataset.query("lp == 'en-ru' & domain == 'news'")

# COMET expects a list of {src, mt, ref} dicts.
data = eval_set[['src', 'mt', 'ref']].to_dict(orient='records')
scores = eval_set['score']

cometinho_scores = model.predict(data, batch_size=256, gpus=1)

# Segment-level correlation between model predictions and human MQM scores.
tau, p = kendalltau(cometinho_scores['scores'], scores)
print(f"Kendall tau: {tau:.6f}")
```

Expected behaviour

The provided script should print `Kendall tau: 0.263` or a number close to it.

Environment

OS: Ubuntu Linux 20.04 (kernel 5.17.5), inside a Docker container
Hardware: Nvidia RTX 3090
Packaging: pip
Version: 2.0.0

Additional Comments

  1. When testing on the zh-en news subset, the reproduced scores are at least close to the paper's: 0.3549 reproduced vs. 0.321 in the paper.
  2. `torch.set_float32_matmul_precision("medium")` impacts only GPU utilization and processing speed; the scores are the same (a quick check is sketched below).
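Point 2 can be verified directly; a minimal sketch, reusing the `model` and `data` objects from the script above:

```python
import numpy as np

# Score the same data under both matmul-precision settings and confirm
# the outputs match (only throughput should differ between the two).
torch.set_float32_matmul_precision("highest")
scores_highest = model.predict(data, batch_size=256, gpus=1)["scores"]

torch.set_float32_matmul_precision("medium")
scores_medium = model.predict(data, batch_size=256, gpus=1)["scores"]

print(np.allclose(scores_highest, scores_medium, atol=1e-5))
```
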
ricardorei commented 1 year ago

Hi @Rexhaif, I think the problem is with the models you are using. Have you tried these ones?

The confusion is that the "first" versions of Cometinho were actually trained for WMT 21. They are not distilled versions of larger COMET models but rather smaller encoders trained on the same data. Then, for the EAMT conference, we experimented with distilling larger models (which let us train the smaller encoder on much more data); those are the results presented in the COMETINHO paper.

Sorry about the confusing names. I hope that with this you are able to reproduce the results.
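
If the models in question are the EAMT22 releases, they can be fetched through `download_model`; a hedged sketch, assuming the distilled checkpoint is the one published as `eamt22-cometinho-da` (that name is not confirmed in this thread):

```python
from comet import download_model, load_from_checkpoint

# Assumption: "eamt22-cometinho-da" is the EAMT22 distilled checkpoint
# referred to above; swap in the correct name if it differs.
ckpt_path = download_model("eamt22-cometinho-da")
model = load_from_checkpoint(ckpt_path)
```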

Rexhaif commented 1 year ago

Hi, thanks for the clarification and the model names!

Are there any distilled/pruned COMETINHO models available that were trained on the MQM scores rather than the DA scores?

ricardorei commented 1 year ago

Unfortunately no.