Unbabel / COMET

A Neural Framework for MT Evaluation
https://unbabel.github.io/COMET/html/index.html
Apache License 2.0

Bootstrap resampling #95

Closed: st-vincent1 closed this issue 1 year ago

st-vincent1 commented 1 year ago

🚀 Feature

COMET could support statistical significance tests, such as paired bootstrap resampling, for comparing systems.

Motivation

Statistical significance tests, e.g. paired bootstrap resampling, are encouraged as standard practice in research. Currently they are easily available for string-based metrics (BLEU, chrF, etc.) via sacrebleu. However, sacrebleu's implementation does not support custom metrics.

ricardorei commented 1 year ago

You can use the comet-compare command for that.

It compares two or more systems and reports whether the differences between them are statistically significant.

ricardorei commented 1 year ago

Leaving the example from the README here:

comet-compare -s src.de -t hyp1.en hyp2.en hyp3.en -r ref.en
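
For readers who want the idea behind the request, here is a minimal sketch of paired bootstrap resampling over segment-level scores. It is not comet-compare's implementation: the function name paired_bootstrap, its parameters, and the returned keys are hypothetical, and it assumes you already have one score per segment for each system (from any metric), aligned on the same test set.

```python
import random
from typing import Dict, Sequence


def paired_bootstrap(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    num_samples: int = 1000,
    seed: int = 12345,
) -> Dict[str, float]:
    """Paired bootstrap resampling over aligned segment-level scores.

    Hypothetical helper for illustration only. Returns the fraction of
    resampled test sets on which each system has the higher mean score.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("Both systems must be scored on the same segments.")
    if not scores_a:
        raise ValueError("At least one segment is required.")

    rng = random.Random(seed)  # fixed seed so runs are reproducible
    n = len(scores_a)
    wins_a = wins_b = ties = 0

    for _ in range(num_samples):
        # Draw n segment indices with replacement; the SAME indices are used
        # for both systems, which is what makes the test "paired".
        idx = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        if mean_a > mean_b:
            wins_a += 1
        elif mean_b > mean_a:
            wins_b += 1
        else:
            ties += 1

    return {
        "a_wins": wins_a / num_samples,
        "b_wins": wins_b / num_samples,
        "ties": ties / num_samples,
    }


if __name__ == "__main__":
    # Toy scores, invented purely to show the call shape.
    system_a = [0.71, 0.64, 0.80, 0.55, 0.69]
    system_b = [0.70, 0.66, 0.74, 0.57, 0.61]
    print(paired_bootstrap(system_a, system_b))
```

A common reading is that system A is significantly better than system B at the 95% level if it wins on at least 95% of the resampled test sets. The threshold and the number of resamples are choices for the experimenter, not values taken from COMET.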