Closed st-vincent1 closed 1 year ago
You can use the comet-compare command for that.
It compares two or more systems and reports whether the differences between them are statistically significant.
Here is the example from the README:
comet-compare -s src.de -t hyp1.en hyp2.en hyp3.en -r ref.en
🚀 Feature
COMET could support statistical significance tests, such as paired bootstrap resampling, for comparing systems.
Motivation
Statistical significance tests, e.g. paired bootstrap resampling, are encouraged as standard practice in research. Currently they are readily available for calculated metrics (BLEU, chrF, etc.) via sacrebleu. However, sacrebleu's implementation does not support custom metrics.
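For readers unfamiliar with the technique, here is a minimal sketch of paired bootstrap resampling over segment-level scores. It is not COMET's or sacrebleu's implementation. The function name `paired_bootstrap`, its parameters, and the choice to aggregate by the mean of segment scores are all assumptions made for illustration. It only presumes you already have one score per segment for each system, from whatever metric you use.

```python
import random


def paired_bootstrap(scores_a, scores_b, num_samples=1000, seed=0):
    """Hypothetical helper: compare two systems by paired bootstrap resampling.

    scores_a / scores_b are per-segment metric scores for the same test set,
    in the same order. Returns the fraction of resampled test sets on which
    each system has the higher mean score.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same segments")
    if not scores_a:
        raise ValueError("need at least one segment")

    rng = random.Random(seed)  # fixed seed so results are reproducible
    n = len(scores_a)
    wins_a = wins_b = ties = 0

    for _ in range(num_samples):
        # Draw n segment indices with replacement; using the same indices
        # for both systems is what makes the test "paired".
        idx = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        if mean_a > mean_b:
            wins_a += 1
        elif mean_b > mean_a:
            wins_b += 1
        else:
            ties += 1

    return {
        "a_wins": wins_a / num_samples,
        "b_wins": wins_b / num_samples,
        "ties": ties / num_samples,
    }


if __name__ == "__main__":
    # Toy scores, made up purely to show the call.
    system_a = [0.71, 0.64, 0.80, 0.55, 0.69]
    system_b = [0.70, 0.60, 0.82, 0.50, 0.66]
    print(paired_bootstrap(system_a, system_b))
```

A common reading is that system A is significantly better than system B at the 95% level if it wins on at least 95% of the resampled sets. The threshold and the number of resamples are choices left to the user.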