Unbabel / COMET

A Neural Framework for MT Evaluation
https://unbabel.github.io/COMET/html/index.html
Apache License 2.0
493 stars 76 forks source link

t-test is two-tailed instead of one-tailed #104

Closed erip closed 1 year ago

erip commented 1 year ago

🐛 Bug

The code to perform paired t-test is two-tailed instead of one-tailed. The alternative hypothesis that users typically care about is that the baseline mean score is less than sys1's mean score, but that is not reflected in the test.

To Reproduce

See here :-)

Expected behaviour

The test should probably use alternative="less" or otherwise be configurable.

Screenshots

If applicable, add screenshots to help explain your problem.

Environment

OS: [e.g. iOS, Linux, Win] N/A Packaging [e.g. pip, conda] N/A Version [e.g. 0.5.2.1] N/A

Additional context

ricardorei commented 1 year ago

I am adding a flag for t_test alternative and setting it by default to "less".

ricardorei commented 1 year ago

This will be available on the next release

erip commented 1 year ago

Perhaps for the sake of not breaking the API, setting the default to "two-sided" might be a better option until the next major release? I'd hate to be the cause of people's papers comparing apples and oranges. I think with the ability to configure it, this should cover the issue well.

ricardorei commented 1 year ago

@erip Thats a good point thanks!

The next release will be 2.0 and we are also going to replace the default models with new ones.

We will make it clear that scores (with default settings) won't be directly comparable to the previous version (1.1.3).

There will be backward compatibility but default options will probably change for all 3 commands: comet-score, comet-mbr and comet-compare

ricardorei commented 1 year ago

I agree with you that people typically want to know if baseline mean score is less than sys1's mean score. I think its a good call to change the t_test to less. What you think?

Atm this is only updated in the fix-multigpu branch and not merged into master.

erip commented 1 year ago

It seems very reasonable to me. I'm hoping there's not some nuance that I've overlooked here. I can look at sacrebleu to see what their alternative hypothesis is in their tests (for sake of consistency more than correctness).

ricardorei commented 1 year ago

Perfect! Thanks!

erip commented 1 year ago

Unless I'm misreading their code, it seems like they're testing using a two-sided alternative hypothesis due to the absolute value.

ricardorei commented 1 year ago

@erip I looked a bit more into this and indeed two-sided t_test is more usual and results made more sense in my tests. Nonetheless I am keeping the option to change that in the command line. I am going to merge v2.0 into master.

The release was delayed but at least master will contain the new changes