AmenRa / ranx

⚡️A Blazing-Fast Python Library for Ranking Evaluation, Comparison, and Fusion 🐍
https://amenra.github.io/ranx
MIT License
427 stars, 23 forks

question: why student rather than fisher stat test? #37

Closed by PaulLerner 1 year ago

PaulLerner commented 1 year ago

Hi,

Just a quick question: I was wondering what motivated changing the default stat test from fisher to student in https://github.com/AmenRa/ranx/commit/0dc8d9cf6b76ece0c8bccad267e2ce4f226127c3 (I almost published my results as is before figuring it out :sweat_smile:).

I thought your documentation pointed me to this paper by Smucker et al., which suggests using Fisher's test (and especially not Student's), but maybe I don't recall correctly.

Btw, the docstring still shows 'fisher' as the default: https://amenra.github.io/ranx/compare/

AmenRa commented 1 year ago

Hey, I got a pull request for fixing the docs today :D

Student's and Fisher's tests very often agree (always?), but Student's is immensely faster to compute. That's why I changed the default stat test.

The paper you mentioned says the opposite. From the introduction:

"Student's t, bootstrap, and randomization tests largely agree with each other. Researchers using any of these three tests are likely to draw the same conclusions regarding statistical significance of their results."

There's also a nice table in the paper showing that. [screenshot of the paper's agreement table]
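For anyone curious about the speed gap: here is a minimal sketch (not ranx's actual implementation, and the scores below are made up) of a two-sided Fisher randomization test on paired per-query scores. It has to recompute the statistic for every permutation, whereas Student's t needs a single pass over the differences.

```python
import random
import statistics

def fisher_randomization_test(scores_a, scores_b, n_permutations=10_000, seed=42):
    """Two-sided Fisher randomization (permutation) test on paired
    per-query scores. Under the null hypothesis the two systems are
    exchangeable, so the sign of each per-query difference can be
    flipped at random; the p-value is the fraction of permutations
    whose absolute mean difference is at least as extreme as the
    observed one."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(statistics.mean(diffs))
    extreme = 0
    for _ in range(n_permutations):
        permuted = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(statistics.mean(permuted)) >= observed:
            extreme += 1
    return extreme / n_permutations

# Hypothetical per-query NDCG scores for two systems on 8 queries.
run_a = [0.61, 0.72, 0.55, 0.80, 0.66, 0.59, 0.74, 0.68]
run_b = [0.58, 0.70, 0.54, 0.77, 0.65, 0.60, 0.71, 0.66]
print(fisher_randomization_test(run_a, run_b))
```

Each of the 10,000 iterations walks the whole list of differences, which is why the randomization test scales so much worse than a closed-form t-test.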

PaulLerner commented 1 year ago

That's the 'practical' conclusion from their experiments on TREC data. They have further discussion in §5.2 and, in the end, recommend using Fisher's randomization test.

Thank you for your quick answer (as always :))

AmenRa commented 1 year ago

Sorry, I completely forgot about that section.

In my experiments on datasets very different from those used in the paper, Student's and Fisher's tests always agreed (comparing mean values).

I see the point about tiny test sets. However, I question the validity of tiny test sets. Are they representative of the general user behavior? What's the confidence that a model outperforming another on a small set of queries will work better on different ones? Is a small set of queries representative of the actual population of queries in the real world?

Obviously, we could be interested in testing a specific niche of queries depending on the use case.

What do you think about it?

PaulLerner commented 1 year ago

That makes sense, though I am not so familiar with statistical significance testing. In any case, I think small test sets will rarely yield low p-values anyway. The opposite problem, namely that huge datasets get low p-values very easily, is discussed in §4.4 of https://www.morganclaypool.com/doi/abs/10.2200/S00994ED1V01Y202002HLT045
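That large-n effect is easy to reproduce with a toy simulation (all numbers below are made up; the p-value uses a paired t-test with a normal approximation to the t-distribution, which is reasonable at these sample sizes): the same tiny 0.001 score gap that is typically undetectable on 50 queries becomes highly "significant" on 100,000 queries.

```python
import random
import statistics
from statistics import NormalDist

def paired_t_pvalue(scores_a, scores_b):
    """Two-sided paired t-test; the t-distribution is approximated
    by a standard normal, fine for large numbers of queries."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    t = statistics.mean(diffs) / (statistics.stdev(diffs) / n ** 0.5)
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))

def simulate(n_queries, seed=0):
    # System B's per-query score is system A's score minus a tiny
    # 0.001 gap, plus independent Gaussian noise with std 0.01.
    rng = random.Random(seed)
    a = [rng.uniform(0.4, 0.8) for _ in range(n_queries)]
    b = [x - 0.001 + rng.gauss(0.0, 0.01) for x in a]
    return paired_t_pvalue(a, b)

print(simulate(50))        # p-value on 50 queries
print(simulate(100_000))   # p-value on 100,000 queries
```

With n = 100,000 the standard error of the mean difference shrinks to about 0.01 / √100,000 ≈ 3e-5, so even a 0.001 gap sits dozens of standard errors from zero.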