question: Comparing models with multiple runs

t-mesq commented 2 years ago

First of all, great work on this code. I have been looking for a definitive package to evaluate ranking models and I believe this is that package.

My question is perhaps a bit out of the domain, but it could help others in the future. How would you deal with comparing 2 models where each has multiple runs (e.g., runs with different random initialization and/or batch shuffling, for confidence intervals)? I was thinking that perhaps the significance testing could be performed between the mean (across runs) metric_scores vectors.
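To make the idea concrete, here is a rough sketch in plain Python. It does not use the ranx API. It assumes the per-query scores of every run are already available as lists aligned by query. The helper names (`mean_over_runs`, `paired_randomization_test`) and the choice of a sign-flip randomization test are illustrative assumptions, not something ranx provides under these names.

```python
import random
from statistics import mean


def mean_over_runs(per_run_scores):
    """Average per-query scores across runs.

    per_run_scores: list of runs, each a list of per-query metric scores,
    all aligned on the same query order (an assumption of this sketch).
    """
    return [mean(query_scores) for query_scores in zip(*per_run_scores)]


def paired_randomization_test(a, b, n_permutations=2000, seed=0):
    """Two-sided paired sign-flip randomization test on per-query scores."""
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(mean(diffs))
    hits = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the sign of each paired difference is arbitrary.
        permuted = mean(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(permuted) >= observed:
            hits += 1
    # Add-one smoothing keeps the estimated p-value strictly above zero.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical per-query scores: 3 runs (seeds) per model, 4 queries each.
model_a_runs = [[0.61, 0.40, 0.75, 0.52], [0.63, 0.38, 0.77, 0.50], [0.59, 0.42, 0.73, 0.54]]
model_b_runs = [[0.55, 0.41, 0.70, 0.45], [0.57, 0.39, 0.68, 0.47], [0.53, 0.43, 0.72, 0.43]]

p_value = paired_randomization_test(mean_over_runs(model_a_runs), mean_over_runs(model_b_runs))
```

Note that averaging first hides the run-to-run variance, so this only tests the difference between the "average" models.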

Thanks in advance,

Tiago

AmenRa commented 2 years ago

Hi and thanks for the kind words! :)

That's a tricky question. I actually don't have a scientific answer, but your idea sounds reasonable to me.

Right now, ranx does not support this out of the box, but it can probably be achieved using the "low level" functions you find in the source code.

An alternative could be to consider every variation of your model separately, run the comparison and then do something like "3 out of 4 model variations significantly improved over the baseline(s) so we can say that, generally, the model is better".

To be honest, it would be safer to say that only if all your model variations improved over the baseline(s)...
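A minimal sketch of that bookkeeping, in plain Python. It assumes each variation has already been compared against the baseline, giving a mean improvement and a p-value. The function name `summarize_variations`, the input layout, and the 0.05 threshold are assumptions for illustration only.

```python
def summarize_variations(results, alpha=0.05):
    """Count how many model variations significantly beat the baseline.

    results: list of (mean_improvement, p_value) tuples, one per variation,
    each obtained from a separate comparison against the baseline.
    """
    wins = sum(1 for diff, p in results if diff > 0 and p < alpha)
    total = len(results)
    return {
        "wins": wins,
        "total": total,
        # The stricter criterion: every single variation must improve.
        "all_improved": total > 0 and wins == total,
    }


# Hypothetical outcomes for 4 variations of the same model.
summary = summarize_variations([(0.03, 0.01), (0.02, 0.04), (0.01, 0.20), (0.04, 0.001)])
```

Here `summary["wins"]` is 3 out of 4, and `summary["all_improved"]` is `False` because one variation did not reach significance.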

You should talk with someone more experienced than me to find a scientific answer to your problem.

If you do, please, keep me posted!

Elias

t-mesq commented 2 years ago

Thank you for the quick response!

Surprisingly, I haven't been able to find any discussions on this subject specifically, but I have no doubt that they must exist :). Your alternative is definitely the safest approach, but testing all combinations for a large number of models and runs can become pretty computationally expensive.

If I come across the scientific answer to this I will definitely update you.

Best regards,

Tiago

AmenRa commented 2 years ago

Closing. Feel free to re-open if needed.

AmenRa / ranx

question: Comparing models with multiple runs #6