IDF in BM25Okapi is not version from Atire

msalwen commented 1 year ago

In Trotman, Jia & Crane the idf measure is given as log(N/df_t) (see top of page 5), where N is the corpus size and df_t is the number of docs containing term t. This is always non-negative. In the implementation of BM25Okapi, you have used an earlier version of the idf computation (Robertson-Spark Jones, see eq (4) bottom of page 2 in Trotman, Puurula & Burgess), which can become negative and which you handle by setting negative values to eps.

The balance of the score calculation in BM25Okapi follows Atire exactly, and the implementations for BM25L and BM25+ align perfectly with the descriptions in Trotman, Puurula & Burgess. I wonder why the idf implementation for BM25Okapi deviates from the Atire specification.

dorianbrown commented 1 month ago

Sorry for the late reply, but this project has a been a little dormant for me.

When trying to address some issues, I started looking into references of the implementation, and I've run into many different variations of the Okapi/ATIRE algorithm. When implementing this one, I'm not sure why I ended up following the Robertson-Spark Jones formulation, but do I understand correctly that log(N/df_t) implementation is the preferred one in this case?

I'm also wondering if it makes sense to include the mentioned variations in some way (i.e. rsj_idf = True for Robertson-Spark Jones), but for that it would be useful to get some feedback from you guys.

dorianbrown commented 1 month ago

Some of the confusion also came from the wikipedia page citing the Robertson-Spark Jones IDF: https://en.wikipedia.org/wiki/Okapi_BM25

dorianbrown / rank_bm25

IDF in BM25Okapi is not version from Atire #35