dorianbrown / rank_bm25

A Collection of BM25 Algorithms in Python
Apache License 2.0
1.02k stars 86 forks

Debug BM25Okapi #26

Open LowinLi opened 2 years ago

LowinLi commented 2 years ago

In the "BM25Okapi" method "_calc_idf", if average_idf is negative, eps will be negative too, so the BM25 score can also come out negative. This commit fixes that.
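To make the failure mode concrete, here is a minimal sketch of the behaviour described above. It is a standalone reimplementation for illustration, not a copy of the library source: the function name `okapi_idf_with_floor`, the default `epsilon=0.25`, and the document frequencies are assumptions chosen so that most terms appear in more than half of the corpus.

```python
import math

def okapi_idf_with_floor(doc_freqs, corpus_size, epsilon=0.25):
    """Classic Okapi IDF where negative values are replaced by epsilon * average_idf.

    Illustrative sketch only; names and defaults are assumptions.
    """
    idf = {}
    negative_terms = []
    for term, freq in doc_freqs.items():
        value = math.log(corpus_size - freq + 0.5) - math.log(freq + 0.5)
        idf[term] = value
        if value < 0:
            negative_terms.append(term)

    average_idf = sum(idf.values()) / len(idf)
    eps = epsilon * average_idf  # negative whenever average_idf is negative

    # The "floor" applied to negative IDFs is itself negative in that case.
    for term in negative_terms:
        idf[term] = eps
    return idf, average_idf, eps

# Hypothetical document frequencies for a 10-document corpus.
doc_freqs = {"the": 9, "of": 8, "bm25": 4}
idf, average_idf, eps = okapi_idf_with_floor(doc_freqs, corpus_size=10)
print(average_idf, eps, idf)
```

With these made-up counts, two of the three terms have negative raw IDF, the average is negative, and the replacement value eps inherits that sign.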

dorianbrown commented 5 months ago

I think I finally found where this motivation came from, namely this section from here:


Please note that the IDF formula listed above has a drawback when using it for terms appearing in more than half of the corpus since the value would come out as negative value, resulting in the overall score to become negative. e.g. if we have 10 documents in the corpus, and the term "the" appeared in 6 of them, its IDF would be $\log\left(\frac{10 - 6 + 0.5}{6 + 0.5}\right) = \log\left(\frac{4.5}{6.5}\right)$.
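The arithmetic in the quoted example can be checked directly. This is a small sketch; the helper name `okapi_idf` is hypothetical and the second call uses a made-up document frequency for contrast.

```python
import math

def okapi_idf(corpus_size, doc_freq):
    """Unsmoothed Okapi IDF: log((N - n + 0.5) / (n + 0.5))."""
    return math.log((corpus_size - doc_freq + 0.5) / (doc_freq + 0.5))

# Term in 6 of 10 documents, as in the quoted example: log(4.5 / 6.5) is below zero.
idf_the = okapi_idf(10, 6)

# Term in 2 of 10 documents (assumed value for contrast): log(8.5 / 2.5) is above zero.
idf_rare = okapi_idf(10, 2)

print(idf_the, idf_rare)
```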

Although we can argue that our implementation should have already removed these frequently appearing words as these words are mostly used to form a complete sentence and carry little meaning of note, different softwares/packages still make different adjustments to prevent a negative score from ever occurring. e.g.

dorianbrown commented 5 months ago

I wonder if it might be simpler to just go with the "smoothed" IDF function $\mathrm{IDF}(q_i) = \log\left(1 + \frac{N - N(q_i) + 0.5}{N(q_i) + 0.5}\right)$, which ensures that IDFs are always positive. That way we don't have to do any of this checking for negative values.
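A rough sketch of what that could look like, assuming a standalone helper (the name `smoothed_idf` is hypothetical and not part of the current API). Since the document frequency can never exceed the corpus size, the fraction inside the log is always greater than zero, so the result stays positive even for a term that appears in every document.

```python
import math

def smoothed_idf(corpus_size, doc_freq):
    """Smoothed IDF: log(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (corpus_size - doc_freq + 0.5) / (doc_freq + 0.5))

# Document frequencies 1..10 in a 10-document corpus, including the
# extreme case of a term that appears in every document.
values = [smoothed_idf(10, n) for n in range(1, 11)]
print(values)
```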

What do you think?