dorianbrown / rank_bm25

A Collection of BM25 Algorithms in Python
Apache License 2.0

Remove extraneous `q_freq` factor from BM25L score #20

Closed. Witiko closed this 2 years ago

Witiko commented 2 years ago

In the paper, the tf_td factor in the BM25L scoring formula is contained within c_td:

[Image: BM25 scoring formula]
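For readers without access to the image, this is how the BM25L formula is commonly written in the literature. It is a reconstruction for reference, not a copy of the image. The symbols N (number of documents), df_t (document frequency of term t), L_d (length of document d), and L_avg (average document length) are notation assumed here and do not appear in the comment itself.

```latex
\mathrm{rsv}_q = \sum_{t \in q}
  \log\!\left(\frac{N + 1}{df_t + 0.5}\right)
  \cdot
  \frac{(k_1 + 1)\,(c_{td} + \delta)}{k_1 + c_{td} + \delta},
\qquad
c_{td} = \frac{tf_{td}}{1 - b + b \cdot \dfrac{L_d}{L_{avg}}}
```

Note that tf_td enters the score only through c_td; it does not appear as a separate factor in the numerator.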

However, in the rank-bm25 library, the tf_td factor is used in addition to c_td, leading to tf_td · (k_1 + 1) · (c_td + δ) in the numerator. This PR fixes that.
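The effect of the extra factor can be illustrated with a small, self-contained sketch. This is not the library's code: the function name `bm25l_term_score`, the `extra_tf_factor` flag, and the parameter values are all hypothetical, chosen only to contrast the two numerators for a single query term and a single document.

```python
def bm25l_term_score(tf, doc_len, avgdl, idf,
                     k1=1.5, b=0.75, delta=0.5,
                     extra_tf_factor=False):
    """Score contribution of one query term for one document (BM25L sketch).

    All names and default values here are illustrative assumptions.
    """
    # c_td: term frequency normalised by relative document length
    ctd = tf / (1 - b + b * doc_len / avgdl)
    # Numerator (k1 + 1) * (c_td + delta), as described for the paper
    score = idf * (k1 + 1) * (ctd + delta) / (k1 + ctd + delta)
    # extra_tf_factor=True multiplies by tf again, which is the behaviour
    # the PR description reports and removes
    return tf * score if extra_tf_factor else score


# Hypothetical example: term occurs 3 times in a document of average length
paper_form = bm25l_term_score(tf=3, doc_len=100, avgdl=100, idf=1.0)
with_extra = bm25l_term_score(tf=3, doc_len=100, avgdl=100, idf=1.0,
                              extra_tf_factor=True)
print(paper_form, with_extra)
```

Without the extra factor, the fraction (c_td + δ) / (k_1 + c_td + δ) stays below 1, so the term's contribution is bounded by idf · (k_1 + 1) and saturates as tf grows. With the extra factor, the contribution keeps growing roughly linearly in tf, which defeats the saturation that the formula is designed to provide.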

dorianbrown commented 2 years ago

Good catch here. After looking over the math in the paper, I'd say you're completely right. I'll merge this in and see if I can set up a release soon with these fixes in place.

Thanks a bunch for taking the time to submit a PR!