Open Witiko opened 2 years ago
@dzieciou Different implementations of BM25 take different liberties with the original algorithm, including (Py)Terrier. If the rank_bm25 library is to implement the algorithm as it was originally described, then it should treat the query terms as a set, not as a multiset.
However, I am satisfied with the behavior being documented, at least in an open issue on GitHub if not elsewhere.
@dorianbrown In the seminal paper for this package, the Okapi at TREC-3 paper, and most other places, BM25 is defined over query terms rather than tokens, which would indicate that repeated query tokens should not impact the score. However, that does not seem to be the case in the rank-bm25 library:
https://github.com/dorianbrown/rank_bm25/blob/329b794e726fd513eb96d9e28dcf4db8de399ea7/rank_bm25.py#L117
This can be easily solved by the user by passing `set(query)`[^1] rather than `query` to the `get_scores()` method, but it seems like something that the user would expect to happen automatically. At the very least, we may want to document this.

[^1]: Alternatively, `list(dict.fromkeys(query))` for reproducible ordering, since floating point summation is not always associative.
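
To illustrate the workaround, here is a minimal sketch using only the standard library. The `dedupe_query` helper, the `toy_weights` values, and `toy_score` are hypothetical stand-ins made up for this example; `toy_score` only mimics a scorer that adds one contribution per query token and is not the rank_bm25 implementation.

```python
def dedupe_query(query):
    """Drop repeated tokens while keeping first-occurrence order."""
    # dict preserves insertion order, so the result is reproducible,
    # unlike iterating over set(query).
    return list(dict.fromkeys(query))


# Hypothetical per-term contributions for one document (not real IDF values).
toy_weights = {"windy": 0.5, "london": 0.25}


def toy_score(tokens):
    """Toy scorer: sums one contribution per query token, repeats included."""
    return sum(toy_weights.get(token, 0.0) for token in tokens)


query = ["windy", "london", "windy"]
unique_query = dedupe_query(query)

multiset_score = toy_score(query)    # "windy" is counted twice
set_score = toy_score(unique_query)  # each term is counted once

# Why ordering matters: float addition is not associative.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
order_matters = left_to_right != right_to_left
```

With rank_bm25 itself, the same idea would amount to passing `dedupe_query(query)` instead of `query` to `get_scores()`.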