dorianbrown / rank_bm25

A Collection of BM25 Algorithms in Python
Apache License 2.0

Bug while calculating idf for `BM25Okapi` #29

Closed. SpyrosRoum closed this issue 1 year ago

SpyrosRoum commented 1 year ago

Hello, I'm not sure whether I'm misunderstanding something, but when I use the `BM25Okapi` algorithm I sometimes get scores that are all zero even though there are matching documents.

I stepped through the code with a debugger and found that this line, where the idf is calculated, sometimes returns exactly 0, which seems wrong?
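To illustrate how an idf of exactly 0 can come about, here is a minimal sketch. It assumes the idf takes the classic Okapi form `log(N - n + 0.5) - log(n + 0.5)`, where `N` is the number of documents and `n` is the number of documents containing the term. The helper name `okapi_idf` is hypothetical and not part of `rank_bm25`.

```python
import math


def okapi_idf(num_docs: int, doc_freq: int) -> float:
    """Classic Okapi idf (assumed form): log((N - n + 0.5) / (n + 0.5))."""
    return math.log(num_docs - doc_freq + 0.5) - math.log(doc_freq + 0.5)


# With 4 documents, a term found in 1 document has a positive idf,
# a term found in exactly 2 has an idf of 0, and a term found in 3
# has a negative idf.
for doc_freq in (1, 2, 3):
    print(doc_freq, okapi_idf(4, doc_freq))
```

Under this formula, any term that occurs in exactly half of the documents contributes nothing to the score, no matter how often it appears in a given document.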

Here is a minimal reproducible example:

```python
from rank_bm25 import BM25Okapi

def main():
    corpus = [
        "Hello there good man!",
        "It is quite windy in London",
        "It's windy in Athens",
        "How is the weather today?",
    ]

    bm25 = BM25Okapi(corpus, tokenizer=str.split)

    query = ["windy"]

    doc_scores = bm25.get_scores(query)
    matched_docs = bm25.get_top_n(query, corpus, n=2)
    print(doc_scores, matched_docs)

if __name__ == '__main__':
    main()
```

nocoolsandwich commented 1 year ago

The idf is 0. Setting a small default idf such as 0.00000001 fixes this problem.
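A rough sketch of that workaround, written as standalone code instead of a patch to `rank_bm25`. It assumes the classic Okapi idf formula and replaces every non-positive idf with a small floor. The function name `idf_with_floor` and its `floor` parameter are hypothetical, and flooring negative values as well as zero is a simplification made here, not something taken from the library.

```python
import math
from collections import Counter


def idf_with_floor(tokenized_corpus, floor=1e-8):
    """Okapi idf per term, with non-positive values replaced by `floor`."""
    num_docs = len(tokenized_corpus)
    # Count the number of documents each term appears in (not total occurrences).
    doc_freq = Counter(term for doc in tokenized_corpus for term in set(doc))
    idf = {}
    for term, n in doc_freq.items():
        value = math.log(num_docs - n + 0.5) - math.log(n + 0.5)
        # Assumption: treat zero and negative idf the same way and floor both.
        idf[term] = value if value > 0 else floor
    return idf


corpus = [
    "Hello there good man!",
    "It is quite windy in London",
    "It's windy in Athens",
    "How is the weather today?",
]
idf = idf_with_floor([doc.split() for doc in corpus])
print(idf["windy"], idf["Hello"])
```

With the floor in place, "windy" (found in 2 of the 4 documents) gets a tiny positive idf, so documents that contain it should score slightly above documents that do not.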