dorianbrown / rank_bm25

A Collection of BM25 Algorithms in Python
Apache License 2.0

Detect presence instead of frequency #37

Closed kripper closed 2 weeks ago

kripper commented 8 months ago

For our use case (identifying certificate types), we want to retrieve documents that contain certain keywords, regardless of how many times each keyword occurs in a given document. A keyword that repeats many times in a document shouldn't score higher than one that appears only once.

For our use case, the score should be determined by the number of distinct keywords that appear in the text. Each keyword that is present should add its predefined keyword score once.

It is also desirable that keywords can consist of one or more words separated by spaces (e.g. the keyword "certificate of origin" would have a higher predefined score than the keyword "certificate").

Does this implementation support this use case?
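
To make the request concrete, here is a minimal sketch of the presence-based scoring described above. It is not part of rank_bm25 and not the algorithm we eventually wrote. The names `normalize`, `presence_score` and `KEYWORD_WEIGHTS`, as well as the weights themselves, are hypothetical and chosen only for illustration.

```python
import re

def normalize(text: str) -> str:
    """Lowercase the text and join its word tokens with single spaces."""
    return " ".join(re.findall(r"\w+", text.lower()))

def presence_score(text: str, keyword_weights: dict[str, float]) -> float:
    """Sum the weight of every keyword that occurs at least once in text.

    Repeated occurrences of a keyword add nothing. Keywords may span
    several words and are matched on whole-word boundaries.
    """
    # Pad with spaces so that " keyword " only matches whole words.
    padded = f" {normalize(text)} "
    return sum(
        weight
        for keyword, weight in keyword_weights.items()
        if f" {normalize(keyword)} " in padded
    )

# Hypothetical predefined keyword scores (assumed values).
KEYWORD_WEIGHTS = {
    "certificate of origin": 3.0,
    "certificate": 1.0,
    "invoice": 2.0,
}

docs = [
    "Certificate of Origin. This certificate certifies the origin of goods.",
    "Certificate certificate certificate",
    "Commercial invoice",
]

scores = [presence_score(doc, KEYWORD_WEIGHTS) for doc in docs]
print(scores)  # [4.0, 1.0, 2.0]
```

In this sketch a document containing "certificate of origin" also matches the shorter keyword "certificate", so both weights are added. Whether a longer keyword should suppress the shorter ones it contains is a design choice the description above leaves open.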

dorianbrown commented 2 weeks ago

It isn't currently supported, but if you need it I think making the modifications yourself would be fairly simple.

kripper commented 2 weeks ago

> It isn't currently supported, but if you need it I think making the modifications yourself would be fairly simple.

Thanks. We finally implemented our own algorithm.