dorianbrown / rank_bm25

A Collection of BM25 Algorithms in Python
Apache License 2.0

Case insensitive BM25 #44

Closed Grecil closed 2 weeks ago

Grecil commented 4 months ago

I am working on a RAG project where I want to use BM25 for hybrid search. The documents in question contain keywords like "ABC101", "XY22", "LM99", etc. I cannot expect users to enter these keywords in the correct case. However, I do want the documents to be stored and displayed in their original case. A naive approach would be to convert the entire documents to lowercase before putting them in the corpus, but that loses the original case, which seems counterintuitive. Another approach would be to lowercase the tokens temporarily during matching. I am not sure whether this has already been done. If you could implement this functionality, it would be very helpful.

axeltorbenson commented 3 weeks ago

Hi Grecil,

As the readme states, this package does not do any text preprocessing.

I'm not sure I understand your problem, but you can always create an ID for each document, keep one entry for the raw text and another for the processed text, and only show the raw text to the end user. But that's outside the scope of this package.
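A minimal sketch of that layout, assuming whitespace tokenization and plain lowercasing as the only preprocessing. The names `DocRecord` and `make_records` are hypothetical and not part of rank_bm25:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocRecord:
    """One document: an ID, the untouched text, and its search tokens."""
    doc_id: int
    raw_text: str   # shown to the end user, original case preserved
    tokens: tuple   # lowercased tokens, used only for matching


def make_records(raw_texts):
    # Assumption: splitting on whitespace is good enough for this corpus.
    return [
        DocRecord(doc_id=i, raw_text=text, tokens=tuple(text.lower().split()))
        for i, text in enumerate(raw_texts)
    ]


records = make_records(["Manual for ABC101", "Notes on XY22"])
# Only the lowercased token lists would be handed to the BM25 index.
tokenized_corpus = [list(r.tokens) for r in records]
```

The processed tokens go into the index, while `raw_text` stays available under the same `doc_id` for display.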

dorianbrown commented 2 weeks ago

If you want to display the set of documents in their unmutated state, I'd recommend storing them in that state in a separate object, and using the index returned by the search algorithm to fetch them from that other data store.
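The index-based lookup could be sketched roughly as follows. `tokenize` and `search_raw` are hypothetical helpers, and the scorer is passed in as a callable so the snippet runs without any third-party package. With rank_bm25, the scorer would presumably be `bm25.get_scores`, as shown in the comment:

```python
def tokenize(text):
    # Assumption: lowercase + whitespace split, applied to documents and queries alike.
    return text.lower().split()


def search_raw(query, raw_docs, score_fn, n=3):
    """Rank by the scores for the lowercased query, then return the original documents."""
    scores = score_fn(tokenize(query))
    # Positions in `scores` line up with positions in `raw_docs`.
    ranked = sorted(range(len(raw_docs)), key=lambda i: scores[i], reverse=True)
    return [raw_docs[i] for i in ranked[:n]]


# With rank_bm25 installed, usage would look something like:
#   from rank_bm25 import BM25Okapi
#   raw_docs = ["Manual for ABC101", "Notes on XY22", "LM99 spec"]
#   bm25 = BM25Okapi([tokenize(d) for d in raw_docs])
#   search_raw("abc101", raw_docs, bm25.get_scores, n=1)
```

Because the same `tokenize` is applied to both corpus and query, matching is case-insensitive, while the caller only ever sees the unmodified strings from `raw_docs`.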