dorianbrown / rank_bm25

A Collection of BM25 Algorithms in Python
Apache License 2.0

The passed-in tokenizer is not being used in the get_scores method. #38

Open Alkacid opened 6 months ago

Alkacid commented 6 months ago

I noticed that a custom tokenizer can be passed in at initialization to tokenize the input documents, but it is not used to tokenize the query in the get_scores method, so the caller has to tokenize the query manually. Would it be possible to add the following at the beginning of the get_scores method:

if self.tokenizer:
    query = self.tokenizer(query)

dorianbrown commented 2 weeks ago

I do like the idea for this change, but am a little worried about making backward-incompatible changes considering how many people seem to be using the package.

I'd like to leave this issue open and see if there's more support for this.
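
One way to limit the backward-compatibility risk might be to tokenize only raw strings and pass already-tokenized queries through untouched. The following is a minimal sketch, assuming queries arrive either as a `str` or as a list of tokens; `prepare_query` and `simple_tokenizer` are hypothetical names for illustration, not part of rank_bm25:

```python
def prepare_query(query, tokenizer=None):
    """Return a token list, tokenizing only when given a raw string."""
    # Hypothetical helper: callers that already pass a token list are left
    # alone, so their results would not change.
    if tokenizer is not None and isinstance(query, str):
        return tokenizer(query)
    return query


def simple_tokenizer(text):
    # Example tokenizer (an assumption): lowercase and split on whitespace.
    return text.lower().split()


print(prepare_query("Hello World", simple_tokenizer))       # ['hello', 'world']
print(prepare_query(["hello", "world"], simple_tokenizer))  # ['hello', 'world']
```

Without the `isinstance` check, a string-based tokenizer such as `simple_tokenizer` would raise an `AttributeError` when handed a list, which is the kind of breakage existing callers could hit.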

jankovicsandras commented 3 days ago

I think it's very important to ensure that the same tokenizer function is used in index creation and at query time. I made an optimized rewrite where the tokenizer function is registered in the class: https://github.com/jankovicsandras/bm25opt
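
Until something like this lands in the package, the same guarantee can be approximated from the outside with a thin wrapper. This is only a sketch under stated assumptions, not bm25opt's or rank_bm25's implementation: `SharedTokenizerIndex`, `OverlapIndex`, and `tokenize` are hypothetical names, and `OverlapIndex` is a trivial stand-in scorer so the example runs without third-party packages.

```python
class SharedTokenizerIndex:
    """Hypothetical wrapper: one tokenizer for both indexing and querying."""

    def __init__(self, corpus, tokenizer, index_cls):
        self.tokenizer = tokenizer
        # Tokenize documents here, then hand token lists to the index class.
        self.index = index_cls([tokenizer(doc) for doc in corpus])

    def get_scores(self, text):
        # The query goes through the same tokenizer as the documents.
        return self.index.get_scores(self.tokenizer(text))


class OverlapIndex:
    """Stand-in scorer (not BM25): counts shared distinct tokens."""

    def __init__(self, tokenized_corpus):
        self.docs = [set(doc) for doc in tokenized_corpus]

    def get_scores(self, query_tokens):
        # Score = number of distinct query tokens found in each document.
        return [len(doc & set(query_tokens)) for doc in self.docs]


def tokenize(text):
    # Example tokenizer (an assumption): lowercase and split on whitespace.
    return text.lower().split()


index = SharedTokenizerIndex(
    ["Hello there world", "Goodbye cruel world"], tokenize, OverlapIndex
)
print(index.get_scores("HELLO World"))  # [2, 1]
```

With rank_bm25 installed, passing `BM25Okapi` as `index_cls` should work the same way, since it accepts a pre-tokenized corpus and a token list in get_scores.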