Open Alkacid opened 6 months ago
I do like the idea for this change, but am a little worried about making backward-incompatible changes considering how many people seem to be using the package.
I'd like to leave this issue open and see if there's more support for this.
I think it's very important to ensure that the same tokenizer function is used in index creation and at query time. I made an optimized rewrite where the tokenizer function is registered in the class: https://github.com/jankovicsandras/bm25opt
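To illustrate the idea of registering the tokenizer in the class, here is a minimal sketch. It is not the code of this package or of bm25opt: `TinyBM25` and its parameters are hypothetical names, and the scoring is a plain Okapi BM25 variant chosen only to keep the example self-contained. The class stores the tokenizer once and reuses it in `get_scores`, while still accepting a pre-tokenized list so that existing callers would keep working.

```python
import math
from collections import Counter


class TinyBM25:
    """Hypothetical, simplified BM25 index that remembers its tokenizer."""

    def __init__(self, corpus, tokenizer=str.split, k1=1.5, b=0.75):
        # Assumption: corpus is a non-empty list of raw strings.
        self.tokenizer = tokenizer
        self.k1, self.b = k1, b
        # Term frequencies per document, built with the registered tokenizer.
        self.docs = [Counter(tokenizer(doc)) for doc in corpus]
        self.doc_len = [sum(tf.values()) for tf in self.docs]
        self.avgdl = sum(self.doc_len) / len(self.docs)
        # Document frequency: number of documents containing each term.
        df = Counter(term for tf in self.docs for term in tf)
        n = len(self.docs)
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def get_scores(self, query):
        # A raw string is tokenized with the same function used for indexing;
        # a list is assumed to be pre-tokenized and is used unchanged.
        if isinstance(query, str):
            query = self.tokenizer(query)
        scores = []
        for tf, dl in zip(self.docs, self.doc_len):
            norm = self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            scores.append(
                sum(
                    self.idf.get(t, 0.0) * tf[t] * (self.k1 + 1) / (tf[t] + norm)
                    for t in query
                )
            )
        return scores


# Usage: a lowercasing tokenizer is applied to documents and query alike.
index = TinyBM25(["Hello World", "goodbye moon"], tokenizer=lambda s: s.lower().split())
scores = index.get_scores("HELLO")
```

Branching on the query type is one possible way to address the backward-compatibility concern: callers who already pass token lists would see no change in behavior.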
I noticed that a custom tokenizer can be passed in during initialization to tokenize the input documents, but that tokenizer is not applied to the query in the get_scores method. This means the caller has to tokenize the query manually before passing it in. Would it be possible to add the following content at the beginning of the get_scores method: