dorianbrown / rank_bm25

A Collection of BM25 Algorithms in Python
Apache License 2.0
983 stars, 83 forks

'documents' argument in get_top_n #4

Closed: lambdaofgod closed this issue 2 years ago

lambdaofgod commented 4 years ago

What is this argument for? It seems it can only introduce bugs, since the only thing checked is that there are as many documents as there are entries in the corpus.

dorianbrown commented 4 years ago

First off, thanks for your interest! I made this for a project I was doing last year, but I don't use it much anymore.

Regarding the argument, it's used to look up the original documents so they can be returned. The BM25() object doesn't store the original documents, only the tokenized corpus. It would be possible to store both, but that would use more memory.
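To make that concrete, here is a minimal sketch of the pattern, not the library's actual code. `TinyIndex` is a hypothetical stand-in class, and its scoring is a plain count of query-term occurrences rather than BM25. It only shows why the caller has to pass the original documents back in: the index keeps the token lists and uses the ranked positions to pick from whatever list it is given.

```python
class TinyIndex:
    """Hypothetical stand-in for a BM25 index: keeps only the tokenized corpus."""

    def __init__(self, tokenized_corpus):
        self.tokenized_corpus = tokenized_corpus
        self.corpus_size = len(tokenized_corpus)

    def get_scores(self, query_tokens):
        # Toy relevance score (NOT BM25): number of query-token occurrences per document.
        return [
            sum(doc.count(tok) for tok in query_tokens)
            for doc in self.tokenized_corpus
        ]

    def get_top_n(self, query_tokens, documents, n=5):
        # The original texts are not stored, so the caller supplies them.
        # They must line up one-to-one with the indexed token lists.
        if len(documents) != self.corpus_size:
            raise ValueError("documents must have one entry per indexed document")
        scores = self.get_scores(query_tokens)
        # Rank document positions by score, highest first.
        ranked = sorted(range(self.corpus_size), key=lambda i: scores[i], reverse=True)
        return [documents[i] for i in ranked[:n]]


corpus = [
    "Hello there good man!",
    "It is quite windy in London",
    "How is the weather today?",
]
tokenized_corpus = [doc.lower().split() for doc in corpus]

index = TinyIndex(tokenized_corpus)
top = index.get_top_n("windy london".split(), corpus, n=1)
print(top)
```

Note that the length check only catches a list of the wrong size. A list of the right length but in a different order would silently return the wrong texts, which is the kind of bug the question is pointing at.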

Out of curiosity, does the current implementation cause problems for you? I'm open to changing it.