dorianbrown / rank_bm25

A Collection of BM25 Algorithms in Python
Apache License 2.0

Support non-repeatably iterable corpus for tokenizer=None #17

Closed Witiko closed 2 years ago

Witiko commented 2 years ago

Currently, rank-bm25 requires that corpus is repeatedly iterable and sized (i.e. defines __len__()).
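
To illustrate the two properties in question, here is a plain-Python sketch that is independent of rank-bm25 (the names `corpus_list` and `corpus_gen` are made up for this example). A list is sized and can be iterated repeatedly, whereas a generator is neither:

```python
# Hypothetical pre-tokenized corpus; all names here are illustrative only.
corpus_list = [["hello", "world"], ["foo", "bar", "baz"]]


def corpus_gen():
    """Yield the same documents lazily, one at a time."""
    for doc in corpus_list:
        yield doc


gen = corpus_gen()

# A list defines __len__() and yields its items on every pass.
list_is_sized = hasattr(corpus_list, "__len__")
first_pass_list = sum(1 for _ in corpus_list)
second_pass_list = sum(1 for _ in corpus_list)

# A generator has no __len__() and is exhausted after a single pass.
gen_is_sized = hasattr(gen, "__len__")
first_pass_gen = sum(1 for _ in gen)
second_pass_gen = sum(1 for _ in gen)

print(list_is_sized, first_pass_list, second_pass_list)  # True 2 2
print(gen_is_sized, first_pass_gen, second_pass_gen)     # False 2 0
```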

When the corpus is not pre-tokenized (i.e. tokenizer is not None), then this makes sense: __init__() will iterate across the corpus several times, so we may as well require that the corpus is a list or some other data type that is repeatedly iterable and sized. However, when the corpus is pre-tokenized (i.e. tokenizer is None), then we only iterate over the corpus once in _initialize(). Furthermore, we don't need to know its size beforehand, because we can just count the number of iterations.

This pull request makes it possible to use a non-repeatedly iterable non-sized corpus such as a generator when tokenizer is None. This is useful if you need to generate your corpus on the fly and don't know the number of your documents beforehand.
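
The single-pass idea can be sketched roughly as follows. This is a minimal illustration under my own assumptions, not the actual rank-bm25 source: `initialize_single_pass` and `generate_documents` are hypothetical names, and the statistics collected are only the ones a BM25 variant typically needs (document lengths, term frequencies, document frequencies). The point is that the corpus size is counted during iteration instead of being taken from `len(corpus)`:

```python
from collections import Counter


def initialize_single_pass(corpus):
    """Collect BM25-style statistics in one pass over a pre-tokenized corpus.

    NOTE: hypothetical helper for illustration, not the rank-bm25 code.
    """
    corpus_size = 0        # counted while iterating, so len(corpus) is not needed
    total_tokens = 0
    doc_lengths = []
    doc_term_freqs = []    # per-document term frequencies
    doc_freq = Counter()   # number of documents containing each term

    for document in corpus:
        doc_lengths.append(len(document))
        total_tokens += len(document)
        term_freqs = Counter(document)
        doc_term_freqs.append(term_freqs)
        doc_freq.update(term_freqs.keys())  # each term counted once per document
        corpus_size += 1

    avgdl = total_tokens / corpus_size if corpus_size else 0.0
    return corpus_size, avgdl, doc_lengths, doc_term_freqs, doc_freq


def generate_documents():
    """Produce tokenized documents on the fly; the total count is not known upfront."""
    yield ["hello", "there", "good", "man"]
    yield ["it", "is", "quite", "windy", "in", "london"]
    yield ["how", "is", "the", "weather", "today"]


corpus_size, avgdl, doc_lengths, _, doc_freq = initialize_single_pass(generate_documents())
print(corpus_size, avgdl, doc_freq["is"])  # 3 5.0 2
```

Because the generator is consumed exactly once and never passed to `len()`, the same function works for lists and generators alike.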

Witiko commented 2 years ago

@dorianbrown Please let me know if there is anything I can do to get this merged.

dorianbrown commented 2 years ago

Hi Witiko,

Thanks a lot for this contribution. The case of using a generator as a corpus never occurred to me, but it seems like a very useful bit of functionality to have. And thanks for your patience, it's been a busy few weeks :smile:

Having looked through the changes, I don't see them causing any issues, so I'll merge them in and check whether my CD is still working correctly.

dorianbrown commented 2 years ago

The branch has been merged, and the changes have been published to PyPI as version 0.2.2 of the package.

Thanks again for helping make this package better!

Witiko commented 2 years ago

@dorianbrown Thank you, much appreciated.