dorianbrown / rank_bm25

A Collection of BM25 Algorithms in Python
Apache License 2.0

How to handle big datasets? #24

Closed. marcelodiaz558 closed this issue 2 years ago

marcelodiaz558 commented 2 years ago

Hi, I have a dataset of around 5M documents (50 GB raw) I'd like to index with BM25. Indexing a subset of them (50K) works perfectly. However, when I try indexing all of them, the script crashes silently after a while; my machine has 25 GB of RAM, but it looks like that's not enough.

To save as much memory as possible, I'm passing an iterator to BM25Okapi that reads the documents one by one from disk.

Any ideas on how to minimize memory usage in scenarios like this, if that's possible? If I manage to handle it, I can help with a pull request adding an optional memory-friendly parameter or something similar.
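For reference, a minimal sketch of the streaming setup described above (the file paths and whitespace tokenizer are placeholders, and whether a plain generator is accepted depends on the installed rank_bm25 version):

```python
from rank_bm25 import BM25Okapi

def iter_tokenized_docs(paths):
    # Hypothetical helper: yield one tokenized document at a time so the
    # raw text never has to sit in memory all at once.
    for path in paths:
        with open(path, encoding="utf-8") as f:
            yield f.read().lower().split()

# Note: BM25Okapi still builds a term-frequency dict per document in RAM,
# so streaming the corpus only avoids holding the raw text, not the index.
bm25 = BM25Okapi(iter_tokenized_docs(["doc_000.txt", "doc_001.txt"]))
scores = bm25.get_scores("example query".lower().split())
```

Even with the iterator, the per-document frequency dictionaries that BM25Okapi keeps in memory grow with the corpus, which is presumably what exhausts the 25 GB here.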

marcelodiaz558 commented 2 years ago

After doing some tests, I don't think such a reduction is feasible for my project, since the serialized embeddings would still be very large on disk given my large vocabulary and the number of documents.

However, I have some ideas for slightly modifying the algorithm so it consumes less RAM at the cost of some inference speed. I'll work on that and try to contribute once I have some free time.
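One direction that trades speed for memory (not necessarily what is meant above) is to shard the corpus, pickle a separate BM25 index per shard, and load shards one at a time at query time. The shard size and file names below are made up for illustration, and because IDF is computed per shard rather than globally, the merged scores only approximate a single global index:

```python
import pickle
from rank_bm25 import BM25Okapi

SHARD_SIZE = 100_000  # documents per shard; tune to available RAM

def build_shards(tokenized_docs, prefix="bm25_shard"):
    # Build one small index per shard and pickle it to disk,
    # so only one shard's index is ever held in memory.
    shard, n_shards = [], 0
    for doc in tokenized_docs:
        shard.append(doc)
        if len(shard) == SHARD_SIZE:
            with open(f"{prefix}_{n_shards}.pkl", "wb") as f:
                pickle.dump(BM25Okapi(shard), f)
            shard = []
            n_shards += 1
    if shard:
        with open(f"{prefix}_{n_shards}.pkl", "wb") as f:
            pickle.dump(BM25Okapi(shard), f)
        n_shards += 1
    return n_shards

def search(query_tokens, n_shards, prefix="bm25_shard"):
    # Load shards one at a time; slower per query, but peak RAM stays bounded.
    all_scores = []
    for shard_id in range(n_shards):
        with open(f"{prefix}_{shard_id}.pkl", "rb") as f:
            bm25 = pickle.load(f)
        all_scores.extend(bm25.get_scores(query_tokens))
    return all_scores
```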

Facico commented 2 years ago

I think pyserini might be helpful to you.
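For anyone landing here, a rough sketch of the Pyserini route (the index directory and query are placeholders, and the exact import path depends on the Pyserini version; the corpus first has to be converted to Pyserini's JSONL collection format and indexed):

```python
from pyserini.search.lucene import LuceneSearcher

# Assumes a Lucene index was already built from the corpus, e.g.:
#   python -m pyserini.index.lucene --collection JsonCollection \
#       --input corpus_jsonl/ --index bm25_index/ \
#       --generator DefaultLuceneDocumentGenerator --threads 4
searcher = LuceneSearcher("bm25_index/")
hits = searcher.search("example query", k=10)
for hit in hits:
    print(hit.docid, hit.score)
```

Since Lucene keeps the index on disk rather than in Python objects, memory use stays roughly flat regardless of corpus size.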