Closed: marcelodiaz558 closed this issue 2 years ago
After doing some tests, I don't think such a reduction is feasible for my project, since the serialized embeddings would still be very large on disk given my large vocabulary and the number of documents.
However, I have some ideas for slightly modifying the algorithm so it consumes less RAM at the cost of slower inference. I'll work on that and try to contribute once I have some free time.
Hi, I have a dataset of around 5M documents (50 GB raw) that I'd like to index with BM25. Indexing a subset of them (50K) works perfectly. However, when I try indexing all of them, the script crashes silently after a while; my machine has 25 GB of RAM, but that apparently isn't enough.
To save as much memory as possible, I'm passing an iterator to BM25Okapi that reads the documents one by one from disk.
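For reference, this is roughly my setup (a minimal sketch; the `corpus.txt` path and the whitespace tokenizer are placeholders for my actual data and tokenization):

```python
from rank_bm25 import BM25Okapi

def iter_tokenized_docs(path):
    """Yield one tokenized document at a time so the raw corpus never sits in RAM."""
    with open(path, encoding="utf-8") as f:
        for line in f:  # one document per line
            yield line.split()  # placeholder tokenizer

# BM25Okapi only iterates over the corpus once, so a generator works here;
# the term-frequency structures it builds internally still live in memory, though.
bm25 = BM25Okapi(iter_tokenized_docs("corpus.txt"))
```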
Any ideas on how to minimize memory usage in such scenarios, if that's possible? If I manage to handle it, I can help with a pull request adding an optional memory-friendly parameter or something similar.