Memory handling for LSH index with billions of items

ekzhu / datasketch

MinHash, LSH, LSH Forest, Weighted MinHash, HyperLogLog, HyperLogLog++, LSH Ensemble and HNSW

https://ekzhu.github.io/datasketch

MIT License

2.59k stars 295 forks source link

Memory handling for LSH index with billions of items #19

Closed duhaime closed 7 years ago

duhaime commented 7 years ago

Hello, I'm really liking this library, but have recently started running into RAM limitations trying to handle billions of MinhashLSH index members on a 16GB quadcore OSX.

Right now, I'm essentially looping over my objects, hashing each, and adding the hash result (of class Minhash) to one large MinhashLSH index for subsequent querying. Is there a less memory intensive way of accomplishing a similar goal? Ideally, I'd like to post the Minhash results to a db then query the db for matches so as to afford much higher scalability.

I'm happy to help put some of this in place if it's not already a part of this toolkit, but wanted to ask in the meantime to see if your team had any immediate thoughts. In any event, thanks again for this awesome work.

ekzhu commented 7 years ago

Thank you for your interest in this project! It's really encouraging!

This is a very interesting issue. I my past research projects when I needed to create MinHash LSH indexes for hundreds of millions of sets, I used another library in Golang. It is much more memory efficient.

However I am very much interested in making this Python library memory efficient. One thing I learnt from using a (disk-based) database for storing the LSH index is that it is very slow. See https://github.com/ekzhu/go-sql-lsh. So I think we have two options: 1) Optimize the Python code to make it more memory efficient. I am sure a constant factor improvement can be done, since the current code was not written with any optimization in mind. 2) Utilize a storage engine to store the MinHash LSH data structure. The disk I/O can be a huge issue here. Perhaps this can be overcome by using distributed in-memory systems or smart caching.

I am curious, what is the memory efficiency of using the current MinHash LSH implementation? Approximately how many MinHashes per 1 GB of memory usage?

duhaime commented 7 years ago

Thanks so much for your response, @ekzhu, it's very helpful to see these other libraries!

I took a look through the Minhash and LSH classes last night looking for some memory optimizations and ended up getting quite an education! If you are aware of memory optimizations that can reduce storage requirements within an in-memory implementation, those would be great, but I certainly didn't see anything obvious.

Even with additional optimizations in place, I suppose for any arbitrarily sized project db storage would help. My main question now is whether this library is the place for the db hooks. Right now, I'm thinking the Minhash and LSH classes are beautifully clean, so it might make more sense for users to subclass the LSH class and modify the insert() and query() methods to suit their own situations.

For the moment, I think I'll go in that direction, rather than introduce any changes to this excellent API, so it should be safe to close this issue.

ekzhu commented 7 years ago

Thank you for kind words! In fact @fpug is working on a new MinHash class LeanMinHash that reduces the memory usage and provide faster serialization than the current MinHash class. This should be helpful when you want to deal with external storage.

If the external storage solution is generic enough, it may be a good addition to this library.