ekzhu / datasketch

MinHash, LSH, LSH Forest, Weighted MinHash, HyperLogLog, HyperLogLog++, LSH Ensemble and HNSW
https://ekzhu.github.io/datasketch
MIT License
2.59k stars 296 forks source link

Support for MySQL as LSH storage #158

Closed surkova closed 2 years ago

surkova commented 3 years ago

There's a Go implementation for LSH storage for Postgres and SQLite, though there's none for MySQL. Neither of relational databases is supported directly in this library, I wonder why?

ekzhu commented 3 years ago

Memory is often large enough for most usage scenarios of LSH. Plus in-memory data structure is better optimized for querying and updates. If you want persistence, just use pickle.dump and pickle.load. The LSH is not your primary data store just a retrieval index, so it doesn't hurt to rebuild it from scratch -- it is also fast.

I am not sure what's the usage scenario for SQL-based LSH storage though.

surkova commented 3 years ago

@ekzhu we are now looking at using datasketch in a production system which will be doing MinHash search in the LSH on online traffic. If not found a similar MinHash, we would be adding it to the LSH.

Since we are working on the endless amount of data with no clear start or end, with the current traffic, we see that we might make it about 1 year before filling up 32Gb of Redis. Ideally, we do not want to drop items out of Redis (mainly for keeping it simple, it doesn't look possible though), but there are also cases when Redis might get out of commission due to infrastructure issues and we need to restore it from some persistent storage. Cassandra unfortunately is not an option for us right now, since we don't have it on our production platform.

pickle is good as long as no other programming languages are involved or python's implementation of pickle is compatible between different versions. Ideally, I'd want some more universal serialization of the LSH object, similar to what already exists on LeanMinHash.

I looked at your example of the LSH implementation for relational DB and found that for proper lookup there needs to be indexes on each column storing a hashed value. My understanding is that with 128 permutations (which we chose for our case) there will need to be at least 129 indexes (primary id + hashed values columns) and Postgres allows that, while MySQL limits the maximum number of indexes by 64 per table.

ekzhu commented 3 years ago

Cool scenario!

Do you think having multiple indexes may help with the storage issue with a single index? You can parallelize query over multiple indexes. Of course there will be duplicate across different indexes but you can always dedup in post-processing. Also, once in a while you can compact your indexes by inserting all the entries in to one.

I think a universal serialization method would be great for persistent storage: Stores the LSH index in a binary format, and deserailize into various storage engines such as memory, Redis, etc.. We may be able to use a well-supported binary format such as Numpy's.

surkova commented 3 years ago

We are looking into horizontal scaling of Redis instances with help of a proxy, but for this we would need some way to route how the new keys are being inserted.

Do you think you might be adding LSH serialization to the library in the near future?

ekzhu commented 3 years ago

Do you think you might be adding LSH serialization to the library in the near future?

Sorry for the late late reply. I don't have plan to add new features in the near future. But if you submit a pull request I would be happy to review it.