ekzhu / datasketch

MinHash, LSH, LSH Forest, Weighted MinHash, HyperLogLog, HyperLogLog++, LSH Ensemble and HNSW
https://ekzhu.github.io/datasketch
MIT License

Impact of MinHashLSH threshold on memory usage #206

Closed bryanyzhu closed 1 year ago

bryanyzhu commented 1 year ago

First of all, thanks for the great library. I'm using the standard MinHash and MinHashLSH to perform large-scale deduplication. While experimenting with different hyperparameters, I found that the threshold passed when creating the MinHashLSH has a big impact on memory usage.

For example, when I experiment on a subset of my in-house data, a threshold of 0.8 only consumes 10GB of memory when creating the index, but a threshold of 0.1 consumes more than 50GB. The MinHashLSH index is also bigger when pickled to disk: 2.5GB for threshold 0.8 versus 4.1GB for threshold 0.1. I have checked the documentation and issues but couldn't figure out the reason for such a drastic memory increase. I also don't understand why the index with threshold 0.1 is bigger on disk even though the number of entries is the same for both indexes. Can anyone help with some suggestions? Thank you.

ekzhu commented 1 year ago

Lower thresholds tend to increase memory usage because more hash tables are needed to recover low-similarity results, which have a lower probability of hash collision with the query.
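This trade-off can be made concrete with the standard LSH S-curve, P(collide) = 1 - (1 - s^r)^b, where s is Jaccard similarity, r is the number of MinHash values per band, and b is the number of bands (each band is one hash table). The sketch below is stdlib-only and illustrative; the function name and parameters are not part of datasketch's API:

```python
import math

def bands_needed(s, r, p=0.9):
    """Smallest number of bands b so that a pair with Jaccard similarity s
    collides in at least one band with probability >= p, derived from the
    S-curve P(collide) = 1 - (1 - s**r)**b."""
    return math.ceil(math.log(1 - p) / math.log(1 - s ** r))

# With r = 4 hashes per band, every extra band is another full hash table
# holding a key for each indexed set:
print(bands_needed(0.8, 4))  # high-similarity pairs need only a handful of bands
print(bands_needed(0.1, 4))  # low-similarity pairs need tens of thousands
```

This is why dropping the threshold from 0.8 to 0.1 multiplies the number of hash tables, and with them the in-memory footprint of the index.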

Larger on-disk size is probably due to serialization.

For deduplication use cases, a high threshold may be enough. Retrieval at low similarity thresholds tends to be hard for LSH-based indexes.
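To show why a high threshold keeps the index small, here is a self-contained banded MinHash LSH sketch (not datasketch's implementation; all names and parameter choices are illustrative). With 16 bands of 8 rows, a configuration suited to thresholds around 0.8, only 16 hash tables are needed, whereas a low threshold would require far more:

```python
import random
import zlib

# Illustrative parameters, not datasketch's internals.
NUM_PERM = 128
BANDS, ROWS = 16, 8          # BANDS * ROWS == NUM_PERM; tuned for ~0.8 threshold
PRIME = (1 << 61) - 1        # Mersenne prime for the (a*x + b) % PRIME hash family
random.seed(42)
PERMS = [(random.randint(1, PRIME - 1), random.randint(0, PRIME - 1))
         for _ in range(NUM_PERM)]

def minhash(tokens):
    """128-permutation MinHash signature of a token set."""
    base = [zlib.crc32(t.encode()) for t in tokens]
    return [min((a * h + b) % PRIME for h in base) for a, b in PERMS]

def dedup_candidates(docs):
    """Return pairs of keys whose signatures agree in at least one band.
    Each band is one hash table, so memory grows linearly with BANDS."""
    tables = [{} for _ in range(BANDS)]
    pairs = set()
    for key, tokens in docs.items():
        sig = minhash(tokens)
        for table, i in zip(tables, range(0, NUM_PERM, ROWS)):
            band = tuple(sig[i:i + ROWS])
            for other in table.setdefault(band, []):
                pairs.add(tuple(sorted((key, other))))
            table[band].append(key)
    return pairs

docs = {
    "a": {f"tok{i}" for i in range(20)},
    "b": {f"tok{i}" for i in range(20)} | {"extra"},  # Jaccard ~0.95 vs "a"
    "c": {f"other{i}" for i in range(20)},            # unrelated
}
print(dedup_candidates(docs))  # "a"/"b" collide in some band; "c" stands alone
```

Near-duplicates at high similarity collide reliably even with this small number of bands; catching 0.1-similar pairs with comparable recall would need orders of magnitude more tables.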

bryanyzhu commented 1 year ago

Thanks for your explanation, I will go with a high threshold now.