Closed: bryanyzhu closed this issue 1 year ago
Lower thresholds tend to increase memory usage because you need more hash tables to recover low-similarity results, which have a lower probability of hash collisions with the query.
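To make that concrete, here is a small standalone sketch of the usual LSH banding analysis. It is not datasketch's exact parameter search (which minimizes a weighted false-positive/false-negative error), but it shows the same trend: with `num_perm` fixed, lowering the threshold pushes the choice toward more bands `b`, and each band is backed by its own hash table.

```python
# Rough banding analysis for MinHash LSH: with b bands of r rows each
# (b * r = num_perm), two sets with Jaccard similarity s share a bucket in
# at least one band with probability 1 - (1 - s**r) ** b, and the effective
# threshold is roughly (1 / b) ** (1 / r).
num_perm = 128

for target in (0.8, 0.5, 0.1):
    # Pick the divisor pair (b, r) whose approximate threshold is closest
    # to the target. datasketch optimizes a weighted false-positive /
    # false-negative integral instead, but the trend is the same.
    b, r = min(
        ((b, num_perm // b) for b in range(1, num_perm + 1) if num_perm % b == 0),
        key=lambda br: abs((1.0 / br[0]) ** (1.0 / br[1]) - target),
    )
    print(f"threshold ~{target}: b={b} hash tables, r={r} rows per band")
```

With `num_perm=128` this picks roughly 8 tables for a 0.8 threshold but 64 tables for a 0.1 threshold, so the per-entry bookkeeping grows by about that factor.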
The larger on-disk size is probably the serialization of those extra hash tables: each table keeps its own copy of the inserted keys, so more tables means more bytes even for the same set of entries.
For deduplication use cases, maybe a high threshold is enough? Retrieval at low similarity thresholds tends to be hard for LSH-based indexes.
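Something like this minimal sketch, using only the public MinHash / MinHashLSH API (the documents, keys, and whitespace tokenization below are just placeholders):

```python
# Minimal near-duplicate detection sketch with a high threshold.
from datasketch import MinHash, MinHashLSH

docs = {
    "doc1": "the quick brown fox jumps over the lazy dog near the river bank",
    "doc2": "the quick brown fox jumps over the lazy dog near the river bank today",
    "doc3": "an entirely different piece of text about something else",
}

def make_minhash(text, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for token in text.split():
        m.update(token.encode("utf8"))
    return m

lsh = MinHashLSH(threshold=0.8, num_perm=128)
duplicates = []
for key, text in docs.items():
    m = make_minhash(text)
    matches = lsh.query(m)       # previously indexed docs above ~0.8 Jaccard
    if matches:
        duplicates.append((key, matches))
    else:
        lsh.insert(key, m)       # only index documents we decide to keep

print(duplicates)  # doc2 should show up as a near-duplicate of doc1
```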
Thanks for the explanation, I will go with a high threshold now.
First of all, thanks for the great library, it works great. I'm using the standard MinHash and MinHashLSH to perform large-scale deduplication. But when I experiment with different hyperparameters, I find that the threshold used when creating the MinHashLSH has a big impact on memory usage.
For example, on a subset of my in-house data, a threshold of 0.8 consumes only `10GB` of memory when creating the index, while a threshold of 0.1 consumes more than `50GB`. The pickled index on disk is also bigger: `2.5GB` for threshold 0.8 versus `4.1GB` for threshold 0.1. I have checked the documentation and issues but couldn't figure out the reason for such a drastic memory increase. I also don't understand why the index with threshold 0.1 is bigger on disk even though the number of entries is the same for both indexes. Can anyone help with some suggestions? Thank you.
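For reference, a rough sketch of the kind of comparison described above (synthetic placeholder documents stand in for the in-house subset; only the threshold changes between the two runs):

```python
# Build the same entries into two indexes that differ only in threshold,
# then compare the pickled sizes.
import pickle
from datasketch import MinHash, MinHashLSH

def make_minhash(text, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for token in text.split():
        m.update(token.encode("utf8"))
    return m

docs = {f"doc{i}": f"synthetic document {i} with a few shared filler words"
        for i in range(10000)}
minhashes = {key: make_minhash(text) for key, text in docs.items()}

for threshold in (0.8, 0.1):
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    for key, m in minhashes.items():
        lsh.insert(key, m)
    size_mb = len(pickle.dumps(lsh)) / 1e6
    print(f"threshold={threshold}: pickled index ~{size_mb:.1f} MB")
```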