ekzhu / datasketch

MinHash, LSH, LSH Forest, Weighted MinHash, HyperLogLog, HyperLogLog++, LSH Ensemble and HNSW
https://ekzhu.github.io/datasketch
MIT License

Too large minhashLSH index #207

Open bryanyzhu opened 1 year ago

bryanyzhu commented 1 year ago

Hi, I have a question about large-scale LSH indexes. If I have billions of documents, I suppose even 1 TB of RAM is not enough for an in-memory LSH index. Is there any recommended way to use datasketch for this scenario? Thank you.

I also opened issue #206 because, for a small subset on my local machine (a 6 GB pickle file containing pre-computed minhashes), inserting into MinHashLSH with a threshold of 0.5 takes 31 GB of RAM. If I use LeanMinHash, it takes 26 GB. By simple extrapolation, indexing 600 GB of pre-computed minhashes would take about 3 TB of RAM, which is just too much. Maybe mapping the index to disk could be a viable solution. Looking forward to any suggestions, thank you.
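
For reference, this is the kind of LeanMinHash conversion I mean above; a minimal sketch with illustrative parameters:

from datasketch import MinHash, LeanMinHash

m = MinHash(num_perm=128)
for token in ['example', 'tokens']:
    m.update(token.encode('utf8'))
# LeanMinHash keeps only the seed and the hash values, not the full
# permutation state, so it uses less memory than a plain MinHash.
lean = LeanMinHash(m)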

ekzhu commented 1 year ago

There is an option to use Redis as the storage layer; it is also in-memory, but you can run it as a cluster.

Currently there is no support for an on-disk index. What's your usage scenario? Is it possible to pre-partition the billion documents in some way to avoid a single LSH index over all of them?
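
A minimal sketch of the Redis storage option (assuming a Redis server reachable at localhost:6379; the threshold and num_perm values are illustrative):

from datasketch import MinHash, MinHashLSH

# Point the LSH index at Redis instead of the default in-memory dict storage.
lsh = MinHashLSH(
    threshold=0.5,
    num_perm=128,
    storage_config={
        'type': 'redis',
        'redis': {'host': 'localhost', 'port': 6379},
    },
)

m = MinHash(num_perm=128)
m.update(b'example-token')
lsh.insert('doc-1', m)
print(lsh.query(m))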

bryanyzhu commented 1 year ago

Thank you for your suggestion. I think a distributed database is the way to go. I have tried Redis on a single machine; it takes more memory than the naive in-memory index, maybe due to the extra persistence.

My usage scenario is global deduplication over billions of documents. If I pre-partition the data, it is not global deduplication anymore. But I guess I could use some heuristic to cluster the data first and then perform an approximate global deduplication. What do you think?

ekzhu commented 1 year ago

People have been using the LSH hash keys for partitioning/bucketizing documents for deduplication. See: https://www.microsoft.com/en-us/research/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/

Document deduplication is necessary in building training datasets because the same content can be present in multiple documents of different datasets. We used a fuzzy deduplication process at the document level using min-hash LSH to compute a sparse document graph and the connected components in it to identify duplicate documents.

They used this library. You can read their paper about how they did it.

The key is that, if you select the right hyperparameters/threshold setting, documents in the same LSH hash bucket have a high probability of being duplicates. So you can think of a batch algorithm rather than storing everything in memory. To do this you probably want to use the MinHashLSH.hashtables object directly (see the sketch below); some understanding of how the LSH index works also helps.
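
A rough sketch of reading MinHashLSH.hashtables directly to dump candidate duplicate groups for batch processing. It assumes the default in-memory dict storage, where each hashtable exposes keys() and get(); the documents, threshold, and num_perm below are illustrative, and the storage interface may differ across datasketch versions:

from datasketch import MinHash, MinHashLSH

docs = {
    'doc-1': 'the quick brown fox jumps over the lazy dog',
    'doc-2': 'the quick brown fox jumps over a lazy dog',
    'doc-3': 'an entirely different piece of text',
}

lsh = MinHashLSH(threshold=0.5, num_perm=128)
for key, text in docs.items():
    m = MinHash(num_perm=128)
    for token in text.split():
        m.update(token.encode('utf8'))
    lsh.insert(key, m)

# Emit (band index, bucket key, doc keys) for every bucket that holds more
# than one document. These candidate groups can be written out and resolved
# in batches instead of querying the whole index in memory.
for band, table in enumerate(lsh.hashtables):
    for bucket_key in table.keys():
        doc_keys = table.get(bucket_key)
        if len(doc_keys) > 1:
            print(band, bucket_key, list(doc_keys))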

ZJaume commented 1 year ago

I'm trying to do something similar to what is explained in the paper:

Exact match duplicates would be computationally expensive, so we opted to take a fuzzy deduplication approach similar to other works [9, 17]. We used a hashing vectorizer with 1,048,576 features to vectorize documents (HashingVectorizer from scikit-learn), calculated min-hashes of the vectorized documents (using datasketch), and performed Locality Sensitive Hashing (LSH) through datasketch on all min-hashes in order to identify potential duplicates. We set our LSH parameters in such a way as to increase the likelihood that documents with Jaccard similarity ≥ 0.8 would occur in at least one LSH bucket together. Specifically, we used 20 bands of size 13 for a total of 260 hash functions.

However, they say that they calculate MinHashes of the vectorized documents, yet MinHash.update takes bytes as an argument and crashes if I pass it a sparse matrix. Any idea how this could be done?

ekzhu commented 1 year ago

@ZJaume for one-hot encoded vectors, you want to hash the indices of the dimensions with value 1. MinHash is a lossy compression of a set.

ZJaume commented 1 year ago

Do you mean something like this?

from sklearn.feature_extraction.text import HashingVectorizer
from datasketch import MinHash

text = ['This is an example text']
vectorizer = HashingVectorizer()
msh = MinHash()
X = vectorizer.fit_transform(text)
# hash only the non-zero feature indices of the first document
for i in X[0].indices:
    msh.update(str(i).encode())

ekzhu commented 1 year ago

Yep. Just make sure X[0].indices returns only the non-zero dimensions.

You may want to test different ways of binary encoding your indices to minimize latency on this critical path.

Also MinHash.bulk may be helpful.
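
A minimal sketch of MinHash.bulk with the same HashingVectorizer approach (the documents and num_perm value are illustrative):

from sklearn.feature_extraction.text import HashingVectorizer
from datasketch import MinHash

docs = ['this is the first example text', 'this is the second example text']
X = HashingVectorizer().fit_transform(docs)

# One list of encoded non-zero feature indices per document. MinHash.bulk
# builds MinHashes for all documents in one call, avoiding the overhead of
# initializing a separate MinHash object per document.
encoded = [[str(i).encode() for i in row.indices] for row in X]
minhashes = MinHash.bulk(encoded, num_perm=128)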

duytran1332002 commented 1 year ago

LSH hash keys for partitioning/bucketizing documents

Can you share with me how to do that?

ZJaume commented 1 year ago

I made some modifications that let you build an index that stores only one of the buckets. It is in my fork: https://github.com/ZJaume/datasketch/tree/partitions

liang-ctrl commented 9 months ago

Hi, we encountered a similar scenario: there are hundreds of millions of text documents to process. So far we have tried a Spark cluster, but it didn't work well. Has a better approach come up in the meantime?