They are technically different, but effectively the same. $k$ in that paper refers to the number of hash functions, while the number of permutations in datasketch refers to another variant of MinHash that uses just one hash function.
Variant with many hash functions VS Variant with a single hash function
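For a concrete view of the datasketch side, here is a minimal sketch (exact defaults may differ across versions):

```python
from datasketch import MinHash

# num_perm plays the role of k from the paper's point of view: it is the number of
# permutations applied to a single underlying hash function, not k independent hash functions.
m = MinHash(num_perm=128)
for token in "some example document".split():
    m.update(token.encode("utf8"))

print(len(m.hashvalues))  # 128 minhash values, one per permutation
```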
Also, in datasketch, r and b are calculated for optimal FN and FP rates (with equal weights by default); see here for the optimal calculation.
You can update the code to set both b and r; see the source.
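For example, a minimal sketch assuming the current datasketch API, where MinHashLSH accepts a params=(b, r) tuple and FP/FN weights:

```python
from datasketch import MinHashLSH

# Default behaviour: datasketch searches for the (b, r) that minimize a weighted sum
# of the false positive and false negative probabilities (equal weights by default).
lsh = MinHashLSH(threshold=0.8, num_perm=128, weights=(0.5, 0.5))
print(lsh.b, lsh.r)

# To pin b and r yourself (e.g. to mimic a paper's settings), pass them explicitly;
# note that b * r must not exceed num_perm.
lsh_fixed = MinHashLSH(threshold=0.8, num_perm=128, params=(16, 8))
print(lsh_fixed.b, lsh_fixed.r)  # 16 8
```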
There isn't a direct way to reproduce that paper if I read it correctly as it involves using edit similarity as a second step to reduce false positives, which is not included in this repo but should be easy to add.
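If someone wants to add that second pass, it could look roughly like this. The edit_similarity helper, the 0.8 cutoff, and the toy docs are illustrative only, not the paper's exact implementation:

```python
from difflib import SequenceMatcher

def edit_similarity(a: str, b: str) -> float:
    # cheap stand-in for 1 - edit_distance / max_len; swap in a real edit distance if needed
    return SequenceMatcher(None, a, b).ratio()

def filter_candidates(candidate_pairs, docs, cutoff=0.8):
    # keep only the LSH candidate pairs whose actual text similarity clears the cutoff,
    # which removes most MinHash false positives
    return [
        (i, j) for i, j in candidate_pairs
        if edit_similarity(docs[i], docs[j]) >= cutoff
    ]

docs = {0: "def foo(): return 1", 1: "def foo():  return 1", 2: "print('hi')"}
print(filter_candidates([(0, 1), (0, 2)], docs))  # likely [(0, 1)]
```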
There is also something else that is unclear from the paper — what they do with a connected component of duplicates.
364,613,570 records in C4 isn't something I think a single script can handle. I believe Google used some version of a distributed cluster and implementation for both that paper and the SimHash one.
Even if it scales linearly, my script (the fastest one is in BigCode) handles 15M docs in 3.5 hrs, which means it would take about 85 hours to dedup C4.
Thanks for the quick answer! I assume the fastest script you mentioned is the one at https://github.com/bigcode-project/bigcode-analysis/pull/11, right? How large was data/python in this?
Here: https://github.com/ChenghaoMou/bigcode-analysis/blob/minhash_improve/data_analysis/near-deduplication/minhash_deduplication_alt.py (I haven't merged it yet).
Here are the results on data/python using an 80-core machine (memory consumption ~300 GB):
[12/09/22 20:24:44] INFO load_dataset : 28.08 seconds minhash_deduplication_alt.py:167
INFO minhash : 3689.60 seconds minhash_deduplication_alt.py:167
INFO clustering : 6322.42 seconds minhash_deduplication_alt.py:167
INFO filtering : 2235.21 seconds minhash_deduplication_alt.py:167
INFO save : 1478.33 seconds minhash_deduplication_alt.py:167
INFO Data Number (before) : 15148604 minhash_deduplication_alt.py:168
INFO Data Number (after) : 13032937 (86.03%) minhash_deduplication_alt.py:169
INFO Duplicate Number : 2115667 (13.97%) minhash_deduplication_alt.py:170
INFO Total Time : 13753.72 seconds minhash_deduplication_alt.py:171
INFO Deduplicated Dataset : results/output/deduplicated minhash_deduplication_alt.py:172
INFO 🤗 Happy Deduplicating 🤗 minhash_deduplication_alt.py:173
How big was that dataset? Would it scale linearly to, say, 1T?
Okay, I see what you mean. The physical size (bytes) usually matters less than the number of docs for these methods: you could have 800 GB of Java data in 42M files, or 24 GB of Python data in 15M files.
Based on my experiments, it is roughly linear in terms of the number of files.
Do you have a hf dataset link? Maybe I can give it a try to see how long it will take.
I am also interested in reproducing the paper, or at least doing large-scale deduplication even if not fully implementing all the details. I guess pyspark is the most cost/time-efficient solution here?
Yes, the pyspark version should handle TB level datasets well given enough compute, but it won't be an exact replica of what was used in the paper.
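As a rough illustration of the pyspark route (this uses Spark ML's built-in MinHashLSH rather than the repo's script, so the banding behaviour is not identical, and the column names, parameters, and toy data are placeholders):

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF, MinHashLSH

spark = SparkSession.builder.appName("minhash-dedup-sketch").getOrCreate()

df = spark.createDataFrame(
    [(0, "the quick brown fox jumps over the lazy dog"),
     (1, "the quick brown fox jumped over the lazy dog"),
     (2, "completely different content here")],
    ["id", "content"],
)

pipeline = Pipeline(stages=[
    Tokenizer(inputCol="content", outputCol="tokens"),
    # MinHashLSH treats non-zero entries as set membership, so raw term counts are fine
    HashingTF(inputCol="tokens", outputCol="features", numFeatures=1 << 18),
    MinHashLSH(inputCol="features", outputCol="hashes", numHashTables=10),
])
model = pipeline.fit(df)
hashed = model.transform(df)

# self-join to find candidate pairs at Jaccard distance <= 0.3 (similarity >= 0.7)
pairs = (
    model.stages[-1]
    .approxSimilarityJoin(hashed, hashed, 0.3, distCol="jaccard_dist")
    .filter("datasetA.id < datasetB.id")
)
pairs.show(truncate=False)  # should surface the (0, 1) pair
```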
Here is an early analysis of the scaling properties on The Stack dataset:
I wonder if the Lee et al. paper has r and b reversed by accident:
"After some experimentation, we chose to use b = 20, and r = 450, so k = 9,000, so as to make sure a collision at the desired Jaccard index threshold of 0.8 had a high probability of occurring."
With those values,
1 - (1 - thresh**r)**b
would yield ~0, but if you flip them then it is ~1. Which makes sense: the higher the number of bands, the greater the chance of at least one collision.
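A quick numeric check of that, using their reported threshold of 0.8:

```python
# P(collision) = 1 - (1 - s**r)**b at the target similarity s = 0.8
s = 0.8

b, r = 20, 450            # as printed in the paper
print(1 - (1 - s**r)**b)  # ~0, since 0.8**450 is astronomically small

b, r = 450, 20            # swapped
print(1 - (1 - s**r)**b)  # ~0.99, since 0.8**20 ≈ 0.0115 and 450 bands give many chances to collide
```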
Also, I am suspicious about the optimal r and b values set by datasketch.
For example, for a given budget (k) and target threshold, the following gives the (b, r) values where we have the steepest slope at the point s = threshold:
http://infolab.stanford.edu/~ullman/mmds/ch3.pdf
"The threshold is roughly where the rise is
the steepest, and for large b and r there we find that pairs with similarity
above the threshold are very likely to become candidates, while those below the
threshold are unlikely to become candidates – exactly the situation we want."
An approximation to the threshold is (1/b)^(1/r).

import numpy as np

# loss I used below to solve for r and b: since threshold ~ (1/b)^(1/r) and b ~ k/r,
# minimize |thresh**r - r/k| over r
thresh = 0.7  # target threshold from data (e.g. 0.7 is a good similarity at which I would like to look at some clusters when deduping my dataset)
k = 100       # budget
loss = np.inf
opt = {"b": None, "r": None}
for r in range(1, k):
    val = abs(thresh**r - r / k)
    if val < loss:
        loss = val
        opt["b"] = round(k / r)
        opt["r"] = r
This gives b=14 and r=7, but if I use optimal_param(0.7, 100) then I get b=11, r=9. If we look at the loss:
# loss with datasketch's optimal_param (b=11, r=9, so k=99)
r, k = 9, 99
abs(thresh**r - r/k)  # -> 0.050

# loss with my own implementation (b=14, r=7, so k=98)
r, k = 7, 98
abs(thresh**r - r/k)  # -> 0.010
With the method I use, if I increase the budget to a very high number (k = 100000000), we can see that the sigmoid curve gets steeper at the threshold.
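To see the effect without running the search at k = 100000000 (the brute-force loop above would be slow there), the same trend already shows up at smaller budgets; these budget values are just illustrative:

```python
def fit(thresh, k):
    # same loss as above: threshold ~ (1/b)^(1/r) with b ~ k/r
    best, loss = (None, None), float("inf")
    for r in range(1, k):
        val = abs(thresh**r - r / k)
        if val < loss:
            loss, best = val, (round(k / r), r)
    return best  # (b, r)

def p_collision(s, b, r):
    return 1 - (1 - s**r) ** b

thresh = 0.7
for k in (100, 10_000, 1_000_000):
    b, r = fit(thresh, k)
    probs = [p_collision(s, b, r) for s in (thresh - 0.05, thresh, thresh + 0.05)]
    print(k, (b, r), [round(p, 3) for p in probs])
# the jump from P(thresh - 0.05) to P(thresh + 0.05) widens as the budget grows,
# i.e. the S-curve gets steeper around the threshold
```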
Edit: I guess they are more or less the same, never mind this.
To your first point, I think you are right:
"These minimum hashes are then partitioned into r buckets, with b hashes per bucket. These b hashes are augmented into a single value, then if two documents have the same value in at least one bucket, they’ll be marked as a potential match."
They are using r for the number of bands and b for the number of rows.
On the second point, I don't know how much approximation went into that threshold calculation. Here is a more detailed way of calculating the false positive and false negative areas under that curve: "A Solution for Calculating the False Positive and False Negative in LSH Method to Find Similar Documents", which I believe is what datasketch's implementation uses.
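For reference, here is a minimal sketch of those two areas; I believe this is essentially what datasketch's parameter search minimizes as a weighted sum, but treat the exact form as my reading rather than a guarantee:

```python
from scipy.integrate import quad

def false_positive_area(threshold, b, r):
    # pairs below the threshold that still collide in at least one band
    return quad(lambda s: 1 - (1 - s**r) ** b, 0.0, threshold)[0]

def false_negative_area(threshold, b, r):
    # pairs above the threshold that never collide in any band
    return quad(lambda s: (1 - s**r) ** b, threshold, 1.0)[0]

# the two candidate settings discussed above, at threshold 0.7
for b, r in [(11, 9), (14, 7)]:
    print(b, r, false_positive_area(0.7, b, r), false_negative_area(0.7, b, r))
```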
It is worth noting that these numbers (loss, FPR, FNR) might be much higher in reality with actual datasets. There are many other factors that affect the results as well (shingle size, tokenization method, etc.). In some deduplication experiments for The Stack, we have observed false positive and false negative rates larger than 20% even though the theoretical values are well below 5%.
Interesting, I will take a look at that paper. I believe it depends heavily on the data distribution you have and how you define a false positive and a false negative. I think the approach they took in the Lee et al. paper is a good one. We know that edit similarity > Jaccard > MinHash LSH in terms of approximating near duplicates, but due to the large scale of the data it's not feasible to run edit similarity. So we can take the largest sample from our dataset that we can afford, plot the distribution of edit similarity vs. Jaccard, pick a good Jaccard threshold, and finally find optimal_param for the budget we can afford. This seems to me like the way to go about picking parameters.
Btw, thanks a lot for the Connected Components in MapReduce and Beyond implementation in pyspark, really appreciate it :) Not sure what this paper brings yet, but I will need some more time to read it anyway.
Yes, you are absolutely right, and that's precisely what we did for those parameters for The Stack.
I went with Connected Components in MapReduce and Beyond instead of the one you mentioned because there were just more references about it that I could find 😆
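For anyone curious, here is a minimal label-propagation sketch of grouping duplicate pairs into connected components in pyspark. This is deliberately not the large-star/small-star algorithm from that paper, just the simplest iterative version of the same idea, with made-up toy edges:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cc-sketch").getOrCreate()
sc = spark.sparkContext

# toy duplicate pairs produced by the MinHash step: (doc_id, doc_id)
edges = sc.parallelize([(1, 2), (2, 3), (4, 5)])

# treat the duplicate graph as undirected
adj = edges.flatMap(lambda e: [e, (e[1], e[0])]).distinct().cache()

# start with every node labelled by itself
labels = adj.keys().distinct().map(lambda n: (n, n))

while True:
    # push each node's current label to its neighbours and keep the minimum label seen
    pushed = adj.join(labels).map(lambda kv: (kv[1][0], kv[1][1]))
    new_labels = labels.union(pushed).reduceByKey(min).cache()
    changed = new_labels.join(labels).filter(lambda kv: kv[1][0] != kv[1][1]).count()
    labels = new_labels
    if changed == 0:
        break

# each connected component of duplicates collapses onto its smallest doc id,
# e.g. keep that document and drop the rest of the cluster
print(sorted(labels.collect()))  # [(1, 1), (2, 1), (3, 1), (4, 4), (5, 4)]
```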
I am closing this due to inactivity. Feel free to reopen if you have additional questions.
Hey, how do the following arguments for MinHash relate to the parameters of the Lee et al. paper? In particular, is num_perm the k parameter of Appendix A? (How do you set r and b then?) If one wanted to have the exact parameters that this paper used, is there an example somewhere?