ChenghaoMou / text-dedup

All-in-one text de-duplication
Apache License 2.0

MinHash dedup parameters #8

Closed TevenLeScao closed 1 year ago

TevenLeScao commented 1 year ago

Hey, how do the following arguments for MinHash:

python -m text_dedup.minhash \
    --ngram 1 \
    --num_perm 128 \
    --threshold 0.8

relate to the parameters of the Lee et al. paper? In particular, is num_perm the k parameter of Appendix A? (how do you set r and b then?)

If one wanted to have the exact parameters that this paper used, is there an example somewhere?

ChenghaoMou commented 1 year ago

They are technically different, but effectively the same. $k$ in that paper refers to the number of hash functions, while num_perm in datasketch refers to another variant of MinHash that uses a single hash function followed by many random permutations.

Variant with many hash functions vs. variant with a single hash function
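
To make the difference concrete, here is a rough, standalone sketch of the single-hash-function variant (my own names, not the repo's or datasketch's code): hash each shingle once, apply num_perm random affine permutations, and keep the minimum per permutation.

import hashlib
import random

MERSENNE_PRIME = (1 << 61) - 1  # modulus for the random affine permutations

def minhash_signature(shingles, num_perm=128, seed=42):
    """Single-hash-function MinHash: one base hash per shingle, permuted
    num_perm times with (a * h + b) mod p; keep the minimum per permutation."""
    rng = random.Random(seed)
    perms = [(rng.randrange(1, MERSENNE_PRIME), rng.randrange(MERSENNE_PRIME))
             for _ in range(num_perm)]
    sig = [MERSENNE_PRIME] * num_perm
    for s in shingles:
        h = int.from_bytes(hashlib.sha1(s.encode("utf-8")).digest()[:8], "big")
        for i, (a, b) in enumerate(perms):
            sig[i] = min(sig[i], (a * h + b) % MERSENNE_PRIME)
    return sig

# The Jaccard similarity of two documents is estimated by the fraction of
# signature positions where their two signatures agree.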

Also, in datasketch, r and b are calculated for optimal FN and FP rates (with equal weights by default); see here for the optimal calculation.

You can update the code to set both b and r yourself; see the source.
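
Roughly, the calculation integrates the collision-probability curve 1 - (1 - s^r)^b below the threshold (false-positive area) and its complement above the threshold (false-negative area), then picks the (b, r) pair with the smallest weighted sum. A minimal sketch of that idea (not datasketch's exact code; it assumes scipy is available):

from scipy.integrate import quad

def optimal_param(threshold, num_perm, fp_weight=0.5, fn_weight=0.5):
    # Pick (b, r) with b * r <= num_perm minimizing the weighted FP/FN areas.
    def fp_area(b, r):
        # pairs below the threshold that still collide
        return quad(lambda s: 1 - (1 - s**r) ** b, 0.0, threshold)[0]

    def fn_area(b, r):
        # pairs above the threshold that never collide
        return quad(lambda s: (1 - s**r) ** b, threshold, 1.0)[0]

    best, best_err = (1, 1), float("inf")
    for b in range(1, num_perm + 1):
        for r in range(1, num_perm // b + 1):
            err = fp_weight * fp_area(b, r) + fn_weight * fn_area(b, r)
            if err < best_err:
                best, best_err = (b, r), err
    return best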

There isn't a direct way to reproduce that paper, if I read it correctly, as it involves using edit similarity as a second step to reduce false positives; that step is not included in this repo, but it should be easy to add.
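
If someone wanted to add that second step, a minimal sketch could look like the following (my own helper, shown on characters for brevity; the paper defines edit similarity as 1 - EditDistance / max length):

def edit_similarity(a: str, b: str) -> float:
    # 1 - edit_distance / max length, via the standard dynamic program.
    # Quadratic per pair, so only run it on the LSH candidate pairs.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return 1 - prev[-1] / max(len(a), len(b), 1)

# Example second pass: keep a candidate pair only if it clears a cutoff.
is_duplicate = edit_similarity("near duplicate text", "near-duplicate text") >= 0.8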

ChenghaoMou commented 1 year ago

There is also something else that is unclear from the paper: what to do with a connected component of duplicates.
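
For context, a common choice (and what the clustering/filtering steps further down this thread effectively do, as I understand it) is to group duplicate pairs into connected components and keep one document per component. A minimal union-find sketch, with a hypothetical pair list:

parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def union(x, y):
    parent[find(x)] = find(y)

duplicate_pairs = [(0, 3), (3, 7), (5, 9)]  # hypothetical LSH output
for a, b in duplicate_pairs:
    union(a, b)

# Keep the smallest id per component (one possible policy; the paper does not
# say which member to keep). Documents that appear in no pair are kept as-is.
roots = {find(x) for x in parent}
keep = {min(x for x in parent if find(x) == root) for root in roots}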

ChenghaoMou commented 1 year ago

364,613,570 records in C4 isn't something I think a single script can handle. I believe Google used some form of distributed cluster and implementation for both that paper and the SimHash one.

ChenghaoMou commented 1 year ago

Even if it scales linearly, my script (the fastest one is in BigCode) handles 15M docs in 3.5 hrs, which means it would take roughly 85 hours to dedup C4 (364.6M / 15M ≈ 24, and 24 × 3.5 h ≈ 85 h).

TevenLeScao commented 1 year ago

Thanks for the quick answer! I assume the fastest script you mentioned is the one at https://github.com/bigcode-project/bigcode-analysis/pull/11, right? How large was data/python in this?

ChenghaoMou commented 1 year ago

It is here: https://github.com/ChenghaoMou/bigcode-analysis/blob/minhash_improve/data_analysis/near-deduplication/minhash_deduplication_alt.py (I haven't merged it yet).

Here are the results on data/python using an 80-core machine (memory consumption ~300 GB):

[12/09/22 20:24:44] INFO  load_dataset          : 28.08 seconds               minhash_deduplication_alt.py:167
                    INFO  minhash               : 3689.60 seconds             minhash_deduplication_alt.py:167
                    INFO  clustering            : 6322.42 seconds             minhash_deduplication_alt.py:167
                    INFO  filtering             : 2235.21 seconds             minhash_deduplication_alt.py:167
                    INFO  save                  : 1478.33 seconds             minhash_deduplication_alt.py:167
                    INFO  Data Number (before)  : 15148604                    minhash_deduplication_alt.py:168
                    INFO  Data Number (after)   : 13032937 (86.03%)           minhash_deduplication_alt.py:169
                    INFO  Duplicate Number      : 2115667 (13.97%)            minhash_deduplication_alt.py:170
                    INFO  Total Time            : 13753.72 seconds            minhash_deduplication_alt.py:171
                    INFO  Deduplicated Dataset  : results/output/deduplicated minhash_deduplication_alt.py:172
                    INFO  🤗 Happy Deduplicating 🤗                            minhash_deduplication_alt.py:173

TevenLeScao commented 1 year ago

How big was that dataset? Would it scale linearly to, say, 1T?

ChenghaoMou commented 1 year ago

Okay, I see what you mean. The physical size (bytes) usually matters less than the number of docs for these methods: you could have 800 GB of Java data in 42M files, but only 24 GB of Python data in 15M files.

ChenghaoMou commented 1 year ago

Based on my experiments, it is roughly linear in the number of files.

ChenghaoMou commented 1 year ago

Do you have a hf dataset link? Maybe I can give it a try to see how long it will take.

KeremTurgutlu commented 1 year ago

I am also interested in reproducing the paper, or at least doing large-scale deduplication even if not implementing all the details. I guess PySpark is the most cost/time-efficient solution here?

ChenghaoMou commented 1 year ago

Yes, the PySpark version should handle TB-level datasets well given enough compute, but it won't be an exact replica of what was used in the paper.

Here is an early analysis of the scaling behaviour on The Stack dataset:

[Screenshot: scaling analysis on The Stack, 2023-03-08]
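
For reference, here is a rough sketch of running MinHash LSH at that scale with Spark ML's built-in MinHashLSH rather than this repo's script (the column names and input path are illustrative; note that Spark works with Jaccard distance, so a 0.8 similarity threshold becomes a 0.2 distance threshold):

from pyspark.sql import SparkSession
from pyspark.ml.feature import CountVectorizer, MinHashLSH, Tokenizer

spark = SparkSession.builder.appName("minhash-dedup-sketch").getOrCreate()
df = spark.read.parquet("s3://bucket/corpus")  # hypothetical input path

tokens = Tokenizer(inputCol="text", outputCol="tokens").transform(df)
vectors = (CountVectorizer(inputCol="tokens", outputCol="features", binary=True)
           .fit(tokens).transform(tokens))

model = MinHashLSH(inputCol="features", outputCol="hashes", numHashTables=9).fit(vectors)
pairs = model.approxSimilarityJoin(vectors, vectors, 0.2, distCol="jaccard_dist")
# `pairs` holds candidate duplicate pairs; cluster them (connected components)
# and drop all but one document per cluster in a follow-up step.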

KeremTurgutlu commented 1 year ago

I wonder if the Lee et al. paper has r and b reversed by accident:

"After some experimentation, we chose to use b = 20, and r = 450, so k = 9,000, so as to make sure a collision at the desired Jaccard index threshold of 0.8 had a high probability of occurring."

1 - (1 - thresh**r)**b would yield ~0 with those values, but if you swap r and b it is ~1, which makes sense: the higher the number of bands, the greater the chance of at least one collision.
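
A quick standalone check of that (plain arithmetic, nothing from the repo):

s = 0.8  # Jaccard threshold from the paper
print(1 - (1 - s**450) ** 20)   # b=20, r=450 as printed -> ~0.0
print(1 - (1 - s**20) ** 450)   # swapped: b=450, r=20   -> ~0.99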

Also, I am suspicious of the optimal r and b values chosen by datasketch.

For example, for a given budget (k) and target threshold, the following gives the (r, b) values where the S-curve is steepest at the threshold:

http://infolab.stanford.edu/~ullman/mmds/ch3.pdf

"The threshold is roughly where the rise is
the steepest, and for large b and r there we find that pairs with similarity
above the threshold are very likely to become candidates, while those below the
threshold are unlikely to become candidates – exactly the situation we want."

An approximation to the threshold is (1/b)^(1/r); with b ≈ k/r this becomes (r/k)^(1/r), i.e. thresh**r ≈ r/k, which is the loss I use below to solve for r and b:

import numpy as np

thresh = 0.7  # target threshold from the data (e.g. 0.7 is a similarity at which I would like to inspect some clusters when deduping my dataset)
k = 100       # budget (total number of hashes)

loss = np.inf
opt = {"b": None, "r": None}
for r in range(1, k):
    val = abs(thresh**r - r / k)  # deviation from thresh**r ~= r/k
    if val < loss:
        loss = val
        opt["b"] = round(k / r)
        opt["r"] = r

This gives b=14 and r=7, but if I use optimal_param(0.7, 100) then I get b=11, r=9. If we look at the loss:

# loss with datasketch optimal (b=11, r=9 -> k=99)
r, k = 9, 99
abs(thresh**r - r / k)   # -> 0.050

# loss with my own implementation (b=14, r=7 -> k=98)
r, k = 7, 98
abs(thresh**r - r / k)   # -> 0.010

With the method I use, if I increase the budget to a very high number (k = 100000000), we can see that the sigmoid curve gets steeper at the threshold.

[Plot: the collision-probability S-curve around the threshold for a very large budget k]

Edit: I guess they are more or less the same; never mind this.

ChenghaoMou commented 1 year ago

To your first point, I think you are right:

These minimum hashes are then partitioned into r buckets, with b hashes per bucket. These b hashes are augmented into a single value, then if two documents have the same value in at least one bucket, they’ll be marked as a potential match.

They are using r for the number of bands and b for the number of rows.

On the second point, I don't know how much approximation went into that threshold calculation. Here is a more detailed treatment of calculating the false positive and false negative areas under that curve: "A Solution for Calculating the False Positive and False Negative in LSH Method to Find Similar Documents", which I believe is what datasketch's implementation uses.

It is worth noting that these numbers (loss, FPR, FNR) might be much higher in reality with actual datasets. There are many other factors that affect the results as well (shingle size, tokenization method, etc.). In some deduplication experiments for The Stack, we have observed false positive and false negative rates larger than 20% even though the theoretical values were well below 5%.

KeremTurgutlu commented 1 year ago

Interesting, I will take a look at that paper. I believe it depends heavily on the data distribution you have and on how you define a false positive and a false negative. The approach taken in the Lee et al. paper seems like a good one. We know that edit similarity > Jaccard > MinHash LSH in terms of how well they approximate near-duplicates, but at this data scale it's not feasible to run edit similarity everywhere. So we can take the largest sample from our dataset that we can afford, plot the distribution of edit similarity vs. Jaccard, pick a good Jaccard threshold, and finally find optimal_param for the budget we can afford. This seems to me like the way to go about picking parameters.
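
A rough sketch of that recipe (my own helper names; difflib's ratio stands in for edit similarity, and in practice you would sample candidate pairs rather than uniform random pairs):

import random
from difflib import SequenceMatcher

def shingles(text, n=3):
    toks = text.split()
    return {" ".join(toks[i:i + n]) for i in range(max(len(toks) - n + 1, 1))}

def jaccard(a, b):
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / max(len(sa | sb), 1)

def sample_similarities(docs, num_pairs=10_000, seed=0):
    rng = random.Random(seed)
    for _ in range(num_pairs):
        a, b = rng.sample(docs, 2)
        # Plot these two values against each other, pick the Jaccard threshold
        # that best separates true near-duplicates, then feed it to
        # optimal_param(threshold, num_perm) for the budget you can afford.
        yield jaccard(a, b), SequenceMatcher(None, a, b).ratio()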

Btw, thanks a lot for the Connected Components in MapReduce and Beyond implementation in PySpark, really appreciate it :) Not sure what this paper brings in, but I will need some more time to read it anyway.

ChenghaoMou commented 1 year ago

Yes, you are absolutely right, and that's precisely what we did for those parameters for The Stack.

I went with Connected Components in MapReduce and Beyond instead of the one you mentioned because there were just more references about it that I could find 😆

ChenghaoMou commented 1 year ago

I am closing this due to inactivity. Feel free to reopen if you have additional questions.