ChenghaoMou / text-dedup

All-in-one text de-duplication
Apache License 2.0

How to get duplicates cluster ids? #18

Closed · konradkalita closed this issue 1 year ago

konradkalita commented 1 year ago

Hi, my use case is that I would like not only to remove duplicates from a dataset but also to do some analysis on what was clustered as duplicates. So as a result I would like to have a table with the columns example_id, cluster_id. Is this possible with the current code? If not, what would be the best place to add that feature?

ChenghaoMou commented 1 year ago

It is possible, but you might need some code changes. For example, in minhash.py each item is assigned a cluster id through the union-find structure uf. You can save the entire dataset ds, which then carries the cluster id, by adding the following right after that step:

# In minhash.py, right after the union-find structure `uf` has been populated:
ds = ds.map(
    function=lambda _, idx: {"__cluster__": uf.find(idx)},  # root id of the example's duplicate cluster
    with_indices=True,
    num_proc=os.cpu_count(),
    new_fingerprint=str(random.getrandbits(128)),  # supply a fingerprint since the lambda can't be hashed for caching
    desc="Finding clusters...",
)
ds.save_to_disk("OUTPUT_DIR")  # the saved dataset now includes the __cluster__ column
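
To get the example_id/cluster_id table asked for above, one option is to load the saved dataset and dump the two columns. A minimal sketch, assuming the dataset was saved as in the snippet above and that the row index can serve as example_id (the clusters.csv filename is just an illustration):

import pandas as pd
from datasets import load_from_disk

ds = load_from_disk("OUTPUT_DIR")

# One row per example: its position in the dataset and its duplicate-cluster id.
df = pd.DataFrame({
    "example_id": range(len(ds)),
    "cluster_id": ds["__cluster__"],
})
df.to_csv("clusters.csv", index=False)

# Rows sharing a cluster_id were grouped as duplicates of one another.
print(df.groupby("cluster_id").size().sort_values(ascending=False).head())
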
konradkalita commented 1 year ago

Thanks, that works for me. Actually, I wanted to ask: do you have any recommendations for setting the MinHash parameters to reduce the space needed to store the hashes? I have a collection of about 200 million short texts that occupies around 10GB of disk space, but running the MinHash pipeline generates about 500GB of hashes, which seems crazy to me. I guess I should tune the num_perm and ngram parameters, but I have no idea what values to use, especially for num_perm.

BTW: Thank you for your work on this project.

ChenghaoMou commented 1 year ago

I would suggest experimenting with a small portion (as much as you can comfortably handle) of your data to figure out some good settings. For short text and limited space, here are some intuitions I have:

  1. You can start with a small number of permutations, like 20 (see the back-of-envelope estimate after this list).
  2. You can actually start with unigrams and increase when needed (e.g. when you see too many false positives in your experiments), but this is unlikely to impact memory/storage.
  3. If you haven't already, removing exact duplicates first (e.g. based on alphabetic content; a sketch follows this list) might help.
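
To make point 1 concrete: the raw signature store grows linearly with num_perm, so cutting the permutation count cuts storage proportionally. A rough back-of-envelope, assuming one 8-byte hash per permutation per document (actual layouts and index overhead differ, so treat this as order-of-magnitude only):

def signature_gb(num_docs, num_perm, bytes_per_hash=8):
    # Raw MinHash signature size in gigabytes, ignoring index overhead.
    return num_docs * num_perm * bytes_per_hash / 1e9

for num_perm in (20, 64, 128, 256):
    print(f"num_perm={num_perm:>3}: ~{signature_gb(200_000_000, num_perm):.0f} GB")
# num_perm= 20: ~32 GB
# num_perm= 64: ~102 GB
# num_perm=128: ~205 GB
# num_perm=256: ~410 GB

And for point 3, a minimal sketch of exact deduplication keyed on alphabetic content. The normalization choice and the "text" column name are assumptions rather than the project's canonical behavior, and the shared set means this must run single-process:

import re

def alpha_key(text):
    # Keep only lowercased alphabetic characters as the dedup key.
    return re.sub(r"[^a-z]", "", text.lower())

seen = set()

def is_first_occurrence(example):
    key = alpha_key(example["text"])  # assumes a "text" column
    if key in seen:
        return False
    seen.add(key)
    return True

# ds = ds.filter(is_first_occurrence)  # keeps one example per exact key; don't pass num_proc here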

Short text is challenging in its own way, and unfortunately I don't have much experience with it. Please feel free to share any issues you encounter; I am more than happy to help.

ChenghaoMou commented 1 year ago

Closing this for now; feel free to open another issue if you have any more questions.