ChenghaoMou / text-dedup

All-in-one text de-duplication
Apache License 2.0
588 stars 69 forks

text-dedup #96

Open HungHoangDinh opened 1 month ago

HungHoangDinh commented 1 month ago

I have deduplicated my dataset. However, I cannot tell which records were discarded and which records they overlap with. How can I check? Hope you can help me!

ChenghaoMou commented 1 month ago

You can do so by modifying the filtering code (e.g. in minhash.py).

Duplicates are clustered, and by default only the record whose index == cluster id is kept. A union-find object is used to find the corresponding cluster id for each index. You can save the uf object and the dataset partway through the run so that you can debug them.
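A minimal sketch of the idea, using a self-contained union-find rather than text-dedup's own class (the real one in the repo is similar but has more features). The indices and unions below are made up for illustration:

```python
from collections import defaultdict

# Minimal union-find with path halving; stands in for text-dedup's uf object.
class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        # Return the root (cluster id) of x, flattening the path as we go.
        if x not in self.parent:
            self.parent[x] = x
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, x, y):
        self.parent[self.find(x)] = self.find(y)

uf = UnionFind()
# Suppose indices 0, 3, and 7 were flagged as near-duplicates of each other.
uf.union(0, 3)
uf.union(3, 7)

# Group every dataset index by its cluster id (the root of its component).
clusters = defaultdict(list)
for idx in [0, 1, 2, 3, 7]:
    clusters[uf.find(idx)].append(idx)

# Only the record whose index equals its cluster id survives filtering;
# the other members of each cluster are the discarded duplicates.
for cluster_id, members in clusters.items():
    kept = [i for i in members if i == cluster_id]
    dropped = [i for i in members if i != cluster_id]
    print(cluster_id, "kept:", kept, "dropped:", dropped)
```

Saving the uf object (e.g. with pickle) alongside the intermediate dataset lets you rebuild this mapping after the fact.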

HungHoangDinh commented 1 month ago

I want to check which sentences in the dataset the eliminated sentences matched. Can you help me do this?

ChenghaoMou commented 1 month ago

You just need to save the dataset after this line, assuming you are familiar with Huggingface's datasets library.

The saved dataset will contain the cluster id. You can iterate over each cluster to see the duplicates.
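A sketch of that iteration, using plain Python records standing in for the saved Huggingface dataset; the cluster column name and example sentences here are illustrative, so adjust them to match what minhash.py actually writes:

```python
from collections import defaultdict

# Stand-in for the dataset saved after the filtering step: each record
# carries its cluster id (column name "__cluster__" is an assumption).
records = [
    {"text": "the cat sat", "__cluster__": 0},
    {"text": "the cat sat.", "__cluster__": 0},
    {"text": "hello world", "__cluster__": 2},
    {"text": "hello, world", "__cluster__": 2},
    {"text": "a unique sentence", "__cluster__": 4},
]

# Bucket texts by cluster id.
groups = defaultdict(list)
for record in records:
    groups[record["__cluster__"]].append(record["text"])

# Clusters with more than one member are the groups of near-duplicates;
# everything except the kept representative is what was eliminated.
duplicates = {cid: texts for cid, texts in groups.items() if len(texts) > 1}
for cluster_id, texts in duplicates.items():
    print(cluster_id, texts)
```

With a real Huggingface dataset the loop body is the same; you would just load it back with `load_from_disk` and iterate the rows.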

HungHoangDinh commented 1 month ago

My text set has 700 thousand sentences. After running text-dedup, only 600 sentences remain. I want to increase the number of sentences retained, so I increased the MinHash threshold and num_perm, but the results did not change. Can you help me solve this problem?

ChenghaoMou commented 1 month ago

Here is some information you could provide to facilitate this conversation:

  1. Have you tried the suggestions I provided above? If so, what issues did you see from the results?
  2. Have you read the code to make sure it suits your dataset?
  3. Could you provide more information about your dataset (such as example records, language, and length) and the command you used, so I can reproduce the issue?