Open HungHoangDinh opened 1 month ago
You can do so by modifying the code for filtering (e.g. in minhash.py)
Duplicates are clustered, and by default only the record whose index equals the cluster id is kept. A union-find object is used to look up the cluster id for each index. You can save the uf object and the dataset partway through the pipeline so that you can inspect them.
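A minimal sketch of what that inspection could look like. The `UnionFind` class below is a stand-in for the one text-dedup uses internally in minhash.py (the real API may differ); the point is just to show how grouping every index by its root recovers, for each discarded record, which cluster it fell into and which kept record it matched:

```python
from collections import defaultdict


class UnionFind:
    """Stand-in union-find; text-dedup's own implementation may differ."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        # Path-halving find: each index resolves to its cluster root (cluster id).
        if x not in self.parent:
            self.parent[x] = x
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, x, y):
        self.parent[self.find(x)] = self.find(y)


def clusters(uf, indices):
    # Group every index by its cluster id (the root of its set).
    groups = defaultdict(list)
    for i in indices:
        groups[uf.find(i)].append(i)
    return groups


# Toy example: indices 0..4, where 1 and 3 were found to duplicate 0.
uf = UnionFind()
uf.union(1, 0)
uf.union(3, 0)
grouped = clusters(uf, range(5))
# grouped maps cluster id -> member indices. Only the member whose
# index equals the cluster id is kept; the other members of each
# group are the discarded duplicates, matched against that kept record.
```

If you pickle the real uf object halfway through the run, the same grouping loop over `uf.find(i)` for all dataset indices answers "which kept sentence did each eliminated sentence match".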
I want to check which sentences in the dataset the eliminated sentences matched. Can you help me do this?
My dataset has 700 thousand sentences, and after running text-dedup only 600 sentences remain. I want to retain more sentences, so I increased the minhash threshold and num_perm, but the results did not change. Can you help me solve this problem?
Here is some information you could provide to facilitate this conversation:
I have deduplicated my dataset. However, I cannot tell which data was discarded or which data it overlapped with. How can I check? I hope you can help!