ChenghaoMou / text-dedup

All-in-one text de-duplication
Apache License 2.0

the ngram setting of minhash #17

Closed liujuncn closed 1 year ago

liujuncn commented 1 year ago

I built a very small data set to test minhash deduplication, the data is as follows:

{'text': 'Farm to Market Road 1506 (FM 1506) is located in Lamar County.'}
{'text': 'Farm to Market Road 1507 (FM 1507) is located in Lamar County.'}
{'text': 'Farm to Market Road 1508 (FM 1508) is located in Lamar County.'}
{'text': 'Farm to Market Road 1514 (FM 1514) is located in San Jacinto County.'}
{'text': 'Farm to Market Road 1511 (FM 1511) is located in Leon County.'}
{'text': 'Farm to Market Road 1512 (FM 1512) is located in Leon County.'}
{'text': 'Farm to Market Road 1513 (FM 1513) is located in Rusk County.'}
{'text': 'Farm to Market Road 1503 (FM 1503) is located in Lamar County.'}
{'text': 'Farm to Market Road 1504 (FM 1504) is located in Van Zandt County.'}
{'text': 'Farm to Market Road 1505 (FM 1505) was located in El Paso County.'}

When using the default settings, the output is identical to the input. It did not start removing duplicates until I set ngram to 2:

{'text': 'Farm to Market Road 1506 (FM 1506) is located in Lamar County.'}
{'text': 'Farm to Market Road 1508 (FM 1508) is located in Lamar County.'}
{'text': 'Farm to Market Road 1514 (FM 1514) is located in San Jacinto County.'}
{'text': 'Farm to Market Road 1511 (FM 1511) is located in Leon County.'}
{'text': 'Farm to Market Road 1513 (FM 1513) is located in Rusk County.'}
{'text': 'Farm to Market Road 1504 (FM 1504) is located in Van Zandt County.'}
{'text': 'Farm to Market Road 1505 (FM 1505) was located in El Paso County.'}

With ngram set to 2, is that reasonable for most cases? Are there other details I haven't noticed?

liujuncn commented 1 year ago

I successfully ran text deduplication based on your code, like this:

output_ds = dedup(input_ds, params)

Now I have another question I'd like advice on. Currently, deduplication compares samples across the dataset. How can I deduplicate text within a sample? For example, news reports always start with the name of a certain news organization. In that case, we want to delete it; otherwise, when training a language model, it becomes a prompt phrase.

ChenghaoMou commented 1 year ago

With ngram set to 2, is that reasonable for most cases? Are there other details I haven't noticed?

There are no universally applicable values for those parameters, unfortunately. I recommend experimenting with an actual subset of your data of a reasonable size (>10,000 documents) to find proper settings. Short text, in general, will benefit from a small ngram size.
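To see why short text needs a small ngram size, it helps to look at the raw Jaccard similarity that MinHash approximates. The following is a minimal sketch (not the library's code; `ngrams` and `jaccard` are hypothetical helpers, using word-level shingles) on two of the example sentences above:

```python
# Hypothetical illustration of how ngram size changes the Jaccard
# similarity between two near-duplicate short sentences.

def ngrams(text: str, n: int) -> set:
    """Word-level n-gram shingles of a sentence."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a: set, b: set) -> float:
    """Exact Jaccard similarity of two shingle sets."""
    return len(a & b) / len(a | b)

s1 = "Farm to Market Road 1506 (FM 1506) is located in Lamar County."
s2 = "Farm to Market Road 1507 (FM 1507) is located in Lamar County."

for n in (2, 5):
    sim = jaccard(ngrams(s1, n), ngrams(s2, n))
    print(f"n={n}: Jaccard = {sim:.2f}")
```

With n=2 the two sentences share most of their bigrams, so the similarity is high enough to clear a typical threshold; with n=5 a single differing road number breaks almost every shingle, and the pair no longer looks like a duplicate. The shorter the text, the larger the share of shingles a one-token edit destroys.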

Please feel free to change the code. It is meant to be modified for different datasets or use-cases.

How can I deduplicate text within a sample? For example, news reports always start with the name of a certain news organization. In that case, we want to delete it; otherwise, when training a language model, it becomes a prompt phrase.

You can look into suffix-array substring deduplication, which removes duplicated substrings across the whole text dataset. It is useful when you have a lot of template-based documents.

You can also flatten the document into paragraphs and do paragraph-level deduplication.
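The paragraph-level approach can be sketched as follows. This is not part of text-dedup; it is a minimal exact-match version (the hypothetical `dedup_paragraphs` helper is for illustration), which already handles the repeated news-organization header case:

```python
# Hypothetical sketch of paragraph-level exact deduplication: split each
# document into paragraphs, drop any paragraph whose normalized form has
# already been seen anywhere in the corpus, then reassemble the document.

def dedup_paragraphs(docs: list[str]) -> list[str]:
    seen: set[str] = set()
    out = []
    for doc in docs:
        kept = []
        for para in doc.split("\n\n"):
            # Normalize whitespace and case so trivial variants match.
            key = " ".join(para.split()).lower()
            if key and key not in seen:
                seen.add(key)
                kept.append(para)
        out.append("\n\n".join(kept))
    return out

docs = [
    "REUTERS NEWS\n\nStory one body.",
    "REUTERS NEWS\n\nStory two body.",
]
print(dedup_paragraphs(docs))
```

In practice you would replace the exact-match set with MinHash at the paragraph level to also catch near-duplicate boilerplate, but the flatten-then-filter structure stays the same.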

liujuncn commented 1 year ago

@ChenghaoMou I can't find scripts/make_suffix_array.py in the repo.

__run_command(
    f"python scripts/make_suffix_array.py {temp_text}",
    args.google_repo_path,
)

liujuncn commented 1 year ago

I added it from another repo, but I get an error:

[screenshot of the error]

ChenghaoMou commented 1 year ago

To run the Google repo, you need to configure the Rust dependencies first. Please follow their setup instructions.

ChenghaoMou commented 1 year ago

Closing this for now; feel free to open a new issue if you have any other questions.