facebookresearch / cc_net

Tools to download and cleanup Common Crawl data
MIT License

Dedup all paragraphs if they appear more than once? #8

Closed xingenju closed 4 years ago

xingenju commented 4 years ago

E.g., if "it is an issue about cc_net" is a paragraph and it appears three times, then because the NativeHashSet stores a value of 1 for this key, all 3 paragraphs will be dropped. Why not keep one copy?

gwenzek commented 4 years ago

The decision may seem weird, but when looking at the data it appeared that most duplicate paragraphs were either:

Therefore I chose not to include them. But I agree this decision isn't really documented in the paper, and it could be worth some experiments.
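
For readers comparing the two policies discussed in this thread, here is a minimal toy sketch. It is not cc_net's actual code: the helper names (`paragraph_key`, `drop_all_duplicates`, `keep_first_copy`), the normalization, and the use of a plain Python `Counter` in place of the NativeHashSet are all assumptions made for illustration.

```python
import hashlib
from collections import Counter


def paragraph_key(paragraph: str) -> bytes:
    """Hash a lightly normalized paragraph (hypothetical normalization)."""
    normalized = " ".join(paragraph.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).digest()[:8]


def drop_all_duplicates(paragraphs):
    """Policy described in this thread: a paragraph seen more than once
    is removed entirely, so no copy survives."""
    counts = Counter(paragraph_key(p) for p in paragraphs)
    return [p for p in paragraphs if counts[paragraph_key(p)] == 1]


def keep_first_copy(paragraphs):
    """Alternative asked about in the question: keep the first occurrence
    of each paragraph and drop only the later repeats."""
    seen = set()
    kept = []
    for p in paragraphs:
        key = paragraph_key(p)
        if key not in seen:
            seen.add(key)
            kept.append(p)
    return kept


paragraphs = [
    "it is an issue about cc_net",
    "some unique text",
    "it is an issue about cc_net",
    "it is an issue about cc_net",
]
print(drop_all_duplicates(paragraphs))  # ['some unique text']
print(keep_first_copy(paragraphs))  # ['it is an issue about cc_net', 'some unique text']
```

The only difference between the two is whether the first occurrence survives, which is the choice that could be worth comparing experimentally.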

gwenzek commented 4 years ago

Closing, feel free to reopen if I haven't fully answered your questions.