DavidNemeskey / cc_corpus

Tools for compiling corpora from Common Crawl
GNU Lesser General Public License v3.0

Domain-level paragraph deduplication #3

Closed DavidNemeskey closed 5 years ago

DavidNemeskey commented 5 years ago

This pull request performs domain-level deduplication of frequent paragraphs, and possibly much more: it kind of got out of hand...
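
For readers unfamiliar with the idea, here is a minimal sketch of what domain-level deduplication of frequent paragraphs could look like. It is not the code in this pull request: the function names (`domain_of`, `drop_frequent_paragraphs`), the `min_count` threshold, the use of exact string matching, and the choice to drop every copy of a frequent paragraph are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from urllib.parse import urlparse


def domain_of(url):
    """Return the lowercased host name of a URL (a crude stand-in for 'domain')."""
    return urlparse(url).netloc.lower()


def drop_frequent_paragraphs(documents, min_count=2):
    """Remove paragraphs that recur across documents of the same domain.

    documents: list of (url, [paragraph, ...]) pairs.
    min_count: hypothetical threshold; a paragraph seen in at least this many
               documents of one domain is treated as boilerplate.
    Returns a list of the same shape with the frequent paragraphs removed.
    """
    # First pass: per domain, count in how many documents each paragraph occurs.
    # set() ensures a paragraph repeated inside one document is counted once.
    counts = defaultdict(Counter)
    for url, paragraphs in documents:
        counts[domain_of(url)].update(set(paragraphs))

    # Second pass: keep only paragraphs below the threshold in their own domain.
    deduplicated = []
    for url, paragraphs in documents:
        domain_counts = counts[domain_of(url)]
        kept = [p for p in paragraphs if domain_counts[p] < min_count]
        deduplicated.append((url, kept))
    return deduplicated


documents = [
    ("http://example.com/a", ["Cookie notice", "Article A"]),
    ("http://example.com/b", ["Cookie notice", "Article B"]),
    ("http://other.org/x", ["Cookie notice", "Article X"]),
]
result = drop_frequent_paragraphs(documents)
```

In this toy input, "Cookie notice" is removed from both `example.com` pages but kept on `other.org`, where it occurs only once. A real pipeline would likely need near-duplicate matching and a decision on whether to keep one copy of each frequent paragraph; neither is specified in the description above.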