DavidNemeskey / cc_corpus

Tools for compiling corpora from Common Crawl
GNU Lesser General Public License v3.0

Write a script that incrementally deduplicates indices #28

Closed DavidNemeskey closed 1 year ago

DavidNemeskey commented 1 year ago

... using deduplicate_index_urls.py. That script does the hard work; all that is needed is to call it repeatedly, updating the URL index after each call.
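
A rough sketch of the loop this describes, in plain Python. It does not call deduplicate_index_urls.py and assumes nothing about its command-line interface: `deduplicate_batch` is a hypothetical in-memory stand-in for one run of that script, and `incremental_deduplicate` and the sample URLs are likewise made up for illustration. The point is only the control flow: deduplicate one batch against the URLs seen so far, then fold the survivors into the URL index before the next call.

```python
from typing import Iterable, List, Set, Tuple


def deduplicate_batch(urls: Iterable[str], seen: Set[str]) -> List[str]:
    """Hypothetical stand-in for one run of the real script.

    Keeps the URLs that are neither in the cumulative index (`seen`)
    nor repeated earlier in the same batch, preserving input order.
    """
    kept: List[str] = []
    batch_seen: Set[str] = set()
    for url in urls:
        if url in seen or url in batch_seen:
            continue
        batch_seen.add(url)
        kept.append(url)
    return kept


def incremental_deduplicate(
    batches: Iterable[Iterable[str]],
) -> Tuple[List[List[str]], Set[str]]:
    """Deduplicate the batches in order, updating the URL index after each one."""
    url_index: Set[str] = set()
    results: List[List[str]] = []
    for batch in batches:
        kept = deduplicate_batch(batch, url_index)
        # Update the index *after* the call, so that the next batch
        # is filtered against everything kept so far.
        url_index.update(kept)
        results.append(kept)
    return results, url_index


# Made-up sample data: three index batches with overlapping URLs.
batches = [
    ["a.com/1", "a.com/2", "a.com/1"],
    ["a.com/2", "b.com/1"],
    ["b.com/1", "c.com/1"],
]
deduplicated, url_index = incremental_deduplicate(batches)
```

In the actual tool the per-batch step would presumably be an invocation of the existing script on index files, with the URL index kept on disk rather than in a set.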