DavidNemeskey / cc_corpus

Tools for compiling corpora from Common Crawl
GNU Lesser General Public License v3.0

Batch indexer #29

Closed: DavidNemeskey closed this issue 1 year ago

DavidNemeskey commented 1 year ago

Added a new script, batch_deduplicate_index_urls.py, which walks through the directories inside the index_filtered directory and deduplicates them iteratively, one directory after another.
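For readers who want a picture of what such a batch pass might look like, here is a minimal sketch. It is not the code of batch_deduplicate_index_urls.py. The function name `batch_deduplicate`, the plain-text index format with the URL as the first whitespace-separated field, the alphabetical processing order, and the choice to carry the set of seen URLs from one directory to the next are all assumptions made for illustration.

```python
from pathlib import Path


def batch_deduplicate(input_root: Path, output_root: Path) -> int:
    """Deduplicate index lines by URL across the subdirectories of input_root.

    Hypothetical helper: assumes plain-text index files whose first
    whitespace-separated field is the URL. Returns the number of dropped lines.
    """
    seen = set()   # URLs kept so far, shared across all directories (assumption)
    dropped = 0
    # Process the batch directories in a stable (alphabetical) order.
    for batch_dir in sorted(p for p in input_root.iterdir() if p.is_dir()):
        out_dir = output_root / batch_dir.name
        out_dir.mkdir(parents=True, exist_ok=True)
        for index_file in sorted(p for p in batch_dir.iterdir() if p.is_file()):
            kept = []
            with index_file.open(encoding="utf-8") as inf:
                for line in inf:
                    fields = line.split()
                    if not fields:
                        continue  # skip empty lines
                    url = fields[0]
                    if url in seen:
                        dropped += 1
                    else:
                        seen.add(url)
                        kept.append(line)
            # Write the surviving lines under the same relative name.
            (out_dir / index_file.name).write_text("".join(kept), encoding="utf-8")
    return dropped
```

Under these assumptions, a call such as `batch_deduplicate(Path("index_filtered"), Path("index_dedup"))` keeps the first occurrence of every URL and drops later ones, whether they appear in the same directory or in a later one. The output directory name `index_dedup` is likewise made up for the example.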

DavidNemeskey commented 1 year ago

It works.