DavidNemeskey / cc_corpus

Tools for compiling corpora from Common Crawl
GNU Lesser General Public License v3.0

Deduplication improvements #34

Closed acheronw closed 1 year ago

acheronw commented 1 year ago

Wrote a script that deduplicates each batch against every earlier batch, using the minhashes. Batches are ordered by treating their directory names as dates.
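To illustrate the idea, here is a minimal sketch, not the script from this PR. It assumes each batch is already loaded as a list of `(doc_id, signature)` pairs with non-empty, equal-length MinHash signatures, and that batch directory names sort chronologically as plain strings. The names `cumulative_dedup` and `estimated_jaccard`, and the `0.9` threshold, are hypothetical. It compares signatures by brute force rather than with an LSH index, so it only shows the logic, not something that scales.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory representation: a MinHash signature is a tuple of ints,
# and a batch is a list of (document id, signature) pairs.
Signature = Tuple[int, ...]
Batch = List[Tuple[str, Signature]]


def estimated_jaccard(a: Signature, b: Signature) -> float:
    """Fraction of positions where two equal-length signatures agree."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def cumulative_dedup(batches: Dict[str, Batch],
                     threshold: float = 0.9) -> Dict[str, List[str]]:
    """Keep only documents with no near-duplicate in any earlier batch.

    Batch names are sorted as strings, which matches chronological order
    only for date-like names such as '2019-12' (an assumption).
    """
    seen: List[Signature] = []          # signatures kept from earlier batches
    kept: Dict[str, List[str]] = {}
    for name in sorted(batches):
        kept[name] = []
        new_signatures: List[Signature] = []
        for doc_id, sig in batches[name]:
            # Brute-force comparison against earlier batches only;
            # deduplication within a batch is not handled here.
            if any(estimated_jaccard(sig, old) >= threshold for old in seen):
                continue
            kept[name].append(doc_id)
            new_signatures.append(sig)
        # Documents of this batch become "earlier" only once it is finished.
        seen.extend(new_signatures)
    return kept
```

For example, with batches `"2019-12"` and `"2020-05"`, a document in `"2020-05"` whose signature equals one kept in `"2019-12"` is dropped, while the `"2019-12"` copy is retained.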

DavidNemeskey commented 1 year ago

@acheronw Please go through the unresolved threads and resolve them if you think you have addressed the associated comments. GitHub won't allow the merge if any remain unresolved.

The main outstanding issue is that we also need the script that actually runs the whole cumulative deduplication.