DavidNemeskey / cc_corpus

Tools for compiling corpora from Common Crawl
GNU Lesser General Public License v3.0

Get rid of bootstrapping #11

Open DavidNemeskey opened 4 years ago

DavidNemeskey commented 4 years ago

Make the code simpler by getting rid of the bootstrapping. The tool will then only work when a corpus is created from scratch, but the code will be much simpler and the deduplication much less memory-hungry.