DavidNemeskey / cc_corpus

Tools for compiling corpora from Common Crawl
GNU Lesser General Public License v3.0

No cookie #55

Closed DavidNemeskey closed 1 year ago

DavidNemeskey commented 1 year ago

The autonomous_cross_deduplicator2 is not ideal, since it duplicates many lines of code. At a later date we should merge the two versions, with an input parameter telling the method how it should function (read everything into memory, or not).

Yes, definitely; it is just a stopgap measure. The final version should be able to read a specified number of directories (or all, or just one) into memory at the same time, to accommodate the resources available on the machine it runs on.
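
A rough sketch of how the "specified number of directories" parameter could work, in case it helps the discussion. This is not code from cc_corpus: the function name `batches` and the `batch_size` parameter are made up for illustration, and the real deduplicator would still have to load and compare the contents of each batch.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar('T')


def batches(items: Iterable[T],
            batch_size: Optional[int] = None) -> Iterator[List[T]]:
    """
    Yields lists of at most batch_size items (hypothetical helper).

    batch_size=None puts everything into a single batch ("read all to
    memory"); batch_size=1 yields one item at a time.
    """
    if batch_size is not None and batch_size < 1:
        raise ValueError('batch_size must be a positive integer or None')
    it = iter(items)
    while True:
        # islice(it, None) consumes the rest of the iterator
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch
```

With something like this, a single merged script could loop over `batches(directories, batch_size)` and hold only one batch in memory at a time, so `batch_size=None` would cover the current "read all" behaviour and `batch_size=1` the low-memory one.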