DavidNemeskey / cc_corpus

Tools for compiling corpora from Common Crawl
GNU Lesser General Public License v3.0

Train / test sampling #63

Closed DavidNemeskey closed 4 months ago

DavidNemeskey commented 4 months ago

Added a train/test set splitter and unified BatchWriter.
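For readers unfamiliar with the idea, here is a minimal sketch of a train/test split over a list of documents. It is not the code from this pull request: the function name `split_train_test` and the parameters `test_ratio` and `seed` are hypothetical, and the actual splitter and `BatchWriter` in cc_corpus may work quite differently.

```python
import random
from typing import List, Sequence, Tuple


def split_train_test(
    items: Sequence[str], test_ratio: float = 0.1, seed: int = 42
) -> Tuple[List[str], List[str]]:
    """Randomly split items into (train, test) lists.

    Hypothetical helper for illustration only; not part of cc_corpus.
    """
    if not 0.0 <= test_ratio <= 1.0:
        raise ValueError("test_ratio must be between 0 and 1")
    # A dedicated Random instance keeps the split reproducible
    # without touching the global random state.
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    # The first n_test shuffled items form the test set, the rest the train set.
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


if __name__ == "__main__":
    docs = [f"doc{i}" for i in range(10)]
    train, test = split_train_test(docs, test_ratio=0.2, seed=1)
    print(len(train), len(test))
```

Fixing the seed makes the split repeatable across runs, which is usually what one wants when the same corpus is sampled more than once.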