The first version was released with the documents grouped by language.
The problem is that Common Crawl is not released by language, which forced the web documents to be downloaded out of order and put too much pressure on the disk.
This version stores the metadata in the same order as the original documents.
It makes splitting by language a bit more complex, but it should be better overall, since the documents can be read in the order in which Common Crawl provides them.
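
Since the metadata is now stored in crawl order, splitting by language becomes a grouping step done while reading. Below is a minimal sketch of that step, not the actual tooling: the record layout and the field names `url` and `language` are assumptions made for illustration, as is the helper name `split_by_language`.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def split_by_language(records: Iterable[dict]) -> Dict[str, List[dict]]:
    """Group metadata records by language in a single sequential pass.

    The "language" key is a hypothetical field name. Within each group,
    records keep the order in which they appeared in the input.
    """
    by_lang: Dict[str, List[dict]] = defaultdict(list)
    for record in records:
        by_lang[record["language"]].append(record)
    return dict(by_lang)


# Hypothetical metadata records, listed in crawl order.
records = [
    {"url": "http://a.example", "language": "en"},
    {"url": "http://b.example", "language": "fr"},
    {"url": "http://c.example", "language": "en"},
]

groups = split_by_language(records)
for lang, docs in groups.items():
    print(lang, [d["url"] for d in docs])
```

The input is read once, front to back, so the web documents never have to be fetched out of order. For a full crawl the groups would probably be streamed to one output file per language rather than held in memory.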