facebookresearch / cc_net

Tools to download and cleanup Common Crawl data
MIT License

Change the release format for smaller disk usage #13

Closed gwenzek closed 4 years ago

gwenzek commented 4 years ago

The first version was released with the documents grouped by language. The problem is that Common Crawl itself is not released by language, which forced users to download the web documents out of order and put too much pressure on the disk.

This version stores the metadata in the same order as the original documents. It makes splitting by language a bit more complex, but it should be better overall.
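
To illustrate the trade-off, here is a rough sketch of how splitting by language could look once the metadata follows the original document order. It is not the code from this PR: the JSON-lines layout, the `language` and `url` field names, and the helpers `iter_metadata` and `split_by_language` are all assumptions made for the example.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List


def iter_metadata(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one metadata record per non-empty JSON line, preserving input order.

    Assumption: metadata is stored as JSON lines; the real format may differ.
    """
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def split_by_language(
    records: Iterable[dict], key: str = "language"
) -> Dict[str, List[dict]]:
    """Group records by language in a single sequential pass.

    Assumption: each record carries its language under `key` (hypothetical name).
    Records inside each group keep their original relative order.
    """
    groups: Dict[str, List[dict]] = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)
    return dict(groups)


# Hypothetical metadata, listed in the same order as the source documents.
sample = [
    '{"url": "http://a.example", "language": "en"}',
    '{"url": "http://b.example", "language": "fr"}',
    '{"url": "http://c.example", "language": "en"}',
]

groups = split_by_language(iter_metadata(sample))
for lang, recs in groups.items():
    print(lang, [r["url"] for r in recs])
```

The source is read once, front to back, and the grouping happens after the read. That is the extra step the new format needs, in exchange for sequential access to the original files.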