facebookresearch / cc_net

Tools to download and cleanup Common Crawl data
MIT License

Variance of hash file sizes in newer crawls #27

Open var926 opened 3 years ago

var926 commented 3 years ago

Hello, I noticed that the hash files I produced from the January 21 dump (and from several other months in 2020) are much smaller (100x) than the hashes from the April and May 2019 dumps, even though the original WET files were the same size.

In both cases there are 2 shards per hash file, and all other parameters are the same.

I'm trying to understand why, thanks :)
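
For anyone who wants to put a number on the size gap, here is a minimal sketch that compares the median file size in two hash directories. The directory paths and the `*.bin` pattern are assumptions for illustration, not taken from this report; adjust them to wherever your hash files actually live.

```python
from pathlib import Path
from statistics import median


def summarize_sizes(directory, pattern="*.bin"):
    """Return (file count, median size in bytes) for files matching pattern."""
    sizes = [p.stat().st_size for p in Path(directory).glob(pattern) if p.is_file()]
    return len(sizes), (median(sizes) if sizes else 0)


def size_ratio(dir_a, dir_b, pattern="*.bin"):
    """Return median size in dir_a divided by median size in dir_b."""
    _, med_a = summarize_sizes(dir_a, pattern)
    _, med_b = summarize_sizes(dir_b, pattern)
    # Avoid dividing by zero when dir_b is empty or holds only empty files.
    return med_a / med_b if med_b else float("inf")
```

Calling `size_ratio("hashes/old_dump", "hashes/new_dump")` (both paths hypothetical) would return a value near 100 if the newer hashes really are about 100 times smaller.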

chirico85 commented 2 years ago

Same here, but for dump 22-05 :) Each of my *_log.err files reaches 5GB in size, repeatedly showing this message (which might be the reason for the small hash sizes): "Can't parse header:". It is probably related to #16. I have not found a solution yet.
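
To check how often that message appears without opening multi-gigabyte logs in an editor, a rough sketch like the one below may help. It streams each `*_log.err` file line by line and counts the lines containing the message. The `log_dir` argument is an assumption; point it at wherever your logs are written.

```python
from pathlib import Path

MESSAGE = "Can't parse header:"


def count_header_errors(log_dir, pattern="*_log.err", message=MESSAGE):
    """Return {log file name: number of lines containing message}."""
    counts = {}
    for path in sorted(Path(log_dir).glob(pattern)):
        n = 0
        # Read line by line so a 5GB log never has to fit in memory.
        with open(path, "r", errors="replace") as f:
            for line in f:
                if message in line:
                    n += 1
        counts[path.name] = n
    return counts
```

If the counts are roughly on the order of the number of documents in the corresponding shards, that would support the guess that most records are being dropped at the header-parsing step.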