sebastian-nagel opened 5 years ago
Hi Sebastian,
Thanks a lot for mentioning the issue and suggesting warcio. I thought the files were ordered alphabetically by URL; I'll update my script later.
Best.
Inside a single file (WARC/WAT/WET), records are ordered by URL. But the set of page captures contained in one file is random; see this discussion.
Hi @Marlin-Na,
While searching for examples of how Common Crawl data is used, I stumbled upon this nice project and just looked at the following comment: https://github.com/Marlin-Na/CommonCrawlDL/blob/aadf36f45d99f1f563d96baa6ae4267e7de17c5a/ccdownload.py#L18
Common Crawl WARC files (also WAT and WET) are already sampled by distributing records over files using a pseudo-random hash of the URL. Of course, it's fine to sample again (to increase randomness). However, downloading 1000 files and using only 0.25% of the content seems like a waste of resources.
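One alternative, sketched below with only the Python standard library, is to sample a few whole WET files from the per-crawl path listing and use each downloaded file in full. The crawl ID and base URL in the sketch are placeholders (not taken from this thread); adjust them to the crawl you need.

```python
import gzip
import random
import urllib.request

BASE = 'https://data.commoncrawl.org/'   # example base URL
CRAWL = 'CC-MAIN-2019-35'                # example crawl ID

# wet.paths.gz lists every WET file of the crawl, one path per line
listing_url = BASE + 'crawl-data/' + CRAWL + '/wet.paths.gz'
with urllib.request.urlopen(listing_url) as resp:
    paths = gzip.decompress(resp.read()).decode('utf-8').splitlines()

# download only a few randomly chosen files and process them in full
for path in random.sample(paths, 3):
    urllib.request.urlretrieve(BASE + path, path.split('/')[-1])
```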
WET files are a special form of WARC files, so you may want to use a WARC library to do the parsing. I'd recommend warcio. Processing WET files could be done like this:
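(A minimal sketch using warcio's ArchiveIterator on a locally downloaded, gzipped WET file; the filename is a placeholder, and ArchiveIterator handles gzipped input transparently.)

```python
from warcio.archiveiterator import ArchiveIterator

# iterate over the records of a (gzipped) WET file
with open('example.warc.wet.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        # the extracted plain text sits in "conversion" records;
        # skip the leading "warcinfo" record
        if record.rec_type != 'conversion':
            continue
        url = record.rec_headers.get_header('WARC-Target-URI')
        text = record.content_stream().read().decode('utf-8', errors='replace')
        # ... work with url and text ...
```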
Best, Sebastian