Marlin-Na / CommonCrawlDL


Sampling Common Crawl WET records #1

Open · sebastian-nagel opened this issue 5 years ago

sebastian-nagel commented 5 years ago

Hi @Marlin-Na,

while searching for examples how Common Crawl data is used, I stumbled over this nice project and just looked at the following comments: https://github.com/Marlin-Na/CommonCrawlDL/blob/aadf36f45d99f1f563d96baa6ae4267e7de17c5a/ccdownload.py#L18

Common Crawl WARC files (also WAT and WET) are already sampled by distributing records over files using a pseudo-random hash of the URL. Of course, it's fine to sample again (to get a more random sample). However, downloading 1000 files and using only 0.25% of their content seems like a waste of resources.
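For example, one could instead fetch the WET path listing of a crawl and process a random subset of whole files. A minimal sketch (the crawl ID and the data.commoncrawl.org base URL below are assumptions; substitute the crawl you actually want):

import gzip
import random
import urllib.request

# path listing of all WET files in one crawl (crawl ID is a placeholder)
PATHS_URL = ('https://data.commoncrawl.org/'
             'crawl-data/CC-MAIN-2019-09/wet.paths.gz')

with urllib.request.urlopen(PATHS_URL) as resp:
    paths = gzip.decompress(resp.read()).decode('utf-8').splitlines()

# keep e.g. 10 of the tens of thousands of WET files and process those in full
wet_urls = ['https://data.commoncrawl.org/' + p
            for p in random.sample(paths, 10)]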

WET files are a special form of WARC files, so you may want to use a WARC library to do the parsing. I'd recommend warcio. Processing WET files could be done like this:

from warcio.archiveiterator import ArchiveIterator

with open('path/to/file.wet.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        # the extracted plain text is stored in 'conversion' records
        if record.rec_type == 'conversion':
            url = record.rec_headers.get_header('WARC-Target-URI')
            text = record.content_stream().read().decode('utf-8')
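
The same iterator also accepts a streamed HTTP response, so a WET file never has to be written to disk first. A minimal sketch (the WET path below is a placeholder):

import requests
from warcio.archiveiterator import ArchiveIterator

wet_url = ('https://data.commoncrawl.org/'
           'crawl-data/CC-MAIN-2019-09/segments/.../file.warc.wet.gz')

# stream the response; ArchiveIterator handles the gzip decompression
with requests.get(wet_url, stream=True) as resp:
    for record in ArchiveIterator(resp.raw):
        if record.rec_type == 'conversion':
            url = record.rec_headers.get_header('WARC-Target-URI')
            text = record.content_stream().read().decode('utf-8')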

Best, Sebastian

Marlin-Na commented 5 years ago

Hi Sebastian,

Thanks a lot for raising the issue and suggesting warcio. I thought the files were ordered alphabetically by URL. I will update my script later.

Best.

sebastian-nagel commented 5 years ago

Inside a single file (WARC/WAT/WET), records are ordered by URL, but the set of page captures contained in one file is random; see this discussion.
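
If a smaller sample is still needed, records can also be thinned while streaming a file, e.g. with a random keep rate. A sketch (the 0.25% rate just mirrors the original script):

import random
from warcio.archiveiterator import ArchiveIterator

RATE = 0.0025  # keep roughly 0.25% of the text records

with open('path/to/file.wet.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == 'conversion' and random.random() < RATE:
            url = record.rec_headers.get_header('WARC-Target-URI')
            text = record.content_stream().read().decode('utf-8')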