Faster download

The current downloader takes the list of pages from the index and downloads them one by one from Common Crawl (CC). As of May 2022, there are approximately 17M pages to download from 80k WARC files. This translates to 17M HTTP requests, each of which asks for a single byte range from a WARC file. The download step took about 54 hours to complete on 250 threads.
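For a rough sense of scale, the figures above can be turned into per-file and per-request numbers. This is only back-of-the-envelope arithmetic on the approximate values quoted in this issue; the variable names are illustrative and not taken from the code base.

```python
# Approximate figures quoted in this issue (May 2022).
pages = 17_000_000      # pages to download, one HTTP request each
warc_files = 80_000     # WARC files those pages live in
hours = 54              # wall-clock time of the download step
threads = 250           # number of download threads

# Average number of pages (and thus requests) per WARC file.
pages_per_warc = pages / warc_files

# Overall request throughput across all threads.
requests_per_second = pages / (hours * 3600)

# Average wall-clock time one thread spends per request.
seconds_per_request = threads * hours * 3600 / pages

print(f"{pages_per_warc:.1f} pages per WARC file")
print(f"{requests_per_second:.1f} requests per second overall")
print(f"{seconds_per_request:.2f} seconds per request per thread")
```

With one request per WARC file, the request count would drop from about 17M to about 80k, a reduction by the pages-per-file factor computed above.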

Since HTTP (and CloudFront) allows a request to specify multiple byte ranges, we could issue one request per WARC file, cutting down the number of requests and, hopefully, the download time substantially. An experiment with a single WARC file (224 URLs) proved the concept: the time required dropped from 1:30 to just over 1 second.
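A minimal sketch of how such a multi-range request could be built, assuming each page is described by an (offset, length) pair taken from the index. The helper names `build_range_header` and `build_warc_request` and the example URL are hypothetical and not part of the existing downloader; the request is only constructed here, not sent.

```python
from urllib.request import Request


def build_range_header(spans):
    """Build an HTTP Range header value from (offset, length) pairs.

    Byte ranges are inclusive on both ends, so a record of `length`
    bytes starting at `offset` ends at `offset + length - 1`.
    """
    parts = []
    for offset, length in sorted(spans):
        if offset < 0 or length <= 0:
            raise ValueError(f"invalid span: ({offset}, {length})")
        parts.append(f"{offset}-{offset + length - 1}")
    return "bytes=" + ",".join(parts)


def build_warc_request(url, spans):
    """Create (but do not send) one request covering all spans of a WARC file."""
    return Request(url, headers={"Range": build_range_header(spans)})


# Hypothetical example: three records in the same WARC file.
spans = [(1000, 500), (0, 100), (4096, 1024)]
request = build_warc_request("https://example.com/path/to/file.warc.gz", spans)
print(request.get_header("Range"))
```

A server that honors all ranges answers with status 206 and a `multipart/byteranges` body, which then has to be split back into the individual records. A server may also ignore or coalesce ranges, and very long `Range` headers may hit header size limits, so a real implementation would probably need to check the response and possibly batch the ranges of a file into several requests.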

DavidNemeskey / cc_corpus

Faster download #23