DavidNemeskey / cc_corpus

Tools for compiling corpora from Common Crawl
GNU Lesser General Public License v3.0

Faster download #23

Closed DavidNemeskey closed 1 year ago

DavidNemeskey commented 2 years ago

The current downloader takes the list of pages from the index and downloads them one by one from CC. As of May 2022, there are approximately 17M pages to download from 80k WARC files. This translates to 17M HTTP requests, each of which asks for a single byte range from a WARC file. The download step took about 54 hours to complete on 250 threads.

Since HTTP (and CloudFront) allow a request to specify multiple byte ranges, we could do one request per WARC file, cutting down the number of requests and, hopefully, the download time substantially. An experiment with a single WARC file (224 URLs) proved the concept: the time required dropped from 1:30 to just over 1 second.
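
To illustrate the idea, here is a minimal sketch of how the byte ranges for all pages in one WARC file could be folded into a single HTTP `Range` header. The function name `build_range_header` and the `(offset, length)` record format are assumptions made for this example, not the code used in this repository.

```python
from typing import Iterable, Tuple


def build_range_header(records: Iterable[Tuple[int, int]]) -> str:
    """Build a multi-range HTTP Range header value.

    `records` is a hypothetical list of (offset, length) pairs, one per
    page stored in the same WARC file, as they might come from the index.
    """
    parts = []
    # Sort by offset so the ranges are requested in file order.
    for offset, length in sorted(records):
        # HTTP byte ranges are inclusive on both ends.
        parts.append(f"{offset}-{offset + length - 1}")
    return "bytes=" + ",".join(parts)


# Two pages from the same WARC file become one header value.
header = build_range_header([(100, 50), (0, 10)])
print(header)  # bytes=0-9,100-149
```

The resulting string would be sent as the value of the `Range` header in one GET request per WARC file. A server that honors all the ranges answers with `206 Partial Content` and a `multipart/byteranges` body, which the client then has to split back into individual records. Servers may cap the header size or the number of ranges, so a very long list might need to be split into several requests.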

DavidNemeskey commented 1 year ago

Resolved via #27.