Is there any way to download the WET or WAT files?

Closed: burf2000 closed this issue 6 years ago.
You can simply download them with any HTTP client; no specialized functionality is needed as far as I can see.
The master lists are located at https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2018-13/wet.paths.gz and https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2018-13/wat.paths.gz; each is a gzipped text file listing all the WET or WAT files that can be downloaded for a given crawl.
See the blog http://commoncrawl.org/connect/blog/ for the latest available crawls and up-to-date links to these files.
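For illustration, here is a minimal sketch of that approach, assuming Java 11+'s built-in HttpClient (any HTTP client works): it downloads the wet.paths.gz listing mentioned above and prints the first few download URLs.

```java
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.zip.GZIPInputStream;

public class ListWetPaths {
    public static void main(String[] args) throws Exception {
        // Paths file for the CC-MAIN-2018-13 crawl (URL taken from the comment above)
        String pathsUrl = "https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2018-13/wet.paths.gz";

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(pathsUrl)).build();
        HttpResponse<InputStream> response =
                client.send(request, HttpResponse.BodyHandlers.ofInputStream());

        // The paths file is a gzipped text file with one relative WET path per line
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new GZIPInputStream(response.body())))) {
            reader.lines()
                    .limit(5)
                    // Prepend the data host to turn the relative path into a download URL
                    .map(path -> "https://commoncrawl.s3.amazonaws.com/" + path)
                    .forEach(System.out::println);
        }
    }
}
```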
Thank you, and sorry, my question was a bit poorly worded. What I meant is: is there a way to modify your tool (I am about to fork it) so that, instead of writing the full page to disk, it writes the WET or WAT record for it, which is a lot smaller? I did try changing the URLs it returns (WARC), but I think the offsets are then wrong.
I don't know of a simple way. For the newer crawls this tool actually uses the index files (CDX) rather than WET/WAT files for processing, so there is little code related to handling WAT/WET files, only some ARC/WARC record handling, which originally came from elsewhere anyway.
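To make the offset issue concrete: a CDX index entry points at a byte range (filename, offset, length) inside a specific WARC file, and those byte ranges do not line up with the corresponding WET/WAT files, which is why simply rewriting the URLs breaks. Below is a minimal sketch of how such an index entry is normally used, assuming Java 11+'s HttpClient; the path and offsets in main are hypothetical placeholders.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class FetchWarcRecord {

    // Fetch the bytes of a single record, given the "filename", "offset" and "length"
    // fields of a CDX index entry. These offsets are only valid for the WARC file the
    // index points to, not for the corresponding WET/WAT file.
    static byte[] fetchRecord(String filename, long offset, long length) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(
                        URI.create("https://commoncrawl.s3.amazonaws.com/" + filename))
                // HTTP Range request: download only the bytes of this one record
                .header("Range", "bytes=" + offset + "-" + (offset + length - 1))
                .build();

        HttpResponse<byte[]> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofByteArray());

        // Common Crawl records are individually gzipped, so the returned range can be
        // decompressed and parsed as a standalone WARC record.
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical values; in practice they come from a CDX index lookup.
        byte[] record = fetchRecord("crawl-data/CC-MAIN-2018-13/segments/example.warc.gz", 0, 1024);
        System.out.println("Fetched " + record.length + " bytes");
    }
}
```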
Are you OK with me uploading a version of this that looks for websites with a 200 status and stores them in a database? I will link back to this repo.
No problem.
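If it helps the fork, here is a rough sketch of how one could filter index entries for HTTP 200 responses before storing them, assuming the cdx-*.gz index line layout of a SURT key, a timestamp, and a JSON blob containing a "status" field (worth double-checking against the actual files); the database part is left out.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.util.zip.GZIPInputStream;

public class FilterStatus200 {
    public static void main(String[] args) throws Exception {
        // Path to a locally downloaded cdx index file (passed on the command line)
        String cdxFile = args[0];

        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new FileInputStream(cdxFile))))) {
            reader.lines()
                    // Crude check for successfully fetched pages; a real implementation
                    // should parse the JSON part of the line instead of string matching.
                    .filter(line -> line.contains("\"status\": \"200\""))
                    // Here the matching entries would be inserted into the database;
                    // printing them keeps the sketch self-contained.
                    .forEach(System.out::println);
        }
    }
}
```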