centic9 / CommonCrawlDocumentDownload

A small tool which uses the Common Crawl URL index to download documents with certain file types or MIME types. This is used for mass-testing of frameworks like Apache POI and Apache Tika.
BSD 2-Clause "Simplified" License

Wet and Wat files #12

Closed: burf2000 closed this issue 6 years ago

burf2000 commented 6 years ago

Is there any way to download the WET or WAT files?

centic9 commented 6 years ago

You can just download them with any HTTP client; no specialized functionality is needed as far as I can see.

The master lists are located at https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2018-13/wet.paths.gz and https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2018-13/wat.paths.gz; each is a gzipped text file listing all the files that can be accessed for a given crawl.

See the blog http://commoncrawl.org/connect/blog/ for the latest available crawls and up-to-date links to these files.
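As a rough sketch of "download them with any HTTP client": the snippet below fetches the gzipped paths list for one crawl, decompresses it on the fly, and turns each relative line into an absolute download URL by prefixing the S3 base URL from the comment above. The class and method names are illustrative, not part of this tool.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.zip.GZIPInputStream;

public class WetPathsExample {
    // Base URL taken from the comment above; lines in wet.paths.gz are relative to it.
    static final String BASE = "https://commoncrawl.s3.amazonaws.com/";

    // Turn one relative line from wet.paths.gz into an absolute download URL.
    static String toUrl(String pathLine) {
        return BASE + pathLine.trim();
    }

    public static void main(String[] args) throws Exception {
        // Download and decompress the gzipped list of WET files for one crawl.
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(
                URI.create(BASE + "crawl-data/CC-MAIN-2018-13/wet.paths.gz")).build();
        HttpResponse<java.io.InputStream> response =
                client.send(request, HttpResponse.BodyHandlers.ofInputStream());
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new GZIPInputStream(response.body())))) {
            // Print the first few absolute URLs; each points at one WET archive.
            reader.lines().limit(3).map(WetPathsExample::toUrl).forEach(System.out::println);
        }
    }
}
```

Each URL printed this way can then be downloaded like any other file.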

burf2000 commented 6 years ago

Thank you, and sorry, my question was a bit unclear. What I meant is: is there a way to modify your tool (I am about to fork it) so that instead of writing the page to disk, it writes the corresponding WET or WAT file, which is a lot smaller? I did try changing the URLs it returns (warc), but I think the offsets are then wrong.

centic9 commented 6 years ago

I don't know of a simple way: for the newer indices this tool actually uses the index files (CDX), not WET/WAT, for processing, so there is little code related to handling WAT/WET files, only some ARC/WARC record handling which originally came from elsewhere anyway.
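For context on why the offsets go wrong: a CDX index entry carries the byte offset and compressed length of one record inside a specific WARC file, and the record is fetched with an HTTP Range request over just that slice. Those byte positions are only valid for that WARC file, so keeping them while swapping the URL to the much smaller WET/WAT variant reads the wrong bytes. A minimal sketch of building such a Range header (class and method names, and the sample values, are illustrative):

```java
public class CdxRangeExample {
    // Build an HTTP Range header for one record, given the offset and length
    // fields from a CDX index entry. These byte positions refer to the WARC
    // file the entry was indexed from; they do not apply to the WET/WAT files.
    static String rangeHeader(long offset, long length) {
        // HTTP byte ranges are inclusive, hence the -1 on the end position.
        return "bytes=" + offset + "-" + (offset + length - 1);
    }

    public static void main(String[] args) {
        // Hypothetical offset/length values for illustration only.
        System.out.println(rangeHeader(1234, 5678)); // prints "bytes=1234-6911"
    }
}
```

Fetching a WET record would instead require an index built against the WET files themselves.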

burf2000 commented 6 years ago

Are you OK with me publishing a version of this that looks for a 200 status on websites and stores them in a DB? I will link to this repo.

centic9 commented 6 years ago

No problem.