commoncrawl / news-crawl

News crawling with StormCrawler - stores content as WARC
Apache License 2.0

How to get a listing of WARC/WAT/WET files over HTTP for the News Dataset? #45

Closed. brand17 closed this issue 3 years ago

brand17 commented 3 years ago

I can obtain a file listing for the main Common Crawl dataset from:

https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-09/wet.paths.gz

How can I do the same for the Common Crawl News Dataset?

sebastian-nagel commented 3 years ago

See also the same question on Stack Overflow and the news dataset release announcement. The files can be listed with the AWS CLI:

aws --no-sign-request s3 ls --recursive s3://commoncrawl/crawl-data/CC-NEWS/

or for a subset:

aws --no-sign-request s3 ls --recursive s3://commoncrawl/crawl-data/CC-NEWS/2017/09/
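
Since the question asked for HTTP access, the listing above could be turned into HTTP URLs along these lines. This is only a sketch: the helper name `s3_listing_to_urls` is made up for illustration, it assumes the News files are served from the same host as the URL in the question (`https://commoncrawl.s3.amazonaws.com/`), and it assumes the usual `aws s3 ls --recursive` output layout of date, time, size, and key on each line.

```shell
# Hypothetical helper: reads `aws s3 ls --recursive` output on stdin
# (assumed layout: date, time, size, key) and prints one HTTP URL per key.
# The base URL is an assumption taken from the URL quoted in the question.
s3_listing_to_urls() {
  while read -r _date _time _size key; do
    # Skip blank or malformed lines that carry no key.
    if [ -n "$key" ]; then
      printf '%s%s\n' "https://commoncrawl.s3.amazonaws.com/" "$key"
    fi
  done
}
```

It would be used by piping the listing into it, for example `aws --no-sign-request s3 ls --recursive s3://commoncrawl/crawl-data/CC-NEWS/2017/09/ | s3_listing_to_urls`.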

brand17 commented 3 years ago

Thanks