How to get a listing of WARC/WAT/WET files using HTTP for the News Dataset?

commoncrawl / news-crawl

News crawling with StormCrawler - stores content as WARC

Closed: brand17 closed this issue 3 years ago

brand17 commented 3 years ago

I can obtain the file listing for the main Common Crawl dataset from:

https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-09/wet.paths.gz

How can I do this for the Common Crawl News Dataset?

sebastian-nagel commented 3 years ago

aws --no-sign-request s3 ls --recursive s3://commoncrawl/crawl-data/CC-NEWS/

Or, to list only a subset (here, September 2017):

aws --no-sign-request s3 ls --recursive s3://commoncrawl/crawl-data/CC-NEWS/2017/09/
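Since the question asks for HTTP access, here is a minimal Python sketch that turns the output of the `aws s3 ls --recursive` commands above into HTTP URLs. It assumes each output line has the usual `DATE TIME SIZE KEY` layout and that the objects are reachable under the same `https://commoncrawl.s3.amazonaws.com/` host used in the question. The helper name `listing_to_urls` and the sample file names are made up for illustration.

```python
from typing import Iterable, Iterator

# Assumption: same public HTTP host as in the wet.paths.gz URL from the question.
BASE_URL = "https://commoncrawl.s3.amazonaws.com/"


def listing_to_urls(lines: Iterable[str], base_url: str = BASE_URL) -> Iterator[str]:
    """Convert `aws s3 ls --recursive` output lines into HTTP URLs.

    Hypothetical helper: expects lines of the form "DATE TIME SIZE KEY".
    """
    for line in lines:
        # Split into at most four fields so the key stays in one piece.
        parts = line.split(None, 3)
        if len(parts) != 4:
            continue  # skip blank lines and anything that is not an object entry
        key = parts[3].strip()
        yield base_url + key


# Illustrative sample only; these file names and sizes are not real listing output.
sample = [
    "2017-09-01 01:02:03 1073741824 crawl-data/CC-NEWS/2017/09/CC-NEWS-EXAMPLE-00000.warc.gz",
    "2017-09-01 02:03:04 1073741824 crawl-data/CC-NEWS/2017/09/CC-NEWS-EXAMPLE-00001.warc.gz",
]
for url in listing_to_urls(sample):
    print(url)
```

To use it on real output, save the listing to a file (for example by redirecting the `aws` command to `listing.txt`) and pass the open file object to `listing_to_urls`.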

brand17 commented 3 years ago

Thanks