Provide indexing outside AWS

commoncrawl / news-crawl

News crawling with StormCrawler - stores content as WARC

Apache License 2.0

Provide indexing outside AWS #17

Closed by john-hewitt 7 years ago

john-hewitt commented 7 years ago

I do not have AWS credentials at the moment, so I'm unable to determine the filenames of the crawl dumps. Could the directory structure be made available by some method other than aws s3 ls?

I might be wrong, but I don't think aws s3 ls can be run anonymously.

sebastian-nagel commented 7 years ago

The only requirement is to install the AWS Command Line Interface locally on your machine; the --no-sign-request option then allows anonymous access:

aws --no-sign-request s3 ls --recursive s3://commoncrawl/crawl-data/CC-NEWS/

To download the files, prefix every path in the listing with s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/.
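The prefixing step can be sketched as a small shell filter. This is only an illustration, not part of the project: the function name to_urls is hypothetical, and it assumes each listing line has the usual four whitespace-separated fields (date, time, size, key) and that keys contain no spaces.

```shell
# to_urls (hypothetical helper): read `aws s3 ls --recursive` output on stdin
# and print one HTTPS download URL per listed object.
# Assumes four whitespace-separated fields per line: date, time, size, key,
# and that the key itself contains no spaces.
to_urls() {
  awk 'NF >= 4 { print "https://commoncrawl.s3.amazonaws.com/" $4 }'
}
```

It could then be used by piping the listing command above into to_urls and handing the resulting URLs to any HTTP download tool.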

john-hewitt commented 7 years ago

Awesome; you're right. Thanks.

sebastian-nagel commented 7 years ago

No problem. I take this as an important point to add to the get-started page; we also need to mention the news crawl there. Thanks!