Closed john-hewitt closed 7 years ago
The only requirement is to install the AWS Command Line Interface locally on your machine; the --no-sign-request option then allows anonymous access:
aws --no-sign-request s3 ls --recursive s3://commoncrawl/crawl-data/CC-NEWS/
To download the files, prepend s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to every path in the listing.
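In case it helps, here is a rough sketch of the prefixing step. It assumes each line printed by the listing command has four whitespace-separated fields (date, time, size, key) and that keys contain no spaces; `listing_to_urls` and `BASE_URL` are names made up for this example, not part of the AWS CLI.

```shell
#!/bin/sh
# Sketch: turn `aws s3 ls --recursive` output into HTTPS download URLs.
# Assumption: each listing line is "date time size key", keys have no spaces.
BASE_URL="https://commoncrawl.s3.amazonaws.com/"

# listing_to_urls (hypothetical helper): reads listing lines on stdin and
# prints the fourth field (the key) with BASE_URL prepended, one per line.
listing_to_urls() {
    awk -v base="$BASE_URL" 'NF >= 4 { print base $4 }'
}

# Usage (needs network access and the AWS CLI installed):
#   aws --no-sign-request s3 ls --recursive s3://commoncrawl/crawl-data/CC-NEWS/ \
#     | listing_to_urls > urls.txt
```

The resulting urls.txt could then be passed to any HTTP downloader. To stay on S3 instead, set BASE_URL to s3://commoncrawl/ and fetch with aws --no-sign-request s3 cp.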
Awesome; you're right. Thanks.
No problem. I take this as an important point to add to the get-started page; we also need to mention the news crawl there. Thanks!
For the time being, I do not have AWS credentials. This means I'm unable to determine the filenames for the crawl dumps. Could the directory structure be made available by methods other than aws ls? I might be wrong, but I don't think aws ls can be run anonymously.