DavidNemeskey / cc_corpus

Tools for compiling corpora from Common Crawl
GNU Lesser General Public License v3.0

AWS authentication needed #22

Closed by botondbarta 1 year ago

botondbarta commented 2 years ago

Since April, accessing data in the Amazon cloud through the S3 API requires authentication. Unsigned access to Common Crawl is disabled, so the _downloadpages.py script no longer works, because it creates its client with an unsigned config. Removing the Config is enough to make it work.

https://commoncrawl.org/2022/03/introducing-cloudfront-access-to-common-crawl-data/

DavidNemeskey commented 2 years ago

@Baaart25 Thanks for reporting, I have just discovered this myself. A fix is in the works, but I plan to ditch AWS in favor of CloudFront so that we don't need boto anymore.
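To illustrate what boto-free access over CloudFront could look like, here is a rough sketch using only the standard library. It is not the fix that was eventually merged: the function name `build_record_request` and the example path are hypothetical. It assumes the public HTTPS endpoint `https://data.commoncrawl.org/` and that a single record is selected with an HTTP `Range` header built from the offset and length given in the index.

```python
from urllib.request import Request

# Public HTTPS (CloudFront) endpoint for Common Crawl data.
BASE_URL = "https://data.commoncrawl.org/"


def build_record_request(warc_path: str, offset: int, length: int) -> Request:
    """Build a ranged GET request for one record in a WARC file.

    warc_path, offset and length are assumed to come from an index
    entry; no credentials or request signing are involved.
    """
    url = BASE_URL + warc_path.lstrip("/")
    # HTTP byte ranges are inclusive on both ends.
    last_byte = offset + length - 1
    return Request(url, headers={"Range": f"bytes={offset}-{last_byte}"})
```

The request can then be passed to `urllib.request.urlopen`; the response body holds the bytes of that one record.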

DavidNemeskey commented 1 year ago

Resolved via #27.