commoncrawl / cc-index-table

Index Common Crawl archives in tabular format
Apache License 2.0

CCIndexWarcExport: replace jets3t by AWS SDK (#3), access s3://commoncrawl/ with authentication #19

Closed sebastian-nagel closed 2 years ago

sebastian-nagel commented 2 years ago

This PR makes two changes:

  1. Replace jets3t with the AWS SDK v2 for fetching WARC records from s3://commoncrawl/ (#3).
  2. Implement authenticated access to s3://commoncrawl/ (fixes #18): instead of explicitly requesting data from s3://commoncrawl/ without authentication, rely on the default credential provider chain of the AWS SDK.
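
For illustration only, here is a minimal sketch of one building block of fetching a single WARC record: turning a record's byte offset and length into an HTTP `Range` header value. The class `WarcRange` and its method are hypothetical and are not taken from this PR. The sketch assumes records are fetched with ranged GET requests; with the AWS SDK v2 the resulting string could be passed to `GetObjectRequest.Builder.range(...)`, and an `S3Client` built without an explicit credentials provider falls back to the SDK's default credential provider chain.

```java
// Hypothetical helper for illustration; not part of cc-index-table.
final class WarcRange {
    private WarcRange() {}

    /**
     * Builds an HTTP Range header value covering exactly one record,
     * given the record's byte offset and length within the WARC file.
     */
    static String header(long offset, long length) {
        if (offset < 0 || length <= 0) {
            throw new IllegalArgumentException(
                "offset must be >= 0 and length must be > 0");
        }
        // HTTP byte ranges are inclusive on both ends, so the last byte
        // of the record is offset + length - 1 (overflow is rejected).
        long last = Math.addExact(offset, length - 1);
        return "bytes=" + offset + "-" + last;
    }
}
```

With this sketch, a record at offset 100 with length 50 maps to `bytes=100-149`. Keeping the range arithmetic in a small pure function makes the off-by-one at the end of the range easy to test without any network access.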