commoncrawl / cc-index-table

Index Common Crawl archives in tabular format
Apache License 2.0

Add AWS authentication for downloading data #18

Closed aliebrahiiimi closed 2 years ago

aliebrahiiimi commented 2 years ago

Is it possible to add Amazon Web Services authentication to the download of data? You described a solution based on Spark that I used. It worked fine, but now it needs authentication. What are the steps for adding authentication?

sebastian-nagel commented 2 years ago

Hi @aliebrahiiimi, could you provide more context on what you want to achieve or what code you are running? Is the issue related to the project cc-index-table? For more information about the new scheme to access Common Crawl data, see

aliebrahiiimi commented 2 years ago

Hi @sebastian-nagel, thanks for your response. Yes, I want to download cc-index-table. I have downloaded the index of each file, and now I want to download the folder of each index using spark-submit, but I can't because of the new rule for Amazon authentication. I used this tutorial for the download: https://github.com/commoncrawl/cc-index-table.

My script:

spark-submit --driver-memory 50g --class org.commoncrawl.spark.examples.CCIndexWarcExport $APPJAR --csv ../csvs/CC-MAIN-2018-39 --numOutputPartitions 1000 --numRecordsPerWarcFile 10000 --warcPrefix persian-CC s3://commoncrawl/cc-index/table/cc-main/warc/ ../data/CC-MAIN-2018-39/

The result I got:

22/04/24 09:39:36 ERROR CCIndexExport: Failed to fetch s3://commoncrawl/crawl-data/CC-MAIN-2018-39/segments/1537267159820.67/warc/CC-MAIN-20180923212605-20180923233005-00368.warc.gz (bytes = 846547810-846567351): {}
org.jets3t.service.S3ServiceException: Service Error Message. -- ResponseCode: 403, ResponseStatus: Forbidden, XML Error Message: <?xml version="1.0" encoding="UTF-8"?><Error><Code>AccessDenied</Code><Message>Access Denied</Message><RequestId>5W4P5JSKFH54AQZT</RequestId><HostId>0MdeqFM3kedGGBAnsU+VEN20AkLKaG+YKB4AstDnb3jxdsLRz5igBRjmrlcoUrLuBx4VmTYqnjY=</HostId></Error>
    at org.jets3t.service.S3Service.getObject(S3Service.java:2678)
    at org.commoncrawl.spark.examples.CCIndexWarcExport.getCCWarcRecord(CCIndexWarcExport.java:120)
    at org.commoncrawl.spark.examples.CCIndexWarcExport.lambda$run$910c4c43$1(CCIndexWarcExport.java:183)
    at org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsToPair$2(JavaRDDLike.scala:209)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:863)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:863)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:131)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)

sebastian-nagel commented 2 years ago

Hi @aliebrahiiimi, thanks - I can reproduce the problem and am working on a fix. I'll now definitely switch from jets3t to the AWS SDK (see #3) because the latter supports various authentication methods (EC2 IAM roles, environment variables, credentials file, Java properties - see https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/ec2-iam-roles.html).

sebastian-nagel commented 2 years ago

Hi @aliebrahiiimi, could you give #19 a try? You need to configure the AWS credentials via IAM roles, a credentials file, env vars, etc. - see the link in the PR. Let me know whether this works.
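
Before launching spark-submit, it can help to check which of these credential sources are actually visible to the JVM. The following is only a rough diagnostic sketch and not part of cc-index-table: the class name `CredentialSources` and the method `detect` are hypothetical, and the checks assume the conventional names used by the AWS SDK (the environment variables `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`, the Java property `aws.accessKeyId`, and the file `~/.aws/credentials`). It does not validate the credentials and cannot detect an EC2 IAM role.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Properties;

// Hypothetical helper: reports which static credential sources appear to be set.
final class CredentialSources {

    /** Returns the names of credential sources that appear to be configured. */
    static List<String> detect(Map<String, String> env, Properties props, Path homeDir) {
        List<String> found = new ArrayList<>();
        // Environment variables: both the key id and the secret must be present.
        if (env.containsKey("AWS_ACCESS_KEY_ID") && env.containsKey("AWS_SECRET_ACCESS_KEY")) {
            found.add("environment variables");
        }
        // Java system property, e.g. passed as -Daws.accessKeyId=...
        if (props.getProperty("aws.accessKeyId") != null) {
            found.add("Java system properties");
        }
        // Shared credentials file in its default location (assumed, not overridden).
        if (Files.isReadable(homeDir.resolve(".aws").resolve("credentials"))) {
            found.add("credentials file");
        }
        return found;
    }

    public static void main(String[] args) {
        List<String> found = detect(System.getenv(), System.getProperties(),
                Path.of(System.getProperty("user.home")));
        System.out.println(found.isEmpty()
                ? "No static credentials found; an EC2 IAM role may still apply."
                : "Configured credential sources: " + found);
    }
}
```

Note that on a cluster the executors also need credentials, so a check on the driver machine alone is not conclusive.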

Note: if unauthenticated access is mandatory, this could be implemented as well, at least for fetching WARC records from https://data.commoncrawl.org/.
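
For illustration, unauthenticated fetching of a single WARC record over HTTPS might look roughly like the sketch below. This is not the code from #19. The class `WarcRangeFetch` and its methods are hypothetical names, it uses only the JDK's `java.net.http` client (Java 11+), and it assumes that an `s3://commoncrawl/` path maps one-to-one onto `https://data.commoncrawl.org/`. It also assumes that the `bytes = 846547810-846567351` range in the error log above is an inclusive first-to-last byte range.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper: fetches one WARC record via an HTTP range request.
final class WarcRangeFetch {
    static final String S3_PREFIX = "s3://commoncrawl/";
    static final String HTTPS_PREFIX = "https://data.commoncrawl.org/";

    /** Rewrites an s3://commoncrawl/... URL to its HTTPS equivalent (assumed mapping). */
    static String toHttpsUrl(String s3Url) {
        if (!s3Url.startsWith(S3_PREFIX)) {
            throw new IllegalArgumentException("Not a Common Crawl S3 URL: " + s3Url);
        }
        return HTTPS_PREFIX + s3Url.substring(S3_PREFIX.length());
    }

    /** Builds an HTTP Range header value; the last byte position is inclusive. */
    static String rangeHeader(long offset, long length) {
        if (offset < 0 || length <= 0) {
            throw new IllegalArgumentException("offset must be >= 0 and length > 0");
        }
        return "bytes=" + offset + "-" + (offset + length - 1);
    }

    /** Fetches the raw (still gzip-compressed) bytes of one record. */
    static byte[] fetchRecord(String s3Url, long offset, long length)
            throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(URI.create(toHttpsUrl(s3Url)))
                .header("Range", rangeHeader(offset, length))
                .GET()
                .build();
        HttpResponse<byte[]> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofByteArray());
        // 206 Partial Content means the server honored the range request.
        if (response.statusCode() != 206) {
            throw new IOException("Unexpected HTTP status " + response.statusCode());
        }
        return response.body();
    }
}
```

Under these assumptions, the record from the log above would be requested with `fetchRecord(url, 846547810L, 19542L)`, where `url` is the `s3://commoncrawl/crawl-data/...` path from the error message.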

aliebrahiiimi commented 2 years ago

@sebastian-nagel Thank you very much, yes, it works for me.

sebastian-nagel commented 2 years ago

Thanks for the notice, @aliebrahiiimi! I'll go ahead and merge #19.