Closed praveenr019 closed 1 year ago
Hi @praveenr019, given the error message "botocore.exceptions.NoCredentialsError: Unable to locate credentials": is the job run on a Spark cluster or on a single instance? If on a cluster: how are the credentials deployed to the cluster instances (e.g. via IAM roles)?
--input_base_url https://data.commoncrawl.org/
If on a single instance: I haven't seen a credential error just because of processing more data. How are the credentials configured?
Thanks for the reply @sebastian-nagel. Yes, the job is run on a Spark cluster in AWS and the credentials are set up using IAM roles.
No clue what the reason could be. I've never seen this.
My assumption is that in cluster mode, every Python runner is a separate process. This would exclude any concurrency issues while fetching the credentials (for example here).
To address the problem, I'd catch the NoCredentialsError along with the ClientError (sparkcc.py, line 283), log the error, re-instantiate the S3 client and try the download a second time. Let me know if you need help implementing this. Otherwise, it would be interesting to hear whether this solves the problem.
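A rough sketch of that retry idea, not the actual code in sparkcc.py: the names `fetch_with_retry`, `make_s3_client`, `retryable_errors` and the `attempts` parameter are made up for illustration. It assumes the download goes through boto3's `download_fileobj` and that a freshly created client will look up the credentials again.

```python
import logging
from tempfile import TemporaryFile

logger = logging.getLogger(__name__)


def make_s3_client():
    """Create a fresh S3 client (hypothetical helper, boto3 imported lazily)."""
    import boto3
    return boto3.client('s3')


def retryable_errors():
    """Exceptions after which the download is retried."""
    from botocore.exceptions import ClientError, NoCredentialsError
    return (ClientError, NoCredentialsError)


def fetch_with_retry(bucket, key, client_factory=make_s3_client,
                     retry_on=None, attempts=2):
    """Download s3://bucket/key into a temporary file.

    On a retryable error the failure is logged, the S3 client is
    re-instantiated and the download is tried again. Returns the open
    file object positioned at the start, or None if all attempts fail.
    """
    if retry_on is None:
        retry_on = retryable_errors()
    client = client_factory()
    for attempt in range(1, attempts + 1):
        fileobj = TemporaryFile(mode='w+b')
        try:
            client.download_fileobj(bucket, key, fileobj)
            fileobj.seek(0)
            return fileobj
        except retry_on as error:
            fileobj.close()
            logger.error('Attempt %d/%d failed for s3://%s/%s: %s',
                         attempt, attempts, bucket, key, error)
            if attempt < attempts:
                # Re-instantiate the client so credentials are fetched again
                client = client_factory()
    return None
```

In a job this would be called as `fetch_with_retry(bucketname, path)` in place of the single download call; a `None` result means the WARC file is skipped, as it would be after a logged `ClientError`.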
Closing for now. @praveenr019 let me know if this is still an issue!
I created a Spark job subclassing CCSparkJob to retrieve HTML text data. The job works when the input file lists fewer than 10 S3 WARC paths, but throws the error below when run with around 100 S3 WARC paths. Could you please share your thoughts on what might be causing this?