boto3 credentials error when running CCSparkJob with ~100 S3 warc paths as input, but works with <10 S3 warc paths as input

praveenr019 commented 2 years ago

Created a spark job subclassing CCSparkJob to retrieve html text data. This job is working when passing input file with <10 S3 warc paths, but throwing below error when running with around 100 S3 warc paths. Could you please share your thoughts on what must be is causing this.

org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/mnt3/yarn/usercache/hadoop/appcache/application_1654076989914_0005/container_1654076989914_0005_03_000005/pyspark.zip/pyspark/worker.py", line 619, in main
process()
File "/mnt3/yarn/usercache/hadoop/appcache/application_1654076989914_0005/container_1654076989914_0005_03_000005/pyspark.zip/pyspark/worker.py", line 611, in process
serializer.dump_stream(out_iter, outfile)
File "/mnt3/yarn/usercache/hadoop/appcache/application_1654076989914_0005/container_1654076989914_0005_03_000005/pyspark.zip/pyspark/serializers.py", line 259, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "/mnt3/yarn/usercache/hadoop/appcache/application_1654076989914_0005/container_1654076989914_0005_03_000005/__pyfiles__/sparkcc.py", line 355, in process_warcs
stream = self.fetch_warc(uri, self.args.input_base_url)
File "/mnt3/yarn/usercache/hadoop/appcache/application_1654076989914_0005/container_1654076989914_0005_03_000005/__pyfiles__/sparkcc.py", line 290, in fetch_warc
self.get_s3_client().download_fileobj(bucketname, path, warctemp)
File "/usr/local/lib/python3.7/site-packages/boto3/s3/inject.py", line 795, in download_fileobj
return future.result()
File "/usr/local/lib/python3.7/site-packages/s3transfer/futures.py", line 103, in result
return self._coordinator.result()
File "/usr/local/lib/python3.7/site-packages/s3transfer/futures.py", line 266, in result
raise self._exception
File "/usr/local/lib/python3.7/site-packages/s3transfer/tasks.py", line 269, in _main
self._submit(transfer_future=transfer_future, **kwargs)
File "/usr/local/lib/python3.7/site-packages/s3transfer/download.py", line 357, in _submit
**transfer_future.meta.call_args.extra_args,
File "/usr/local/lib/python3.7/site-packages/botocore/client.py", line 508, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/usr/local/lib/python3.7/site-packages/botocore/client.py", line 895, in _make_api_call
operation_model, request_dict, request_context
File "/usr/local/lib/python3.7/site-packages/botocore/client.py", line 917, in _make_request
return self._endpoint.make_request(operation_model, request_dict)
File "/usr/local/lib/python3.7/site-packages/botocore/endpoint.py", line 116, in make_request
return self._send_request(request_dict, operation_model)
File "/usr/local/lib/python3.7/site-packages/botocore/endpoint.py", line 195, in _send_request
request = self.create_request(request_dict, operation_model)
File "/usr/local/lib/python3.7/site-packages/botocore/endpoint.py", line 134, in create_request
operation_name=operation_model.name,
File "/usr/local/lib/python3.7/site-packages/botocore/hooks.py", line 412, in emit
return self._emitter.emit(aliased_event_name, **kwargs)
File "/usr/local/lib/python3.7/site-packages/botocore/hooks.py", line 256, in emit
return self._emit(event_name, kwargs)
File "/usr/local/lib/python3.7/site-packages/botocore/hooks.py", line 239, in _emit
response = handler(**kwargs)
File "/usr/local/lib/python3.7/site-packages/botocore/signers.py", line 103, in handler
return self.sign(operation_name, request)
File "/usr/local/lib/python3.7/site-packages/botocore/signers.py", line 187, in sign
auth.add_auth(request)
File "/usr/local/lib/python3.7/site-packages/botocore/auth.py", line 407, in add_auth
raise NoCredentialsError()
botocore.exceptions.NoCredentialsError: Unable to locate credentials

at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:545)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:703)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:685)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:498)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:35)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.hasNext(Unknown Source)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:954)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:287)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$16(FileFormatWriter.scala:230)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:133)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1474)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)

sebastian-nagel commented 2 years ago

Hi @praveenr019, given the error message "botocore.exceptions.NoCredentialsError: Unable to locate credentials": is the job run on a Spark cluster or on a single instance? If on a cluster: how are the credentials deployed to the cluster instances (eg. via IAM roles)?

see https://github.com/commoncrawl/cc-pyspark#authenticated-s3-access-or-access-via-http
if not running on AWS: use --input_base_url https://data.commoncrawl.org/

sebastian-nagel commented 2 years ago

If on a single instance: I haven't seen a credential error just because of processing more data. How are the credentials configured?

praveenr019 commented 2 years ago

Thanks for the reply @sebastian-nagel. Yes, the job is run on a Spark cluster in AWS and the credentials are setup using IAM roles.

sebastian-nagel commented 2 years ago

No glue what could be the reason. And never seen this.

My assumption is that in cluster mode, every Python runner is a separate process. This would exclude any concurrency issues while fetching the credentials (for example here).

To address the problem, I'd catch the NoCredentialsError along the ClientError (sparkcc.py, line 283), log the error, re-instantiate the S3 client and try the download a second time. Let me know if you need help to implement this. Otherwise, would be interesting to hear whether this solves the problem.

sebastian-nagel commented 1 year ago

Closing for now. @praveenr019 let me know if this is still an issue!

commoncrawl / cc-pyspark

boto3 credentials error when running CCSparkJob with ~100 S3 warc paths as input, but works with <10 S3 warc paths as input #32