JahstreetOrg / spark-on-kubernetes-helm

Spark on Kubernetes infrastructure Helm charts repo
Apache License 2.0

Upgrade to 2.0.1 issues with S3 #54

Closed: kyprifog closed this issue 3 years ago

kyprifog commented 3 years ago

On upgrading to 2.0.1 I can no longer use hadoop-aws and get a somewhat cryptic error. My configuration:

spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.awsAccessKeyId", access_key)
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.awsSecretAccessKey", secret_key)

Querying this public bucket:

spark.read.csv("s3a://nyc-tlc/misc/uber_nyc_data.csv")
py4j.protocol.Py4JJavaError: An error occurred while calling o74.csv.
: java.nio.file.AccessDeniedException: s3a://nyc-tlc/misc/uber_nyc_data.csv: getFileStatus on s3a://nyc-tlc/misc/uber_nyc_data.csv: com.amazonaws.services.s3.model.AmazonS3Exception: 
    at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:230)
    at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:151)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:2198)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:2163)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:2102)
    at org.apache.hadoop.fs.FileSystem.isDirectory(FileSystem.java:1700)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.isDirectory(S3AFileSystem.java:2995)
    at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:47)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:366)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:297)
    at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:286)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:286)
    at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:723)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
Caused by: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: 
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1640)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1304)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1058)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:743)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:717)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:699)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:667)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:649)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:513)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4368)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4315)
    at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1271)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$getObjectMetadata$4(S3AFileSystem.java:1249)
    at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:322)
    at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:285)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.getObjectMetadata(S3AFileSystem.java:1246)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:2183)
    ... 22 more

How do I configure hadoop-aws the way it was configured in version 1.0.0? It seemed to be working then.

kyprifog commented 3 years ago

@jahstreet reopening because this is still an issue for me.

kyprifog commented 3 years ago

@jahstreet Do you have any ideas on this error? I verified that hadoop-aws-3.2.0 was indeed included in the Livy image you built, but I am having a really tough time diagnosing it. Are you able to confirm S3 access works on the most recent version?

kyprifog commented 3 years ago

Here is the scala version of the error:

java.nio.file.AccessDeniedException: s3a://nyc-tlc/misc/uber_nyc_data.csv: getFileStatus on s3a://nyc-tlc/misc/uber_nyc_data.csv: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: 32ED94CC70510060; S3 Extended Request ID: FuZ1ybBdZzDIY5Vm8mQYBtdYzg63nva1MOwYK+wQQngI+DL57hqCC0ctYwUqCnb6NdDJ7J/1og8=), S3 Extended Request ID: FuZ1ybBdZzDIY5Vm8mQYBtdYzg63nva1MOwYK+wQQngI+DL57hqCC0ctYwUqCnb6NdDJ7J/1og8=:403 Forbidden
  at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:230)
  at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:151)
  at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:2198)
  at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:2163)
  at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:2102)
  at org.apache.hadoop.fs.FileSystem.isDirectory(FileSystem.java:1700)
  at org.apache.hadoop.fs.s3a.S3AFileSystem.isDirectory(S3AFileSystem.java:2995)
  at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:47)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:366)
  at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:297)
  at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:286)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:286)
  at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:723)
  at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:553)
  ... 51 elided
Caused by: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: 32ED94CC70510060; S3 Extended Request ID: FuZ1ybBdZzDIY5Vm8mQYBtdYzg63nva1MOwYK+wQQngI+DL57hqCC0ctYwUqCnb6NdDJ7J/1og8=)
  at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1640)
  at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1304)
  at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1058)
  at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:743)
  at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:717)
  at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:699)
  at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:667)
  at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:649)
  at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:513)
  at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4368)
  at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4315)
  at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1271)
  at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$getObjectMetadata$4(S3AFileSystem.java:1249)
  at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:322)
  at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:285)
  at org.apache.hadoop.fs.s3a.S3AFileSystem.getObjectMetadata(S3AFileSystem.java:1246)
  at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:2183)
  ... 63 more

kyprifog commented 3 years ago

If I go in and manually add the AWS keys to the environment, it seems to fix it, so the issue appears to be some disconnect in how hadoop-aws and Spark 3.0 are supposed to forward keys.

kyprifog commented 3 years ago

The Spark docs say that spark-submit is normally responsible for forwarding the keys, which makes me think the corresponding change needs to happen on the Livy side: https://spark.apache.org/docs/latest/cloud-integration.html#authenticating
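
Per those docs, spark-submit copies the AWS_* environment variables into the matching fs.s3a.* options. A minimal sketch of doing the same by hand (illustrative only, assuming the variables are set in the driver's environment):

import org.apache.spark.sql.SparkSession

// Sketch: mirror what spark-submit does per the linked docs by copying the
// AWS_* environment variables into the corresponding s3a Hadoop options.
val spark = SparkSession.builder.getOrCreate()
val hadoopConf = spark.sparkContext.hadoopConfiguration
sys.env.get("AWS_ACCESS_KEY_ID").foreach(hadoopConf.set("fs.s3a.access.key", _))
sys.env.get("AWS_SECRET_ACCESS_KEY").foreach(hadoopConf.set("fs.s3a.secret.key", _))
sys.env.get("AWS_SESSION_TOKEN").foreach(hadoopConf.set("fs.s3a.session.token", _))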

kyprifog commented 3 years ago

I may have found a workaround by just setting:

"spark.executorEnv.AWS_SECRET_ACCESS_KEY"
"spark.executorEnv.AWS_ACCESS_KEY_ID"

in the Spark conf.
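
Concretely, that looks something like this (a sketch, assuming the keys are already set in the driver's environment; spark.executorEnv.* exports a variable into each executor's environment):

import org.apache.spark.sql.SparkSession

// Sketch of the workaround: pass the AWS credentials into each executor's
// environment so the AWS SDK's environment-variable provider can find them.
val spark = SparkSession.builder
    .config("spark.executorEnv.AWS_ACCESS_KEY_ID", sys.env("AWS_ACCESS_KEY_ID"))
    .config("spark.executorEnv.AWS_SECRET_ACCESS_KEY", sys.env("AWS_SECRET_ACCESS_KEY"))
    .getOrCreate()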

kyprifog commented 3 years ago

It turns out, after much digging, that forwarding the keys via the Spark conf in the usual way actually works as expected:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
    .config("fs.s3a.secret.key", "TEMP_SECRET")
    .config("fs.s3a.access.key", "TEMP_KEY")
    .config("fs.s3a.session.token", "TEMP_SESSION")
    .config("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
    .getOrCreate()
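
With a session built like that, the original read should go through:

val df = spark.read.csv("s3a://nyc-tlc/misc/uber_nyc_data.csv")

(For what it's worth, TemporaryAWSCredentialsProvider is the provider meant for key-pair-plus-session-token credentials, which is why fs.s3a.session.token is set alongside the keys.)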

There were a couple of other confounding issues causing the errors in my case. Closing the issue.