apache / iceberg

Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0

Using Iceberg from EKS to access resource in another aws account loads instance role by default #7344

Closed raviranak closed 1 month ago

raviranak commented 1 year ago

Query engine

Using Iceberg from EKS to access resource in another aws account loads instance role by default

Question

When Iceberg runs on EKS and accesses a resource in another AWS account, it loads the instance role by default. Currently Iceberg only has AssumeRoleAwsClientFactory, and we cannot use it with our current setup because our EKS cluster has multiple node groups that would each need assume-role access to the other AWS account. What we want is for Iceberg to load the service role that is provided via WebIdentityTokenCredentialsProvider. Could you please help with a solution here?

stevenzwu commented 1 year ago

@raviranak can you check whether this is the same issue as https://github.com/apache/iceberg/issues/6715? The latest Iceberg release, 1.2.0, should include the fix.

raviranak commented 1 year ago

Hi @stevenzwu

Here is the Spark session configuration I used:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("MyApp") \
    .config("spark.sql.hive.metastore.glueCatalog.enabled", "true") \
    .config("spark.sql.catalog.iceberg_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog") \
    .config("spark.sql.catalog.iceberg_catalog.warehouse", "s3://internal/iceberg/warehouse/") \
    .config("spark.sql.catalog.iceberg_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO") \
    .config("spark.sql.catalog.iceberg_catalog", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalogImplementation", "hive") \
    .config("spark.hadoop.fs.s3a.aws.credentials.provider", "com.amazonaws.auth.WebIdentityTokenCredentialsProvider") \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.jars.packages",
            "org.apache.hadoop:hadoop-aws:3.3.1,org.apache.spark:spark-avro_2.12:3.2.0,"
            "org.apache.hadoop:hadoop-aws:3.3.1,"
            "org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:1.2.0") \
    .config("spark.jars",
            "/home/ray/.ivy2/jars/org.apache.spark_spark-avro_2.12-3.2.0.jar,"
            "/home/ray/.ivy2/jars/com.amazonaws_aws-java-sdk-bundle-1.11.901.jar,"
            "/home/ray/.ivy2/jars/org.apache.hadoop_hadoop-aws-3.3.1.jar,"
            "/home/ray/.ivy2/jars/org.apache.iceberg_iceberg-spark-runtime-3.2_2.12-1.2.0.jar,"
            "https://internal.s3.amazonaws.com/iceberg/bundle-2.17.131.jar,"
            "https://internal.s3.amazonaws.com/iceberg/url-connection-client-2.17.131.jar") \
    .config("spark.hadoop.fs.s3a.canned.acl", "BucketOwnerFullControl") \
    .config("spark.hadoop.hive.metastore.glue.catalogid", "123456789") \
    .getOrCreate()
```

I am still facing this issue:

software.amazon.awssdk.services.glue.model.AccessDeniedException: User: arn:aws:sts:::assumed-role/clusteri-07d3180159a814e31 is not authorized to perform: glue:GetTable on resource:

Can you please help here? It seems the role doesn't resolve to the service role.
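To confirm which role the Glue client actually picked up, the principal can be pulled out of the AccessDeniedException text and compared with the role attached to the Kubernetes service account. The following is only a rough sketch using the Python standard library; the function name and the regular expression are illustrative assumptions, not part of Iceberg or the AWS SDK.

```python
import re

# Matches STS assumed-role ARNs such as
#   arn:aws:sts::123456789012:assumed-role/my-role/my-session
# The account ID and the session name are optional so that redacted
# ARNs (like the one in the error above) still match.
ASSUMED_ROLE_RE = re.compile(
    r"arn:aws:sts::(?P<account>\d*):assumed-role/"
    r"(?P<role>[^/\s]+)(?:/(?P<session>\S+))?"
)


def assumed_role_from_error(message):
    """Return account, role and session found in an error message, or None."""
    match = ASSUMED_ROLE_RE.search(message)
    if match is None:
        return None
    return {
        "account": match.group("account") or None,
        "role": match.group("role"),
        "session": match.group("session"),
    }


if __name__ == "__main__":
    error = (
        "User: arn:aws:sts:::assumed-role/clusteri-07d3180159a814e31 "
        "is not authorized to perform: glue:GetTable on resource:"
    )
    print(assumed_role_from_error(error))
```

If the extracted role name is the node group's instance role rather than the service account's role, the client did not go through the web identity path.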

raviranak commented 1 year ago

I am using Spark 3.3.0 and iceberg-spark-runtime-3.2_2.12-1.2.0 (Iceberg 1.2.0).

raviranak commented 1 year ago

Can you help with how to change the credentials provider so that the Iceberg client uses WebIdentityTokenFileCredentialsProvider?

MarquisC commented 1 year ago

Hey @raviranak @stevenzwu, we're seeing something similar on EKS as well, via the Iceberg Flink path (wanted to get your thoughts):

The default credential provider should, by precedence, have attempted the WebIdentity path first (before falling back to the EC2 instance role), right?

If I kubectl exec into the container and install the aws-cli, the result of aws sts get-caller-identity correctly follows the hierarchy and selects the Kubernetes service account -> IAM role (WebIdentity path).
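A lighter check that does not require installing the aws-cli is to verify that the pod has the two environment variables that EKS injects for IAM roles for service accounts, AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE, since the web identity provider reads both. This is a minimal sketch; the function name and its parameters are assumptions made for illustration, and it only shows that the environment is wired up, not which provider the JVM client ends up using.

```python
import os

# Environment variables that EKS injects for IAM roles for service accounts.
IRSA_ENV_VARS = ("AWS_ROLE_ARN", "AWS_WEB_IDENTITY_TOKEN_FILE")


def check_irsa_env(env=None, file_exists=os.path.isfile):
    """Return a list of problems found; an empty list means the env looks fine."""
    env = os.environ if env is None else env
    problems = []
    for name in IRSA_ENV_VARS:
        if not env.get(name):
            problems.append(f"{name} is not set")
    token_file = env.get("AWS_WEB_IDENTITY_TOKEN_FILE")
    # The token file is mounted into the pod; a missing file means the
    # service account annotation or the projected volume is not in place.
    if token_file and not file_exists(token_file):
        problems.append(f"token file {token_file} does not exist")
    return problems


if __name__ == "__main__":
    for problem in check_irsa_env() or ["web identity environment looks complete"]:
        print(problem)
```

Running this with the same interpreter and environment as the Spark driver or Flink process would be the most telling, because variables visible in a kubectl exec shell are not necessarily the ones the job sees.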

In the time I spent looking, I couldn't find an easy way to directly configure the Glue client within the Iceberg library (probably user error on my part).

What I ended up doing was just letting the EC2 instance role assume the role it needs via:

Create Catalog ...
'catalog-impl' = 'org.apache.iceberg.aws.glue.GlueCatalog',
'io-impl' = 'org.apache.iceberg.aws.s3.S3FileIO',
'client.assume-role.region' = 'us-east-1',
'client.factory' = 'org.apache.iceberg.aws.AssumeRoleAwsClientFactory',
'client.assume-role.arn' = 'arn:aws:iam::${aws account number}:role/${the role that should have worked from web identity perms}'
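For anyone templating this workaround, the property list above can be generated rather than hand-edited. The sketch below renders a Flink SQL CREATE CATALOG statement from a dict. The helper names, the catalog name, and the example role ARN are hypothetical, and only the properties quoted in this comment are included, so a working catalog will likely need the additional properties elided above.

```python
def assume_role_properties(role_arn, region):
    """Catalog properties for the AssumeRoleAwsClientFactory workaround."""
    return {
        "catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        "io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        "client.assume-role.region": region,
        "client.factory": "org.apache.iceberg.aws.AssumeRoleAwsClientFactory",
        "client.assume-role.arn": role_arn,
    }


def render_create_catalog(name, properties):
    """Render a Flink SQL CREATE CATALOG statement from a dict of properties."""

    def quote(value):
        # SQL string literals escape a single quote by doubling it.
        return "'" + str(value).replace("'", "''") + "'"

    pairs = ",\n".join(
        f"  {quote(key)} = {quote(value)}" for key, value in properties.items()
    )
    return f"CREATE CATALOG {name} WITH (\n{pairs}\n)"


if __name__ == "__main__":
    # Hypothetical account number and role name, for illustration only.
    props = assume_role_properties(
        "arn:aws:iam::123456789012:role/example-role", "us-east-1"
    )
    print(render_create_catalog("glue_catalog", props))
```

Keeping the role ARN and region as parameters makes it easier to swap the workaround out later if the web identity path starts working for the session cluster.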

The particular use case where it wasn't working was enabling Flink Session clusters on kubernetes/eks and the Flink SQL Gateway to chat with the Glue Data Catalog correctly.

Our fat jar Flink jobs successfully leverage the WebIdentity path (we allow our jobs to dynamically create the tables and databases in glue if they don't exist) where the deps include the normal aws sdk and the url-connection-client.

Hopefully this is helpful for you @raviranak

-- Edit --

@raviranak you can't include both the url connection lib jar and the AWS SDK bundle (which already includes the third-party dependency that the url connection jar would satisfy). When they're both on the classpath, they conflict.
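A quick way to spot that overlap is to scan the spark.jars value for both jars. This is a rough sketch based only on the file names that appear in the Spark configuration earlier in the thread (bundle-<version>.jar and url-connection-client-<version>.jar); the function names and the file-name heuristics are assumptions, not an official check.

```python
import posixpath
import re
from urllib.parse import urlparse


def jar_basenames(spark_jars):
    """Split a comma-separated spark.jars value into jar file names."""
    names = []
    for entry in spark_jars.split(","):
        entry = entry.strip()
        if entry:
            # urlparse handles both plain paths and http(s) URLs.
            names.append(posixpath.basename(urlparse(entry).path))
    return names


def find_sdk_v2_overlap(spark_jars):
    """Return the jars involved if both the bundle and the url client are listed."""
    names = jar_basenames(spark_jars)
    # Heuristic: the AWS SDK v2 bundle jar is named bundle-<version>.jar.
    bundles = [n for n in names if re.fullmatch(r"bundle-\d[\w.\-]*\.jar", n)]
    url_clients = [n for n in names if n.startswith("url-connection-client-")]
    if bundles and url_clients:
        return bundles + url_clients
    return []


if __name__ == "__main__":
    jars = (
        "/home/ray/.ivy2/jars/com.amazonaws_aws-java-sdk-bundle-1.11.901.jar,"
        "https://internal.s3.amazonaws.com/iceberg/bundle-2.17.131.jar,"
        "https://internal.s3.amazonaws.com/iceberg/url-connection-client-2.17.131.jar"
    )
    print(find_sdk_v2_overlap(jars))
```

An empty result means the heuristic found no overlap; it does not rule out other duplicate classes on the classpath.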

github-actions[bot] commented 2 months ago

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

github-actions[bot] commented 1 month ago

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'