Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0

Spark configuration for Amazon access key and secret key with Glue catalog for Apache Iceberg is not honored #10078

Closed AwasthiSomesh closed 2 weeks ago

AwasthiSomesh commented 6 months ago

Hi Team,

We are using the code below to access an Iceberg table from the Glue catalog, with the data stored in S3:

var spark = SparkSession.builder().master("local[*]")
  .config("spark.sql.defaultCatalog", "AwsDataCatalog")
  .config("spark.sql.catalog.AwsDataCatalog", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.AwsDataCatalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
  .config("spark.sql.catalog.AwsDataCatalog.io-imp", "org.apache.iceberg.aws.s3.S3FileIO")
  .config("spark.hadoop.fs.s3a.access.key", "XXXXXXXXXXXXXXXXXXXXXXxxx")
  .config("spark.hadoop.fs.s3a.secret.key", "XXXXXXXXXXXXXXXXXXXXXXXXXXXx")
  .config("spark.hadoop.fs.s3a.aws.region", "us-west-2")
  .getOrCreate();

val df1 = spark.sql("select * from default.iceberg_table_exercise1");

Error- Exception in thread "main" software.amazon.awssdk.core.exception.SdkClientException: Unable to load credentials from any of the providers in the chain AwsCredentialsProviderChain(credentialsProviders=[SystemPropertyCredentialsProvider(), EnvironmentVariableCredentialsProvider(), WebIdentityTokenCredentialsProvider(), ProfileCredentialsProvider(profileName=default, profileFile=ProfileFile(profilesAndSectionsMap=[])), ContainerCredentialsProvider(), InstanceProfileCredentialsProvider()]) : [SystemPropertyCredentialsProvider(): Unable to load credentials from system settings. Access key must be specified either via environment variable (AWS_ACCESS_KEY_ID) or system property (aws.accessKeyId)., EnvironmentVariableCredentialsProvider(): Unable to load credentials from system settings. Access key must be specified either via environment variable (AWS_ACCESS_KEY_ID) or system property (aws.accessKeyId)., WebIdentityTokenCredentialsProvider(): Either the environment variable AWS_WEB_IDENTITY_TOKEN_FILE or the javaproperty aws.webIdentityTokenFile must be set., ProfileCredentialsProvider(profileName=default, profileFile=ProfileFile(profilesAndSectionsMap=[])): Profile file contained no credentials for profile 'default': ProfileFile(profilesAndSectionsMap=[]), ContainerCredentialsProvider(): Cannot fetch credentials from container - neither AWS_CONTAINER_CREDENTIALS_FULL_URI or AWS_CONTAINER_CREDENTIALS_RELATIVE_URI environment variables are set., InstanceProfileCredentialsProvider(): Failed to load credentials from IMDS.]

This code throws an error that it is unable to load the access key and secret key, but when we pass this information with System.setProperty() it works.

But our requirement is to set these at the Spark level, not the system level.

Jars we used: iceberg-spark-runtime-3.5_2.12-1.5.0, iceberg-aws-bundle-1.5.0

Can anyone please help ASAP?

Thanks,

nastra commented 6 months ago
.config("spark.hadoop.fs.s3a.access.key", "XXXXXXXXXXXXXXXXXXXXXXxxx")
.config("spark.hadoop.fs.s3a.secret.key", "XXXXXXXXXXXXXXXXXXXXXXXXXXXx")
.config("spark.hadoop.fs.s3a.aws.region", "us-west-2")

are the wrong settings for S3FileIO. What you need is s3.region / s3.access-key-id / s3.secret-access-key prefixed by spark.sql.catalog.AwsDataCatalog -> .config("spark.sql.catalog.AwsDataCatalog.s3.region", "...")

Also I noticed there's a typo in spark.sql.catalog.AwsDataCatalog.io-imp. It should be spark.sql.catalog.AwsDataCatalog.io-impl
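
Put together, a sketch of the corrected catalog properties (the key values are placeholders):

  .config("spark.sql.catalog.AwsDataCatalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
  .config("spark.sql.catalog.AwsDataCatalog.s3.region", "us-west-2")
  .config("spark.sql.catalog.AwsDataCatalog.s3.access-key-id", "<access-key>")
  .config("spark.sql.catalog.AwsDataCatalog.s3.secret-access-key", "<secret-key>")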

AwasthiSomesh commented 6 months ago

@nastra Thanks for your reply

After adding this we see the same error. Could you please check whether there is any mistake on my end?

val spark = SparkSession.builder().master("local[*]")
  .config("spark.sql.defaultCatalog", "AwsDataCatalog")
  .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
  .config("spark.sql.catalog.AwsDataCatalog", "org.apache.iceberg.spark.SparkSessionCatalog")
  .config("spark.sql.catalog.AwsDataCatalog", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.AwsDataCatalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
  .config("spark.sql.catalog.AwsDataCatalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
  .config("spark.sql.catalog.AwsDataCatalog.s3.region", "us-west-2")
  .config("spark.sql.catalog.AwsDataCatalog.s3.access-key-id", "XXXXXXXXXXXXXXXXXXXXXXXXxx")
  .config("spark.sql.catalog.AwsDataCatalog.s3.secret-access-key", "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXxx")

@nastra Could you please suggest whether we need to add more dependencies for the S3 credentials to be honored? Currently we are using only the jars below as a tpl.

Jars used: iceberg-spark-runtime-3.5_2.12-1.5.0, iceberg-aws-bundle-1.5.0

Thanks Somesh

AwasthiSomesh commented 6 months ago

@nastra I also see we are getting an error that the region is unable to load.

Error:- Exception in thread "main" software.amazon.awssdk.core.exception.SdkClientException: Unable to load region from any of the providers in the chain software.amazon.awssdk.regions.providers.DefaultAwsRegionProviderChain@610fbe1c: [software.amazon.awssdk.regions.providers.SystemSettingsRegionProvider@53cb0bcb: Unable to load region from system settings. Region must be specified either via environment variable (AWS_REGION) or system property (aws.region)., software.amazon.awssdk.regions.providers.AwsProfileRegionProvider@41f964f9: No region provided in profile: default, software.amazon.awssdk.regions.providers.InstanceProfileRegionProvider@11399548: Unable to contact EC2 metadata service.]

After adding the region via System.setProperty("aws.region", "us-west-2"), we get an error for credentials loading like below: Unable to load credentials from any of the providers in the chain AwsCredentialsProviderChain(credentialsProviders=[SystemPropertyCredentialsProvider(), EnvironmentVariableCredentialsProvider(), WebIdentityTokenCredentialsProvider(),

After adding all the properties via System.setProperty() as below, everything works fine.

System.setProperty("aws.region", "us-west-2")
System.setProperty("aws.accessKeyId", "xxxxxxxxxxxxxxxxxxxxxxx")
System.setProperty("aws.secretAccessKey", "XXXXXXXXXXXXXXXXXXXXXXXXx")

But we need this honored via Spark config properties only, because System.setProperty() does not fit our use case.

Could you please help here?

Thanks, Somesh

nastra commented 6 months ago

I see you have

.config("spark.sql.catalog.AwsDataCatalog","org.apache.iceberg.spark.SparkSessionCatalog")
.config("spark.sql.catalog.AwsDataCatalog", "org.apache.iceberg.spark.SparkCatalog")

What you want is probably to only have .config("spark.sql.catalog.AwsDataCatalog", "org.apache.iceberg.spark.SparkCatalog").

Having iceberg-spark-runtime-3.5_2.12-1.5.0 + iceberg-aws-bundle-1.5.0 should be enough in terms of dependencies.

You might want to go through https://iceberg.apache.org/docs/nightly/aws/#glue-catalog to double-check what else you need for Glue.
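
For reference, a minimal Glue catalog setup along the lines of that page might look like the sketch below (the warehouse path, region, and key values are placeholders, not taken from this issue):

val spark = SparkSession.builder().master("local[*]")
  .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
  .config("spark.sql.catalog.AwsDataCatalog", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.AwsDataCatalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
  .config("spark.sql.catalog.AwsDataCatalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
  .config("spark.sql.catalog.AwsDataCatalog.warehouse", "s3://<bucket>/<warehouse-prefix>")
  .config("spark.sql.catalog.AwsDataCatalog.client.region", "<aws-region>")
  .config("spark.sql.catalog.AwsDataCatalog.s3.access-key-id", "<access-key>")
  .config("spark.sql.catalog.AwsDataCatalog.s3.secret-access-key", "<secret-key>")
  .getOrCreate()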

AwasthiSomesh commented 6 months ago

@nastra Thanks a lot for your reply.

I am facing a credentials loading issue from Spark. Could you please help with how to add the access key, secret key, and region in the Spark config?

The same program works fine if we add these through System.setProperty(), but if we add them through Spark we get the credentials loading issue.

Thanks, Somesh

AwasthiSomesh commented 6 months ago

@lxs360 were you able to resolve this issue? I'm also facing the same issue with the same Spark configuration.

I see you already reported this issue a while back: https://github.com/apache/iceberg/issues/4739

nastra commented 6 months ago

@AwasthiSomesh you might want to go through the link that I posted (and also through the Glue docs on how to connect to Iceberg). The settings you need could depend on your local AWS client credential setup (e.g. you might need to specify glue.id)
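
For example, pointing the catalog at a specific Glue catalog ID would be one extra property (the value below is a placeholder):

  .config("spark.sql.catalog.AwsDataCatalog.glue.id", "<aws-account-id>")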

AwasthiSomesh commented 6 months ago

@nastra I checked everything. It seems the Spark configuration for the Amazon S3 credentials is not working for Iceberg, but

System.setProperty("aws.region", "us-west-2")
System.setProperty("aws.accessKeyId", "xxxxxxxxxxxxxxxxxxxxxxx")
System.setProperty("aws.secretAccessKey", "XXXXXXXXXXXXXXXXXXXXXXXXx")

the settings above work fine. I'm not sure what configuration is needed for Spark; it always falls back to the default credentials provider chain.

Thanks, Somesh

AwasthiSomesh commented 6 months ago

@lxs360 do you have any solution for the above query?

andythsu commented 1 month ago

I'm facing the same issue as well. I'm on iceberg-aws-bundle-1.6.0 but it still complains that the region is not set.

nastra commented 1 month ago

@andythsu what does your Spark config look like?

clamar14 commented 1 month ago

@AwasthiSomesh, @andythsu were you able to solve it? I have the same problem on iceberg-spark-runtime-3.5_2.12:1.5.2.

andythsu commented 1 month ago

@clamar14 @nastra I ended up using different jars and putting everything together instead of using iceberg-aws-bundle:

def get_builder() -> SparkSession.Builder:
    return (
        SparkSession.builder
        .config(
            "spark.jars",
            (
                f"{JARS_BASE_PATH}/iceberg-spark-runtime-3.4_2.12-1.6.0.jar"
                f",{JARS_BASE_PATH}/iceberg-spark-extensions-3.4_2.12-1.6.0.jar"
                f",{JARS_BASE_PATH}/hadoop-aws-3.3.2.jar"  # needed for hadoop s3 file system
                f",{JARS_BASE_PATH}/aws-java-sdk-bundle-1.12.769.jar"
                f",{JARS_BASE_PATH}/protobuf-java-3.2.0.jar"
            )
        )
        ...
    )
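
Worth noting for this route (a general observation, not something verified in this thread): hadoop-aws provides the S3A filesystem, which reads the Hadoop-style properties, while Iceberg's S3FileIO reads the catalog-prefixed properties, so the two sets of credential settings are independent.

Hadoop S3A (used by hadoop-aws for s3a:// paths):

  .config("spark.hadoop.fs.s3a.access.key", "<access-key>")
  .config("spark.hadoop.fs.s3a.secret.key", "<secret-key>")

Iceberg S3FileIO (read per catalog for table data and metadata):

  .config("spark.sql.catalog.<catalog>.s3.access-key-id", "<access-key>")
  .config("spark.sql.catalog.<catalog>.s3.secret-access-key", "<secret-key>")
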
HemantMarve commented 4 weeks ago

@clamar14, @nastra Try after adding the property below:

.config("spark.sql.catalog.AwsDataCatalog.client.region","us-south")

or

.config("spark.sql.catalog.your_catalog_name.client.region","us-south")

clamar14 commented 4 weeks ago

Thank you all, I finally solved it this way:

.config("spark.sql.catalog.AwsDataCatalog.s3.access-key-id", "xxx")
.config("spark.sql.catalog.AwsDataCatalog.s3.secret-access-key", "xxx")
.config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.4,org.slf4j:slf4j-simple:1.6.1,org.slf4j:slf4j-api:1.6.1,org.projectnessie.nessie-integrations:nessie-spark-extensions-3.5_2.12:0.91.2,org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2,software.amazon.awssdk:bundle:2.17.257,software.amazon.awssdk:url-connection-client:2.17.257")
.config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,org.projectnessie.spark.extensions.NessieSparkSessionExtensions")
.config("spark.kryo.registrator", "org.apache.sedona.core.serde.SedonaKryoRegistrator")
.config("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog")
.config("spark.sql.catalog.nessie.catalog-impl", "org.apache.iceberg.nessie.NessieCatalog")
.config("spark.sql.catalog.nessie.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
.config("spark.sql.catalog.nessie.warehouse", "s3a://xxx/nessie")
.config("spark.sql.catalog.nessie.s3.endpoint", "https://xxx")
.config("spark.sql.catalog.nessie.uri", "http://xxx")
.config("spark.sql.catalog.nessie.ref", "main")
.config("spark.sql.catalog.nessie.authentication.type", "NONE")
.config"spark.sql.warehouse.dir", "s3a://xxx/nessie")
.config("spark.sql.catalog.nessie.client.credentials-provider", "software.amazon.awssdk.auth.credentials.SystemPropertyCredentialsProvider")
.config("spark.driver.extraJavaOptions", "-Daws.region=eu-central-1")
.config("spark.executor.extraJavaOptions", "-Daws.region=eu-central-1")
.config("spark.sql.catalog.nessie.s3.access-key-id", "xxx")
.config("spark.sql.catalog.nessie.s3.secret-access-key", "xxx")

In particular, the extraJavaOptions configurations were helpful in removing the 'Unable to load region' error (they set the aws.region JVM system property, which the AWS SDK's default region provider chain reads), while the last two lines solved the 'Unable to load credentials' error.

AwasthiSomesh commented 3 weeks ago

@andythsu @clamar14

Please try an AWS custom credentials provider. Follow the approach below if needed; I resolved this issue using the implementation below.

val spark = SparkSession.builder().master("local[*]")
  // .config("spark.sql.defaultCatalog", "glue_catalog")
  .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
  .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.glue_catalog.warehouse", "s3://bucket/test/icebergsorted/otfdb/")
  .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
  .config("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
  .config("spark.sql.catalog.glue_catalog.client.credentials-provider", "example.CustomCredentialProvider")
  .config("spark.sql.catalog.glue_catalog.client.region", "us-east-1")
  .config("spark.sql.catalog.glue_catalog.client.credentials-provider.accessKeyId", "")
  .config("spark.sql.catalog.glue_catalog.client.credentials-provider.secretAccessKey", "")
  .getOrCreate()

Please note the CustomCredentialProvider class referenced above; it was implemented as below.

import java.util

import software.amazon.awssdk.auth.credentials.{AwsBasicCredentials, AwsCredentials, AwsCredentialsProvider}

class CustomCredentialProvider extends AwsCredentialsProvider {

  private var credentials: AwsCredentials = null

  // Auxiliary constructor: builds static credentials from the keys passed in.
  def this(keys: util.Map[String, String]) {
    this()
    credentials = AwsBasicCredentials.create(keys.get("accessKeyId"), keys.get("secretAccessKey"))
  }

  override def resolveCredentials: AwsCredentials = this.credentials
}

object CustomCredentialProvider {

  // Create a credentials provider that always returns the provided set of credentials.
  def create(keys: util.Map[String, String]): CustomCredentialProvider =
    new CustomCredentialProvider(keys)
}
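
For context on why this works (an assumption based on the create(keys) factory above, not verified against the Iceberg source here): the class named in client.credentials-provider appears to be instantiated by Iceberg with the catalog properties carrying the client.credentials-provider. prefix, which is where the accessKeyId and secretAccessKey map keys come from. A manual equivalent of that wiring, with placeholder values:

val provider = CustomCredentialProvider.create(
  java.util.Map.of("accessKeyId", "<access-key>", "secretAccessKey", "<secret-key>"))
val creds = provider.resolveCredentials // AwsBasicCredentials built from the two map values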

AwasthiSomesh commented 3 weeks ago

@andythsu @clamar14 no extra jars or other configuration are needed.

AwasthiSomesh commented 2 weeks ago

Based on the above details it is working fine.

sean-lynch commented 2 weeks ago

@clamar14's suggestion of

.config("spark.driver.extraJavaOptions", "-Daws.region=us-west-1")

was the only way I could find to provide a region as part of the PySpark SparkSession configuration.

Given the following package versions: