Closed AwasthiSomesh closed 2 weeks ago
.config("spark.hadoop.fs.s3a.access.key", "XXXXXXXXXXXXXXXXXXXXXXxxx")
.config("spark.hadoop.fs.s3a.secret.key", "XXXXXXXXXXXXXXXXXXXXXXXXXXXx")
.config("spark.hadoop.fs.s3a.aws.region", "us-west-2")
are the wrong settings for S3FileIO. What you need is s3.region / s3.access-key-id / s3.secret-access-key, prefixed by spark.sql.catalog.AwsDataCatalog, e.g.
.config("spark.sql.catalog.AwsDataCatalog.s3.region", "...")
Also I noticed there's a typo in spark.sql.catalog.AwsDataCatalog.io-imp. It should be spark.sql.catalog.AwsDataCatalog.io-impl.
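Putting those corrections together, a minimal sketch of the session config might look like the following. This is only an illustration of the advice above, not a verified drop-in: region, keys, and catalog name are placeholders, and warehouse or other Glue settings are omitted.

```scala
import org.apache.spark.sql.SparkSession

// Sketch: S3FileIO reads the catalog-scoped s3.* properties,
// not the spark.hadoop.fs.s3a.* ones. Values are placeholders.
val spark = SparkSession.builder().master("local[*]")
  .config("spark.sql.defaultCatalog", "AwsDataCatalog")
  .config("spark.sql.catalog.AwsDataCatalog", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.AwsDataCatalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
  .config("spark.sql.catalog.AwsDataCatalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO") // io-impl, not io-imp
  .config("spark.sql.catalog.AwsDataCatalog.s3.region", "us-west-2")
  .config("spark.sql.catalog.AwsDataCatalog.s3.access-key-id", "xxx")
  .config("spark.sql.catalog.AwsDataCatalog.s3.secret-access-key", "xxx")
  .getOrCreate()
```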
@nastra Thanks for your reply
After adding this we see the same error. Could you please check whether there is any mistake on my end?
val spark = SparkSession.builder().master("local[*]")
.config("spark.sql.defaultCatalog", "AwsDataCatalog")
.config("spark.sql.extensions","org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
.config("spark.sql.catalog.AwsDataCatalog","org.apache.iceberg.spark.SparkSessionCatalog")
.config("spark.sql.catalog.AwsDataCatalog", "org.apache.iceberg.spark.SparkCatalog")
.config("spark.sql.catalog.AwsDataCatalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
.config("spark.sql.catalog.AwsDataCatalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
.config("spark.sql.catalog.AwsDataCatalog.s3.region", "us-west-2")
.config("spark.sql.catalog.AwsDataCatalog.s3.access-key-id", "XXXXXXXXXXXXXXXXXXXXXXXXxx")
.config("spark.sql.catalog.AwsDataCatalog.s3.secret-access-key", "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXxx")
@nastra Could you please suggest whether we need to add more dependencies to honor the S3 credentials? Currently we are using only the jars below as third-party libraries.
Jars used: iceberg-spark-runtime-3.5_2.12-1.5.0, iceberg-aws-bundle-1.5.0
Thanks Somesh
@nastra I also see that we are getting an error that the region is unable to load.
Error:- Exception in thread "main" software.amazon.awssdk.core.exception.SdkClientException: Unable to load region from any of the providers in the chain software.amazon.awssdk.regions.providers.DefaultAwsRegionProviderChain@610fbe1c: [software.amazon.awssdk.regions.providers.SystemSettingsRegionProvider@53cb0bcb: Unable to load region from system settings. Region must be specified either via environment variable (AWS_REGION) or system property (aws.region)., software.amazon.awssdk.regions.providers.AwsProfileRegionProvider@41f964f9: No region provided in profile: default, software.amazon.awssdk.regions.providers.InstanceProfileRegionProvider@11399548: Unable to contact EC2 metadata service.]
After adding the region via System.setProperty("aws.region", "us-west-2"), we get a credentials loading error like the one below: Unable to load credentials from any of the providers in the chain AwsCredentialsProviderChain(credentialsProviders=[SystemPropertyCredentialsProvider(), EnvironmentVariableCredentialsProvider(), WebIdentityTokenCredentialsProvider(),
After adding all properties via System.setProperty() as below, everything works fine.
System.setProperty("aws.region", "us-west-2")
System.setProperty("aws.accessKeyId", "xxxxxxxxxxxxxxxxxxxxxxx")
System.setProperty("aws.secretAccessKey", "XXXXXXXXXXXXXXXXXXXXXXXXx")
But we need to honor it via Spark config properties only, because System.setProperty() is not an option in our use case.
Could you please help here.
Thanks, Somesh
I see you have
.config("spark.sql.catalog.AwsDataCatalog","org.apache.iceberg.spark.SparkSessionCatalog")
.config("spark.sql.catalog.AwsDataCatalog", "org.apache.iceberg.spark.SparkCatalog")
What you want is probably to only have .config("spark.sql.catalog.AwsDataCatalog", "org.apache.iceberg.spark.SparkCatalog").
Having iceberg-spark-runtime-3.5_2.12-1.5.0
+ iceberg-aws-bundle-1.5.0
should be enough in terms of dependencies.
You might want to go through https://iceberg.apache.org/docs/nightly/aws/#glue-catalog to double-check what else you need for Glue.
@nastra Thanks a lot for your reply.
I am facing a credentials loading issue from Spark. Could you please help with how to add the access key, secret key, and region in the Spark config?
The same program works fine if we add them through System.setProperty(), but if we add them through the Spark config we get a credentials loading issue.
Thanks, Somesh
@lxs360 were you able to resolve this issue? I'm also facing the same issue with the same Spark configuration.
I see you already reported this issue a while back: https://github.com/apache/iceberg/issues/4739
@AwasthiSomesh you might want to go through the link that I posted (and also through the Glue docs on how to connect to Iceberg). The settings you need could depend on your local AWS client credential setup (e.g. you might need to specify glue.id).
@nastra I checked everything. It seems the Spark configuration for the Amazon S3 settings is not working for Iceberg, but the settings below work fine:
System.setProperty("aws.region", "us-west-2")
System.setProperty("aws.accessKeyId", "xxxxxxxxxxxxxxxxxxxxxxx")
System.setProperty("aws.secretAccessKey", "XXXXXXXXXXXXXXXXXXXXXXXXx")
I'm not sure what configuration is needed for Spark; it always falls back to the default credentials provider chain.
Thanks, Somesh
@lxs360 do you have any solution for the above query?
Facing the same issue as well. I'm on iceberg-aws-bundle-1.6.0 but it still complains that the region is not set.
@andythsu what does your Spark config look like?
@AwasthiSomesh, @andythsu were you able to solve it? I have the same problem on iceberg-spark-runtime-3.5_2.12:1.5.2
@clamar14 @nastra I ended up using different jars and putting everything together instead of using iceberg-aws-bundle
def get_builder() -> SparkSession.Builder:
    return (
        SparkSession.builder
        .config(
            "spark.jars",
            (
                f"{JARS_BASE_PATH}/iceberg-spark-runtime-3.4_2.12-1.6.0.jar"
                f",{JARS_BASE_PATH}/iceberg-spark-extensions-3.4_2.12-1.6.0.jar"
                f",{JARS_BASE_PATH}/hadoop-aws-3.3.2.jar"  # needed for the Hadoop S3 file system
                f",{JARS_BASE_PATH}/aws-java-sdk-bundle-1.12.769.jar"
                f",{JARS_BASE_PATH}/protobuf-java-3.2.0.jar"
            )
        )
        ...
    )
@clamar14 , @nastra Try after adding below property:
.config("spark.sql.catalog.AwsDataCatalog.client.region","us-south")
or
.config("spark.sql.catalog.your_catalog_name.client.region","us-south")
Thank you all, I finally solved it this way:
.config("spark.sql.catalog.AwsDataCatalog.s3.access-key-id", "xxx")
.config("spark.sql.catalog.AwsDataCatalog.s3.secret-access-key", "xxx")
.config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.4,org.slf4j:slf4j-simple:1.6.1,org.slf4j:slf4j-api:1.6.1,org.projectnessie.nessie-integrations:nessie-spark-extensions-3.5_2.12:0.91.2,org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2,software.amazon.awssdk:bundle:2.17.257,software.amazon.awssdk:url-connection-client:2.17.257")
.config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,org.projectnessie.spark.extensions.NessieSparkSessionExtensions")
.config("spark.kryo.registrator", "org.apache.sedona.core.serde.SedonaKryoRegistrator")
.config("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog")
.config("spark.sql.catalog.nessie.catalog-impl", "org.apache.iceberg.nessie.NessieCatalog")
.config("spark.sql.catalog.nessie.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
.config("spark.sql.catalog.nessie.warehouse", "s3a://xxx/nessie")
.config("spark.sql.catalog.nessie.s3.endpoint", "https://xxx")
.config("spark.sql.catalog.nessie.uri", "http://xxx")
.config("spark.sql.catalog.nessie.ref", "main")
.config("spark.sql.catalog.nessie.authentication.type", "NONE")
.config("spark.sql.warehouse.dir", "s3a://xxx/nessie")
.config("spark.sql.catalog.nessie.client.credentials-provider", "software.amazon.awssdk.auth.credentials.SystemPropertyCredentialsProvider")
.config("spark.driver.extraJavaOptions", "-Daws.region=eu-central-1")
.config("spark.executor.extraJavaOptions", "-Daws.region=eu-central-1")
.config("spark.sql.catalog.nessie.s3.access-key-id", "xxx")
.config("spark.sql.catalog.nessie.s3.secret-access-key", "xxx")
In particular, the extraJavaOptions configurations were helpful in removing the "Unable to load region" error, while the last two lines solved the "Unable to load credentials" error.
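Distilled from the config above, the two fixes can be sketched as follows. This is only my reading of the thread, not a verified recipe; the region, catalog name, and key values are placeholders.

```scala
import org.apache.spark.sql.SparkSession

// Sketch: the extraJavaOptions set the SDK's aws.region system property on
// driver and executors (fixing "Unable to load region"), while the
// catalog-scoped s3.* keys supply credentials to S3FileIO (fixing
// "Unable to load credentials"). Values are placeholders.
val spark = SparkSession.builder()
  .config("spark.driver.extraJavaOptions", "-Daws.region=eu-central-1")
  .config("spark.executor.extraJavaOptions", "-Daws.region=eu-central-1")
  .config("spark.sql.catalog.nessie.s3.access-key-id", "xxx")
  .config("spark.sql.catalog.nessie.s3.secret-access-key", "xxx")
  .getOrCreate()
```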
@andythsu @clamar14
Please try the AWS custom credentials provider approach below if needed; I resolved this issue using the implementation below.
val spark = SparkSession.builder().master("local[*]")
// .config("spark.sql.defaultCatalog", "glue_catalog")
.config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
.config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
.config("spark.sql.catalog.glue_catalog.warehouse", "s3://bucket/test/icebergsorted/otfdb/")
.config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
.config("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
.config("spark.sql.catalog.glue_catalog.client.credentials-provider", "example.CustomCredentialProvider")
.config("spark.sql.catalog.glue_catalog.client.region", "us-east-1")
.config("spark.sql.catalog.glue_catalog.client.credentials-provider.accessKeyId", "xxx")
.config("spark.sql.catalog.glue_catalog.client.credentials-provider.secretAccessKey", "xxx")
.getOrCreate()
Please note the CustomCredentialProvider class mentioned above; it was implemented as below.
import java.util
import software.amazon.awssdk.auth.credentials.{AwsBasicCredentials, AwsCredentials, AwsCredentialsProvider}

class CustomCredentialProvider extends AwsCredentialsProvider {
  private var credentials: AwsCredentials = null

  def this(keys: util.Map[String, String]) {
    this()
    credentials = AwsBasicCredentials.create(keys.get("accessKeyId"), keys.get("secretAccessKey"))
  }

  override def resolveCredentials: AwsCredentials = this.credentials
}

object CustomCredentialProvider {
  // Create a credentials provider that always returns the provided set of credentials.
  def create(keys: util.Map[String, String]): CustomCredentialProvider = {
    new CustomCredentialProvider(keys)
  }
}
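If it helps others: my understanding (worth verifying against the Iceberg AWS docs) is that Iceberg loads the class named by client.credentials-provider reflectively and passes the client.credentials-provider.* sub-properties to its static create(Map) factory, which is why the companion object matters. A hypothetical stand-in for that call, assuming the CustomCredentialProvider class above is on the classpath:

```scala
import java.util

// The client.credentials-provider.* config keys arrive as a Map and are
// handed to the static create(Map) factory.
val keys = new util.HashMap[String, String]()
keys.put("accessKeyId", "xxx")
keys.put("secretAccessKey", "xxx")

val provider = CustomCredentialProvider.create(keys)
val creds = provider.resolveCredentials // the fixed AwsBasicCredentials built from the map
```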
@andythsu @clamar14 no extra jars or other configuration are needed. Based on the above details it is working fine.
@clamar14's suggestion of
.config("spark.driver.extraJavaOptions", "-Daws.region=us-west-1")
was the only way I could find to provide a region as part of the PySpark SparkSession configuration.
Given the following package versions:
Hi Team ,
We are using the below code to access an Iceberg table from the Glue catalog, with data stored in S3.
var spark = SparkSession.builder().master("local[*]")
  .config("spark.sql.defaultCatalog", "AwsDataCatalog")
  .config("spark.sql.catalog.AwsDataCatalog", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.AwsDataCatalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
  .config("spark.sql.catalog.AwsDataCatalog.io-imp", "org.apache.iceberg.aws.s3.S3FileIO")
  .config("spark.hadoop.fs.s3a.access.key", "XXXXXXXXXXXXXXXXXXXXXXxxx")
  .config("spark.hadoop.fs.s3a.secret.key", "XXXXXXXXXXXXXXXXXXXXXXXXXXXx")
  .config("spark.hadoop.fs.s3a.aws.region", "us-west-2")
  .getOrCreate();
val df1 = spark.sql("select * from default.iceberg_table_exercise1");
Error- Exception in thread "main" software.amazon.awssdk.core.exception.SdkClientException: Unable to load credentials from any of the providers in the chain AwsCredentialsProviderChain(credentialsProviders=[SystemPropertyCredentialsProvider(), EnvironmentVariableCredentialsProvider(), WebIdentityTokenCredentialsProvider(), ProfileCredentialsProvider(profileName=default, profileFile=ProfileFile(profilesAndSectionsMap=[])), ContainerCredentialsProvider(), InstanceProfileCredentialsProvider()]) : [SystemPropertyCredentialsProvider(): Unable to load credentials from system settings. Access key must be specified either via environment variable (AWS_ACCESS_KEY_ID) or system property (aws.accessKeyId)., EnvironmentVariableCredentialsProvider(): Unable to load credentials from system settings. Access key must be specified either via environment variable (AWS_ACCESS_KEY_ID) or system property (aws.accessKeyId)., WebIdentityTokenCredentialsProvider(): Either the environment variable AWS_WEB_IDENTITY_TOKEN_FILE or the javaproperty aws.webIdentityTokenFile must be set., ProfileCredentialsProvider(profileName=default, profileFile=ProfileFile(profilesAndSectionsMap=[])): Profile file contained no credentials for profile 'default': ProfileFile(profilesAndSectionsMap=[]), ContainerCredentialsProvider(): Cannot fetch credentials from container - neither AWS_CONTAINER_CREDENTIALS_FULL_URI or AWS_CONTAINER_CREDENTIALS_RELATIVE_URI environment variables are set., InstanceProfileCredentialsProvider(): Failed to load credentials from IMDS.]
This code throws "unable to load access and secret key", but when we pass this information with System.setProperty() it works.
However, our requirement is to set it at the Spark level, not the system level.
Jars we used : - iceberg-spark-runtime-3.5_2.12-1.5.0 , iceberg-aws-bundle-1.5.0
Could anyone please help ASAP.
Thanks,