Facing error when creating iceberg table in EMR using Glue catalog

arunb2w commented 2 years ago

Apache Iceberg version

0.14.0

Query engine

EMR

Please describe the bug 🐞

I get an error when creating an Iceberg table on EMR using the Glue catalog. Spark version: 3.2.1. Iceberg version: 0.14.0.

Sample code:

from pyspark.sql import SparkSession

catalog = "glue_dev"
warehouse_path = "s3_bucket"
database = "test"
table_name = "EPAYMENT"

# Register an Iceberg catalog backed by Glue, with S3FileIO for storage.
spark = SparkSession \
            .builder \
            .config(f'spark.sql.catalog.{catalog}', 'org.apache.iceberg.spark.SparkCatalog') \
            .config(f'spark.sql.catalog.{catalog}.warehouse', f'{warehouse_path}') \
            .config(f'spark.sql.catalog.{catalog}.catalog-impl', 'org.apache.iceberg.aws.glue.GlueCatalog') \
            .config(f'spark.sql.catalog.{catalog}.io-impl', 'org.apache.iceberg.aws.s3.S3FileIO') \
            .config('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions') \
            .config('spark.sql.catalog.spark_catalog', 'org.apache.iceberg.spark.SparkSessionCatalog') \
            .config('spark.sql.catalog.spark_catalog.type', 'hive') \
            .appName("IcebergDatalake") \
            .getOrCreate()

df = spark.createDataFrame([
        ("100", "2015-01-01", "2015-01-01T13:51:39.340396Z"),
        ("101", "2015-01-01", "2015-01-01T12:14:58.597216Z"),
        ("102", "2015-01-01", "2015-01-01T13:51:40.417052Z"),
        ("103", "2015-01-01", "2015-01-01T13:51:40.519832Z")
    ], ["id", "creation_date", "last_update_time"])

# Create the Iceberg table in the Glue catalog (this is the call that fails).
df.writeTo(f"{catalog}.{database}." + table_name).using("iceberg").create()

Spark command used to run: spark-submit --deploy-mode cluster --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:0.14.0,software.amazon.awssdk:bundle:2.17.257,software.amazon.awssdk:url-connection-client:2.17.257 --conf spark.yarn.submit.waitAppCompletion=true --conf "spark.executor.extraJavaOptions=-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=\"/opt/spark\"" --conf spark.dynamicAllocation.enabled=true --conf spark.executor.maxMemory=32g --conf spark.dynamicAllocation.executorIdleTimeout=300 --conf spark.shuffle.service.enabled=true --driver-memory 8g --num-executors 1 --executor-memory 32g --executor-cores 5 iceberg_main.py

Error stacktrace:

Traceback (most recent call last):
  File "iceberg_main.py", line 899, in <module>
    bootstrap_table(tableName, spark, write_type, is_local_run, hive_sync_enabled, database, catalog)
  File "iceberg_main.py", line 428, in bootstrap_table
    bootstrap_to_iceberg(table_name, write_type, spark_session, is_local_run, hive_sync_enabled, database, catalog, stacks)
  File "iceberg_main.py", line 407, in bootstrap_to_iceberg
    df.writeTo(f"{catalog}.{database}." + table_name).using("iceberg").create()
  File "/mnt/yarn/usercache/hadoop/appcache/application_1664278990474_0004/container_1664278990474_0004_01_000001/pyspark.zip/pyspark/sql/readwriter.py", line 1129, in create
  File "/mnt/yarn/usercache/hadoop/appcache/application_1664278990474_0004/container_1664278990474_0004_01_000001/py4j-0.10.9.3-src.zip/py4j/java_gateway.py", line 1322, in __call__
  File "/mnt/yarn/usercache/hadoop/appcache/application_1664278990474_0004/container_1664278990474_0004_01_000001/pyspark.zip/pyspark/sql/utils.py", line 117, in deco
pyspark.sql.utils.IllegalArgumentException: Invalid table identifier: test.EPAYMENT

Please provide insights on what I am missing. The same code works fine if I use the Hadoop catalog instead of Glue.

singhpk234 commented 2 years ago

This is because GlueCatalog applies additional validation to the table name: it may contain only lowercase letters, digits, and underscores, so the uppercase name EPAYMENT is rejected. https://github.com/apache/iceberg/blob/6d2edd6284ebc5301dbe45376a31ca8316852a77/aws/src/main/java/org/apache/iceberg/aws/glue/GlueCatalog.java#L499-L506

You can try setting glue.skip-name-validation via the catalog properties if you want to skip these validations: https://github.com/apache/iceberg/blob/6d2edd6284ebc5301dbe45376a31ca8316852a77/aws/src/main/java/org/apache/iceberg/aws/AwsProperties.java#L106-L114
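As a rough illustration of the check described above, here is a minimal Python sketch. It assumes the Glue table-name rule is equivalent to the regular expression `[a-z0-9_]{1,255}`; the helper names `is_valid_glue_name` and `to_glue_safe_name` are hypothetical and not part of Iceberg. It also shows the simplest workaround, lowercasing the name before creating the table.

```python
import re

# Assumption: Glue table names must be 1-255 characters drawn from
# lowercase letters, digits, and underscores.
GLUE_NAME_PATTERN = re.compile(r"[a-z0-9_]{1,255}")


def is_valid_glue_name(name: str) -> bool:
    """Return True if the whole name matches the assumed Glue pattern."""
    return GLUE_NAME_PATTERN.fullmatch(name) is not None


def to_glue_safe_name(name: str) -> str:
    """Lowercase the name and fail loudly if it still is not acceptable."""
    candidate = name.lower()
    if not is_valid_glue_name(candidate):
        raise ValueError(f"Cannot derive a Glue-compatible name from {name!r}")
    return candidate


if __name__ == "__main__":
    print(is_valid_glue_name("EPAYMENT"))   # uppercase letters are rejected
    print(to_glue_safe_name("EPAYMENT"))    # lowercased variant passes the check
```

Passing the lowercased name to `df.writeTo(...)` avoids the validation error without changing any catalog properties.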

C-h-e-r-r-y commented 2 years ago

can try setting glue.skip-name-validation via catalog properties if you wanna skip these validations :

It is very hard to figure out how to set these properties. Could you please share a small example? I have tried spar.glue.skip-name-validation, spark.sql.glue.skip-name-validation, and spark.sql.catalog.my_catalog.glue.skip-name-validation, with no luck :-(

singhpk234 commented 2 years ago

Ideally,

--conf spark.sql.catalog.{catalog_name}.glue.skip-name-validation=true

should have worked (the property defaults to false, so it has to be set to true to skip the validation). Can you please share the complete Spark confs you are passing, and also the Iceberg version you are trying it with?

Note: this was added in the Iceberg 0.14.0 release.
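To make the property placement concrete, here is a small sketch in Python that assembles the catalog settings as a plain dictionary and prints them as `--conf` flags. The helper `glue_catalog_conf` and the warehouse path `s3://example-bucket/warehouse` are hypothetical. The property keys are the ones quoted in this thread, and the value `true` assumes the flag is a boolean that defaults to false.

```python
def glue_catalog_conf(catalog: str, warehouse: str,
                      skip_name_validation: bool = False) -> dict:
    """Build Spark conf entries for an Iceberg catalog backed by Glue."""
    prefix = f"spark.sql.catalog.{catalog}"
    conf = {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.warehouse": warehouse,
        f"{prefix}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
    }
    if skip_name_validation:
        # Catalog properties are passed under the catalog's own prefix.
        conf[f"{prefix}.glue.skip-name-validation"] = "true"
    return conf


if __name__ == "__main__":
    # Hypothetical catalog name and warehouse location for illustration.
    conf = glue_catalog_conf("glue_dev", "s3://example-bucket/warehouse",
                             skip_name_validation=True)
    for key, value in conf.items():
        print(f"--conf {key}={value}")
```

The same dictionary can be applied in code by calling `.config(key, value)` on the `SparkSession` builder for each entry before `getOrCreate()`.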

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

github-actions[bot] commented 1 year ago

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'
