Unable to query Iceberg table from PySpark script in AWS Glue

Shubham-Jha-GT commented 2 years ago

I'm trying to read data from an Iceberg table. The data is in ORC format and partitioned by column. I'm getting this error:

AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch table temp_tag_thrshld_iceberg. StorageDescriptor#InputFormat cannot be null for table: temp_tag_thrshld_iceberg (Service: null; Status Code: 0; Error Code: null; Request ID: null; Proxy: null)

This is my code:

    spark = SparkSession.builder.config("spark.driver.memory", "25g").appName(app_name).getOrCreate()
    temp_tag_thrshld_data = spark.sql("SELECT * FROM dev_db.temp_tag_thrshld_iceberg")

If I replace the query with spark.sql("SELECT * FROM a_normal_athena_table"), the code runs fine. I'm also not able to read the data directly from S3: it's in ORC format with Snappy compression, and I don't get any results (I'm probably missing the correct framework to read ORC from S3 directly, but that's another issue for another day).

I've tried validating my table using:

    aws glue get-table --database-name dev_db --name temp_tag_thrshld_iceberg

and this is the output I got:

    {
      "Table": {
        "Name": "temp_tag_thrshld_iceberg",
        "DatabaseName": "dev_db",
        "CreateTime": 1658864256.0,
        "UpdateTime": 1658864347.0,
        "Retention": 0,
        "StorageDescriptor": {
          "Columns": [
            {
              "Name": "tag",
              "Type": "int",
              "Parameters": {
                "iceberg.field.current": "true",
                "iceberg.field.id": "1",
                "iceberg.field.optional": "true"
              }
            },
            {
              "Name": "zipcode",
              "Type": "int",
              "Parameters": {
                "iceberg.field.current": "true",
                "iceberg.field.id": "2",
                "iceberg.field.optional": "true"
              }
            },
            {
              "Name": "threshold_max",
              "Type": "double",
              "Parameters": {
                "iceberg.field.current": "true",
                "iceberg.field.id": "3",
                "iceberg.field.optional": "true"
              }
            },
            {
              "Name": "level",
              "Type": "string",
              "Parameters": {
                "iceberg.field.current": "true",
                "iceberg.field.id": "4",
                "iceberg.field.optional": "true"
              }
            }
          ],
          "Location": "s3://dev_db/athena-tables/temp_tag_thrshld_iceberg",
          "Compressed": false,
          "NumberOfBuckets": 0,
          "SortColumns": [],
          "StoredAsSubDirectories": false
        },
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {
          "metadata_location": "s3://dev_db/athena-tables/temp_tag_thrshld_iceberg/metadata/00001-0ee5fbc7-044e-439d-aa1e-d76935002ebd.metadata.json",
          "previous_metadata_location": "s3://dev_db/athena-tables/temp_tag_thrshld_iceberg/metadata/00000-3a8f33f0-fbef-48c3-b289-6021f62b8b8c.metadata.json",
          "table_type": "ICEBERG"
        },
        "CreatedBy": "IAM Details",
        "IsRegisteredWithLakeFormation": false,
        "CatalogId": "571708111280",
        "VersionId": "1"
      }
    }
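The get-table output is consistent with the error message: the entry has table_type set to ICEBERG and no InputFormat key under StorageDescriptor, so a session that resolves the table through the plain Hive-style catalog would likely fail in the way shown. A minimal sketch of that check follows. The helper name summarize_glue_table is hypothetical, and the response dictionary is a trimmed copy of the output above rather than a live API call.

```python
def summarize_glue_table(response):
    """Pull out the fields that show whether a Glue entry is an Iceberg table."""
    table = response["Table"]
    descriptor = table.get("StorageDescriptor", {})
    parameters = table.get("Parameters", {})
    return {
        # Iceberg tables registered in Glue carry this table parameter.
        "is_iceberg": parameters.get("table_type", "").upper() == "ICEBERG",
        # A Hive-style reader expects an InputFormat in the storage descriptor.
        "has_input_format": bool(descriptor.get("InputFormat")),
        "metadata_location": parameters.get("metadata_location"),
    }


# Trimmed copy of the get-table output from this thread (columns omitted).
response = {
    "Table": {
        "Name": "temp_tag_thrshld_iceberg",
        "DatabaseName": "dev_db",
        "StorageDescriptor": {
            "Location": "s3://dev_db/athena-tables/temp_tag_thrshld_iceberg",
        },
        "Parameters": {
            "metadata_location": "s3://dev_db/athena-tables/temp_tag_thrshld_iceberg/metadata/00001-0ee5fbc7-044e-439d-aa1e-d76935002ebd.metadata.json",
            "table_type": "ICEBERG",
        },
    }
}

summary = summarize_glue_table(response)
print(summary)
```

If is_iceberg is true and has_input_format is false, the table probably needs to be read through an Iceberg catalog rather than as a regular Hive or Athena table.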

Shubham-Jha-GT commented 2 years ago

I updated the config to this (based on the Iceberg table configuration):

    spark = (
        SparkSession.builder
        .config("spark.driver.memory", "25g")
        .config("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog")
        .config("spark.sql.catalog.spark_catalog.type", "hive")
        .appName(app_name)
        .getOrCreate()
    )

Now I'm getting a new error:

An error occurred while calling o87.sql. Cannot find catalog plugin class for catalog 'spark_catalog': org.apache.iceberg.spark.SparkSessionCatalog

lgbaeza commented 2 years ago

> Updated the config to this (based on iceberg table configuration):
>
> spark = SparkSession.builder.config("spark.driver.memory", "25g").config("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog").config("spark.sql.catalog.spark_catalog.type", "hive").appName(app_name).getOrCreate()
>
> I'm getting this new error - An error occurred while calling o87.sql. Cannot find catalog plugin class for catalog 'spark_catalog': org.apache.iceberg.spark.SparkSessionCatalog

That error seems to be caused by the Iceberg dependencies missing from the job's classpath. Have you configured your Glue job with the Iceberg connector from AWS Marketplace? This post shows how: https://aws.amazon.com/es/blogs/big-data/implement-a-cdc-based-upsert-in-a-data-lake-using-apache-iceberg-and-aws-glue/
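Once the Iceberg jars are available to the job, the session still needs a catalog that points at Glue. The sketch below collects the settings described in the Iceberg AWS integration docs into a dictionary and applies them to a session builder. The catalog name glue_catalog, the warehouse path, and both helper names are assumptions made for illustration, not values confirmed in this thread.

```python
def iceberg_glue_catalog_conf(catalog_name, warehouse_path):
    """Return Spark settings that register an Iceberg catalog backed by AWS Glue."""
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        f"{prefix}.warehouse": warehouse_path,
    }


def build_session(app_name, conf):
    """Create a SparkSession with the given settings applied."""
    # Imported lazily so the settings helper works without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Assumed values: the catalog name and warehouse path are placeholders.
conf = iceberg_glue_catalog_conf("glue_catalog", "s3://dev_db/athena-tables/")
for key, value in conf.items():
    print(f"{key}={value}")

# Inside the Glue job (requires the Iceberg runtime on the classpath):
# spark = build_session("my-app", conf)
# spark.sql("SELECT * FROM glue_catalog.dev_db.temp_tag_thrshld_iceberg")
```

With a named catalog like this, the query addresses the table as catalog.database.table. These settings do not help on their own if the Iceberg classes are absent; in that case the "Cannot find catalog plugin class" error would be expected to persist.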

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

github-actions[bot] commented 1 year ago

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'

apache / iceberg

Unable to query Iceberg table from PySpark script in AWS Glue #5369