jordanpolun opened this issue 4 months ago
Hi @jordanpolun
If you want to work with both iceberg and non iceberg tables, you need to use SparkSessionCatalog.
org.apache.iceberg.spark.SparkSessionCatalog adds support for Iceberg tables to Spark's built-in catalog and delegates to the built-in catalog for non-Iceberg tables. For more details, see [SparkSessionCatalog configuration](https://iceberg.apache.org/docs/latest/spark-configuration/#catalogs).
You'll need to specify the following configurations for the Glue session:
```
--conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog
--conf spark.sql.catalog.spark_catalog.warehouse=s3://al-gdo-dev-ww-dl-0139-transfo/data
--conf spark.sql.catalog.spark_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog
--conf spark.sql.catalog.spark_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
```
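For reference, in a dbt-glue profile these Spark settings go into the `conf` field of the target. A minimal sketch only — the profile name, role ARN, region, and bucket are placeholders, and the surrounding target options should be checked against your own profiles.yml:

```yaml
# Hypothetical dbt-glue target; names, ARNs, and buckets are placeholders.
my_profile:
  target: dev
  outputs:
    dev:
      type: glue
      role_arn: arn:aws:iam::123456789012:role/my-glue-role
      region: us-east-1
      workers: 2
      worker_type: G.1X
      schema: iceberg_db
      location: s3://my-bucket/data
      datalake_formats: iceberg
      # dbt-glue convention: the first setting has no --conf prefix,
      # subsequent ones do.
      conf: >
        spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog
        --conf spark.sql.catalog.spark_catalog.warehouse=s3://my-bucket/data
        --conf spark.sql.catalog.spark_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog
        --conf spark.sql.catalog.spark_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO
        --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
```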
Keep in mind that using the SparkSessionCatalog makes CTAS and RTAS non-atomic operations.
@aiss93 I used the following conf:

```yaml
conf: >
  spark.sql.legacy.timeParserPolicy=LEGACY
  --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog
  --conf spark.sql.catalog.spark_catalog.warehouse=s3://some-bucket/metadata
  --conf spark.sql.catalog.spark_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog
  --conf spark.sql.catalog.spark_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
  --conf spark.sql.legacy.allowNonEmptyLocationInCTAS=true
  --conf spark.serializer=org.apache.spark.serializer.JavaSerializer
```
And I got this error:

```
org.apache.iceberg.exceptions.ValidationException: Input Glue table is not an iceberg table: spark_catalog.iceberg_db.my_table (type=null)
```

Are there any missing configurations that you could suggest?
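For what it's worth, the `(type=null)` in that message comes from Iceberg's GlueCatalog looking up the Glue table's `table_type` parameter, which is set to `ICEBERG` for Iceberg tables and is absent for plain JSON/Parquet catalog tables. A rough standalone sketch of that check (the helper name and sample dicts are mine, not dbt-glue or Iceberg code; the dict shape mirrors what Glue's `GetTable` API returns):

```python
# Iceberg's GlueCatalog recognizes a Glue table as Iceberg by its
# "table_type" parameter; "(type=null)" in the error suggests the
# parameter is missing on the source table.
def is_iceberg_table(glue_table: dict) -> bool:
    """Return True if a Glue table dict carries table_type=ICEBERG."""
    params = glue_table.get("Parameters") or {}
    return params.get("table_type", "").upper() == "ICEBERG"

# A JSON-backed catalog table typically has a classification but no table_type:
json_table = {"Name": "my_table", "Parameters": {"classification": "json"}}
iceberg_table = {"Name": "other_table", "Parameters": {"table_type": "ICEBERG"}}

print(is_iceberg_table(json_table))     # False
print(is_iceberg_table(iceberg_table))  # True
```

So the question is less about Spark configuration and more about what metadata the source table carries in the Glue Data Catalog.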
Describe the bug
When trying to use `{{ source() }}` on a non-Iceberg table, a dbt-glue project set up for the iceberg datalake format will refuse to read the source because it's not an Iceberg table. The table is available in the Glue Data Catalog, but the underlying files are JSON. The full logs are below. Everything worked smoothly before I tried to add Iceberg into the equation: I was able to read the source data from JSON and write it to Parquet, just not Iceberg.
Steps To Reproduce
- profiles.yaml
- sources.yaml
- dbt_project.yaml
- models/load/maf/load_maf_consent.sql
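The contents of the files above did not survive extraction. As an illustration only, a model hitting this bug would typically read a non-Iceberg source and materialize to Iceberg; every name below (source, table, config values) is a placeholder, not the reporter's actual code:

```sql
-- Hypothetical shape of models/load/maf/load_maf_consent.sql;
-- source and table names are placeholders.
{{ config(
    materialized='table',
    file_format='iceberg'
) }}

select *
from {{ source('raw_json', 'consent_events') }}
```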
Expected behavior
dbt-glue should be able to read from any table accessible via the Glue Data Catalog and write back to S3 in Iceberg format.

Screenshots and log output
System information
The output of `dbt --version`:
The operating system you're using: macOS
The output of `python --version`: Python 3.12.4

Additional context
Nothing at the moment