JanKrl closed this issue 3 months ago.
Another relevant finding for this issue: when creating a new table from a seed, a non-Iceberg table is created:
Input format: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
Output format: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Serde serialization lib: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
This article about dbt and Glue doesn't mention this specifically, but it seems dbt-glue is not able to read Iceberg tables (InputFormat cannot be null). In their setup they use Hive tables for the intermediate stage and Iceberg only for the final layer. Furthermore, it doesn't work on Glue 4.0 but it seems to work on Glue 3.0.
Can anyone confirm my conclusion that Iceberg tables can be used only in the final stage of the processing pipeline?
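One way to check which kind of table a seed actually produced is to look at the table's parameters in the Glue Data Catalog: Iceberg tables registered there carry `table_type = ICEBERG`, while the Hive/Parquet table shown above does not. A small sketch (the helper function and table names are illustrative, not part of dbt-glue):

```python
def is_iceberg_table(table: dict) -> bool:
    """Return True if a Glue `Table` object describes an Iceberg table.

    Iceberg tables in the Glue Data Catalog carry `table_type = ICEBERG`
    in their parameters; seed-created Hive/Parquet tables do not.
    """
    params = table.get("Parameters", {})
    return params.get("table_type", "").upper() == "ICEBERG"


# In practice you would fetch the table with boto3, e.g.:
#   import boto3
#   table = boto3.client("glue").get_table(
#       DatabaseName="my_db", Name="countries")["Table"]
#   print(is_iceberg_table(table))
```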
Have the same issue with Iceberg. Maybe it's also related to the fact that I use Lake Formation.
hi, i have the same issue (dbt and dbt-glue 1.7, glue 4.0, with lake formation), so i tried replicating the dbt code and running it in a glue notebook, and i did get the exact same error in the notebook as well.
adding `glue_catalog.` to the table name did work for me in the notebook, but i couldn't really apply this solution to dbt, since i don't have control over that piece of code.
instead - i added these configs:
.config("spark.sql.defaultCatalog", "glue_catalog") \
.config("spark.sql.catalog.glue_catalog.default-namespace", "via_stage") \
that also worked in the notebook, since the job now used my catalog instead of the default one (named `default`).
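for reference, a sketch of the full notebook session those two configs slot into (assuming the Iceberg/AWS jars that Glue's `--datalake-formats iceberg` option provides are on the classpath; the namespace `via_stage` is the one from this thread):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue_catalog",
            "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.io-impl",
            "org.apache.iceberg.aws.s3.S3FileIO")
    # the two configs from above: make glue_catalog the default catalog
    .config("spark.sql.defaultCatalog", "glue_catalog")
    .config("spark.sql.catalog.glue_catalog.default-namespace", "via_stage")
    .getOrCreate()
)

# with defaultCatalog set, unqualified names resolve through glue_catalog,
# so queries no longer need the explicit `glue_catalog.` prefix
```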
however, i still couldn't get dbt to work, even though i added these two configs in the profiles yaml. i have a suspicion that dbt is not using these configs properly...
to conclude - i've identified 2 problems:
update - got it working, don't know why it didn't work before...
the solution was adding the default configs -
--conf spark.sql.defaultCatalog=glue_catalog
--conf spark.sql.catalog.glue_catalog.default-namespace=<schema>
> update - got it working, don't know why it didn't work before...
> the solution was adding the default configs -
> --conf spark.sql.defaultCatalog=glue_catalog --conf spark.sql.catalog.glue_catalog.default-namespace=<schema>
When trying this I get the error: `Catalog 'glue_catalog' plugin class not found: spark.sql.catalog.glue_catalog is not defined`. I tried with both Glue 3.0 and 4.0.
This is my conf now:
--conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog
--conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog
--conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
--conf spark.sql.defaultCatalog=glue_catalog
--conf spark.sql.catalog.glue_catalog.default-namespace=<schema-name>
@JanKrl this is exactly what i have (only `spark.sql.catalog.glue_catalog.warehouse` might be missing) and it's working for me.
did you make sure to leave out the first `--conf` from the string? i made that mistake 😅 so no config was actually used
conf: >
  spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog
  --conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog
  ...
That did the trick!
Plus, I had to set `--conf spark.sql.catalog.glue_catalog.warehouse=<s3-bucket>` due to the error: `IllegalArgumentException: Cannot initialize GlueCatalog because warehousePath must not be null`.
For the sake of clarity, here is the full config:
type: glue
glue_version: "3.0"
query-comment: DBT model for Iceberg tables
role_arn: <role-arn>
region: eu-central-1
location: <s3-bucket>
schema: <schema-name>
session_provisioning_timeout_in_seconds: 120
workers: 2
worker_type: G.1X
idle_timeout: 5
datalake_formats: iceberg
conf: >
  spark.sql.defaultCatalog=glue_catalog
  --conf spark.sql.catalog.glue_catalog.warehouse=<s3-bucket>
  --conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog
  --conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog
  --conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
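Since the profile's `conf` value is one long string where dbt-glue supplies the first `--conf` itself, the joining is easy to get wrong (see the mistake discussed above). A small hypothetical helper, not part of dbt-glue, that renders a dict of Spark settings into that shape:

```python
def render_dbt_glue_conf(settings: dict) -> str:
    """Join Spark settings into a dbt-glue `conf` string.

    dbt-glue prepends `--conf ` to the whole string itself, so the first
    key=value pair must be bare and each later pair carries its own
    `--conf` prefix.
    """
    pairs = [f"{key}={value}" for key, value in settings.items()]
    return " --conf ".join(pairs)


conf = render_dbt_glue_conf({
    "spark.sql.defaultCatalog": "glue_catalog",
    "spark.sql.catalog.glue_catalog": "org.apache.iceberg.spark.SparkCatalog",
})
# -> "spark.sql.defaultCatalog=glue_catalog --conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog"
```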
Describe the bug
When reading a source table (Iceberg) I get the following error:
After some googling, I found a suggestion to add `glue_catalog.` before the table name. This results with:
Steps To Reproduce
Apache Iceberg
As far as I can tell, this is the expected outcome.
I also tried all sorts of additional configs based on what I found online:
sources:
select country_name from glue_catalog.{{ source('data_source', 'countries') }}
Core:
latest: 1.7.10 - Update available!
Your version of dbt-core is out of date! You can find instructions for upgrading here: https://docs.getdbt.com/docs/installation
Plugins:
The operating system you're using:
The output of `python --version`: Python 3.11.0
Additional context