Open · BrentonAD opened this issue 1 year ago
I have the same issue with insert_overwrite. Were you able to fix this?
@BrentonAD your conf seems wrong, especially for "spark.sql.catalog.glue_catalog.warehouse": you have a double s3://.
@rickiesmooth what's your conf? Can you validate that you have the key spark.sql.catalog.glue_catalog.warehouse?
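For reference, a minimal sketch of what that part of the Spark conf usually looks like for Iceberg with the Glue catalog (bucket name and prefix are placeholders, not from this thread):
--conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog
--conf spark.sql.catalog.glue_catalog.warehouse=s3://your-bucket/your-prefix
--conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog
--conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO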
@AmineIzanami Thank you for your reply. That mistake was simply me mistyping while redacting company-specific bucket names, etc. I can confirm that mistake is not in the config when I am facing this issue; sorry about the confusion.
I want to confirm again that the load runs correctly the first time, just not on subsequent incremental runs, so I am somewhat confident my config is correct.
@AmineIzanami thank you for looking into this! My conf looks like this:
glue_test:
  target: dev
  outputs:
    dev:
      type: glue
      query-comment: glue_test
      role_arn: "arn:aws:iam::XXXXXXXXX:role/ci_cron_role"
      region: us-east-1
      glue_version: "4.0"
      workers: 2
      worker_type: G.1X
      schema: "glue_test"
      session_provisioning_timeout_in_seconds: 60
      location: "s3://glue-test-dev-pipeline-data-lake-output/"
      datalake_formats: iceberg
      conf: --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.sql.warehouse=s3://glue-test-prod-pipeline-data-lake-output/glue_test --conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog --conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog --conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO --conf spark.sql.catalog.glue_catalog.lock-impl=org.apache.iceberg.aws.dynamodb.DynamoDbLockManager --conf spark.sql.catalog.glue_catalog.lock.table=myGlueLockTable --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
I've had a similar issue when trying the delta datalake format.
Maybe it's good to note that with iceberg, runs for my "base" table (which doesn't use any refs) all succeed; it's the tables that reference this "base" table that error.
@mehdimld pointed me to a workaround in slack:
For now, if you want to ref an Iceberg model built with dbt-glue, you need to prefix the ref('other_model') with "glue_catalog". Please see below an example where merge_customers is an Iceberg table built with dbt-glue:
{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key=["customer_id"],
    file_format='iceberg',
    partition_by=['dt'],
    table_properties={'write.target-file-size-bytes': '268435456'}
) }}

select * from glue_catalog.{{ ref('merge_customers') }}
This fixes the issue!
Describe the bug
I have a simple model (for the sake of argument, called my_table) which loads data from an existing Iceberg table (not created with dbt-glue) and writes to a new Iceberg table:
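The original model body was not included in the thread; a minimal sketch of what it looks like, using the append strategy described under "Expected behavior" and the staging source mentioned in the additional context (everything else is a placeholder), would be:

{{ config(
    materialized='incremental',
    incremental_strategy='append',
    file_format='iceberg'
) }}

-- placeholder select; the actual model was redacted in the issue
select * from {{ source('staging', 'raw') }}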
On the first run of the model, the table successfully loads and I can query it from AWS Athena as expected; however, on subsequent runs of the model I receive the following error, indicating that the table cannot be found in the glue_catalog.
My configuration in my profiles.yml is as follows:
Steps To Reproduce
Expected behavior
Subsequent runs of dbt model will perform an incremental load as per the append strategy, rather than cause an error.
Screenshots and log output
If applicable, add screenshots or log output to help explain your problem.
System information
The output of dbt --version:
The operating system you're using: VS Code Dev container running Ubuntu 22.04.3 LTS
The output of python --version: Python 3.11.4
Additional context
I have tried many different Spark configurations, including different names for the Iceberg catalog alias to match the Glue catalog name, and matching that name in the prefix of the query:
SELECT * FROM <prefix>.{{ source('staging','raw') }}