roberto-rosero opened 1 year ago
I observe similar behaviour. Tables are registered in the Hive Metastore. This can be reproduced as follows:
Create the test schema:
CREATE DATABASE IF NOT EXISTS test LOCATION 'gs://my-project/my-bucket'
Then run the following snapshot:
{% snapshot test_snapshot %}
{{
    config(
        strategy='timestamp',
        unique_key='id',
        target_schema='test',
        updated_at='date',
        file_format='iceberg'
    )
}}
SELECT 1 AS id, CURRENT_DATE() AS date
{% endsnapshot %}
The first time it runs fine, as @roberto-rosero mentioned; the second time it indeed fails. In Spark I defined the Iceberg catalog as follows:
spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog
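For context, when wrapping the built-in session catalog like this, the Iceberg docs also expect a catalog type to be set. A minimal spark-defaults sketch of the full setup (the `type=hive` line is an assumption based on the Hive Metastore setup described above, not part of my original config):

```
spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog
spark.sql.catalog.spark_catalog.type=hive
```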
With logs:
15:14:28.306183 [info ] [MainThread]: Completed with 1 error and 0 warnings:
15:14:28.306706 [info ] [MainThread]:
15:14:28.307121 [error] [MainThread]: Compilation Error in snapshot test_snapshot (snapshots/test_snapshot.sql)
15:14:28.307517 [error] [MainThread]: The existing table test.test_snapshot is in another format than 'delta' or 'iceberg' or 'hudi'
15:14:28.307896 [error] [MainThread]:
15:14:28.308272 [error] [MainThread]: > in macro materialization_snapshot_spark (macros/materializations/snapshot.sql)
15:14:28.308649 [error] [MainThread]: > called by snapshot test_snapshot (snapshots/test.sql)
It does work if I explicitly include the catalog in the target_schema:
{% snapshot test_snapshot %}
{{
    config(
        strategy='timestamp',
        unique_key='id',
        target_schema='spark_catalog.test',
        updated_at='date',
        file_format='iceberg'
    )
}}
SELECT 1 AS id, CURRENT_DATE() AS date
{% endsnapshot %}
For normal dbt tables it (re)runs fine without explicitly specifying the metastore. I tried diving into the code at the location indicated by the logs (macros/materializations/snapshot.sql), but had a difficult time running the macro correctly and figuring out why this goes wrong. I am using the same setup as OP.
Any help is appreciated!
I'm encountering a similar issue. When I explicitly include the catalog in the target_schema, subsequent runs use a CREATE OR REPLACE statement instead of performing a MERGE.
The same thing is happening to us; in our case the table is Iceberg, but the provider it uses is Hive. Reviewing impl.py in dbt-spark and debugging our code, we found that the condition for the Hive provider is never met, even though the table is Iceberg. This can be seen in the definition of the build_spark_relation_list method.
We understand this to be a bug in impl.py, since the table is of type Iceberg.
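As an illustration, here is a simplified model of the problem (plain Python, not the actual impl.py code; the function name and mapping are assumptions): when the format flags are derived from the provider string reported by the metastore, a Hive-backed Iceberg table gets misclassified.

```python
# Simplified, illustrative model (NOT the actual dbt-spark code) of
# deriving a relation's format flags from the provider reported by
# DESCRIBE TABLE EXTENDED.

def classify_relation(provider: str) -> dict:
    """Map the metastore-reported provider string to format flags."""
    p = (provider or "").lower()
    return {
        "is_delta": p == "delta",
        "is_iceberg": p == "iceberg",
        "is_hudi": p == "hudi",
    }

# With a native Iceberg catalog the provider is reported as 'iceberg',
# so the flag is set correctly:
assert classify_relation("iceberg")["is_iceberg"] is True

# But with a Hive Metastore the provider can come back as 'hive' even
# for an Iceberg table, so is_iceberg is False and the snapshot macro
# raises "The existing table ... is in another format than 'delta' or
# 'iceberg' or 'hudi'":
assert classify_relation("hive")["is_iceberg"] is False
```

This matches the behavior we observed while debugging: the table really is Iceberg, but the check never sees it as such.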
To work around it, we overrode the snapshot macro at the project level and removed the check that validates the table's format.
Code removed from the snapshot macro:

{%- if target_relation_exists -%}
  {%- if not target_relation.is_delta and not target_relation.is_iceberg and not target_relation.is_hudi -%}
    {% set invalid_format_msg -%}
      The existing table {{ model.schema }}.{{ target_table }} is in another format than 'delta' or 'iceberg' or 'hudi'
    {%- endset %}
    {% do exceptions.raise_compiler_error(invalid_format_msg) %}
  {% endif %}
{% endif %}
This is how we have obtained the desired snapshot behavior so far.
Is this a new bug in dbt-spark?
Current Behavior
The first run of dbt snapshot succeeds, but on the second run the error in the title of this bug occurs.
Expected Behavior
The snapshot should run like the first time.
Steps To Reproduce
Relevant log output
No response
Environment
Additional Context
No response