dbt-labs / dbt-spark

dbt-spark contains all of the code enabling dbt to work with Apache Spark and Databricks
https://getdbt.com
Apache License 2.0

[ADAP-803] The existing table '' is in another format than 'delta' or 'iceberg' or 'hudi' #870

Open roberto-rosero opened 1 year ago

roberto-rosero commented 1 year ago

Is this a new bug in dbt-spark?

Current Behavior

I ran dbt snapshot for the first time and it completed successfully, but on the second run the error in the title of this issue occurs.

Expected Behavior

The snapshot should run successfully, as it did the first time.

Steps To Reproduce

In dbt_project.yml:

snapshots:
  +schema: analytics
  +file_format: iceberg

And the snapshot definition:
{% snapshot customer_snapshot_v2 %}

{{
        config(
          target_schema='my_schema',
          strategy='check',
          unique_key='SocialId',
          check_cols=['Categoria', 'SubCategoria'],
        )
    }}

select * 
from {{ ref("seedCustomer") }}

{% endsnapshot %}

Relevant log output

No response

Environment

- OS:
- Python: 3.10.12
- dbt-core: 1.6
- dbt-spark: 1.6

Additional Context

No response

dondelicaat commented 1 year ago

I observe similar behaviour. Tables are registered in the Hive Metastore. This can be reproduced as follows:

Create the test schema:

CREATE DATABASE IF NOT EXISTS test LOCATION 'gs://my-project/my-bucket'

Then run the following snapshot:

{% snapshot test_snapshot %}

{{
    config(
        strategy='timestamp',
        unique_key='id',
        target_schema='test',
        updated_at='date',
        file_format='iceberg'
) }}

SELECT 1 AS id, CURRENT_DATE() AS date

{% endsnapshot %}

The first time it runs fine, as @roberto-rosero mentioned; the second time it indeed fails. In Spark I defined the Iceberg catalog as follows:

spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog

With logs:

15:14:28.306183 [info ] [MainThread]: Completed with 1 error and 0 warnings:
15:14:28.306706 [info ] [MainThread]: 
15:14:28.307121 [error] [MainThread]: Compilation Error in snapshot test_snapshot (snapshots/test_snapshot.sql)
15:14:28.307517 [error] [MainThread]:   The existing table test.test_snapshot is in another format than 'delta' or 'iceberg' or 'hudi'
15:14:28.307896 [error] [MainThread]:   
15:14:28.308272 [error] [MainThread]:   > in macro materialization_snapshot_spark (macros/materializations/snapshot.sql)
15:14:28.308649 [error] [MainThread]:   > called by snapshot test_snapshot (snapshots/test.sql)

It does work if I explicitly include the catalog in the target_schema:

{% snapshot test_snapshot %}

{{
    config(
        strategy='timestamp',
        unique_key='id',
        target_schema='spark_catalog.test',
        updated_at='date',
        file_format='iceberg'
) }}

SELECT 1 AS id, CURRENT_DATE() AS date

{% endsnapshot %}

For normal dbt tables it (re)runs fine without explicitly specifying the catalog. I tried diving into the code at the location indicated by the logs (macros/materializations/snapshot.sql), but had a difficult time running the macro in isolation and figuring out why this goes wrong. I am using the same setup as the OP.

Any help is appreciated!
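
To make the failing check concrete, here is a minimal Python sketch. This is illustrative only, not dbt-spark's actual code; the metadata dict and the table_type property name are assumptions, based on how Iceberg commonly registers tables in a Hive Metastore (table parameter table_type=ICEBERG while the reported provider can read hive):

```python
# Illustrative sketch: classifying a table's format with a provider-only
# check versus also consulting table properties. Not dbt-spark code; the
# metadata shape below is an assumption for illustration.

ALLOWED_FORMATS = {"delta", "iceberg", "hudi"}

def classify_by_provider(metadata: dict) -> str:
    """Trusts only the reported provider string."""
    return metadata.get("provider", "").lower()

def classify_with_properties(metadata: dict) -> str:
    """Falls back to table properties: Iceberg tables registered in a Hive
    Metastore commonly carry table_type=ICEBERG while reporting the
    provider as 'hive'."""
    provider = classify_by_provider(metadata)
    if provider in ALLOWED_FORMATS:
        return provider
    if metadata.get("properties", {}).get("table_type", "").upper() == "ICEBERG":
        return "iceberg"
    return provider

# An Iceberg table surfaced through SparkSessionCatalog + HMS can look like:
meta = {"provider": "hive", "properties": {"table_type": "ICEBERG"}}

print(classify_by_provider(meta) in ALLOWED_FORMATS)      # False -> triggers the error
print(classify_with_properties(meta) in ALLOWED_FORMATS)  # True
```

This matches the symptom in the thread: a table that really is Iceberg gets classified by its reported provider alone, so the snapshot materialization's format check fails on the second run.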

rshanmugam1 commented 1 year ago

Encountering a similar issue. When I explicitly include the catalog in the target_schema, it uses a CREATE OR REPLACE statement instead of performing a MERGE operation on subsequent runs.

Mariana-Ferreiro commented 2 weeks ago

The same thing is happening to us; in our case the table is Iceberg, but the provider it reports is Hive. Reviewing impl.py in dbt-spark and debugging our code, we found that the Hive-provider condition is never met, even though the table is Iceberg.

This can be seen in the definition of the build_spark_relation_list method.

We believe this is a bug in impl.py, since the table is of type Iceberg.

To work around it, we overrode the snapshot macro at the project level and removed the check that validates the table format.

Code removed from the snapshot macro:

 {%- if target_relation_exists -%}
    {%- if not target_relation.is_delta and not target_relation.is_iceberg and not target_relation.is_hudi -%}
      {% set invalid_format_msg -%}
        The existing table {{ model.schema }}.{{ target_table }} is in another format than 'delta' or 'iceberg' or 'hudi'
      {%- endset %}
      {% do exceptions.raise_compiler_error(invalid_format_msg) %}
    {% endif %}
  {% endif %}

This is how we have managed to obtain the desired snapshot behaviour so far.
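
For reference, the Jinja guard quoted above can be transcribed into Python (a sketch, not dbt-spark's implementation) to show exactly when the compiler error fires:

```python
# Sketch of the guard quoted above, transcribed from Jinja to Python.
# On the second run the target relation exists, and if none of the format
# flags are true (the misclassification described in this thread), the
# error from the issue title is raised.

def check_snapshot_target(target_relation_exists: bool,
                          is_delta: bool, is_iceberg: bool, is_hudi: bool,
                          schema: str, table: str) -> None:
    if target_relation_exists and not (is_delta or is_iceberg or is_hudi):
        raise RuntimeError(
            f"The existing table {schema}.{table} is in another format "
            "than 'delta' or 'iceberg' or 'hudi'"
        )

# First run: the table does not exist yet -> no error regardless of flags.
check_snapshot_target(False, False, False, False, "test", "test_snapshot")

# Second run: the table exists but is_iceberg was computed as False -> error.
try:
    check_snapshot_target(True, False, False, False, "test", "test_snapshot")
except RuntimeError as e:
    print(e)
```

Removing the guard, as described above, silences the error, but it also removes the protection against genuinely unsupported formats, so it is a workaround rather than a fix.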
