dbt-labs / dbt-spark

dbt-spark contains all of the code enabling dbt to work with Apache Spark and Databricks
https://getdbt.com
Apache License 2.0
395 stars 221 forks source link

[ADAP-946] [CT-2689] [Bug] Incremental models ran from scratch when created in a pre-hook #918

Closed jelstongreen closed 1 year ago

jelstongreen commented 1 year ago

Is this a new bug in dbt-core?

Current Behavior

When running incremental models in a dev / ci environment I am trying to shallow clone the production tables into the dev schema to fully simulate how the changes might affect the production model. Unfortunately whether a model should be run with is_incremental or not is determined at compile time, before the pre-hook creating the dev version of the model is run. This leads to dbt running the dev model from scratch even though it already exists.

Expected Behavior

Ideally, the check for whether to run the model in incremental mode or not should be executed at runtime.

Steps To Reproduce

Opened a fresh branch which will execute tables in branch db. Run the model and it executes: pre-hook:

CREATE TABLE IF NOT EXISTS `hive_metastore`.`branch`.`model` SHALLOW CLONE prod.model

model:

create
or replace table `hive_metastore`.`branch`.`model` using delta as with src as (
  SELECT
    *
  FROM
    `hive_metastore`.`branch`.`src`
)
SELECT
  *
FROM
  src
WHERE
  timestamp >= (SELECT min(timestamp) FROM src)

If I then cancel that query and rerun dbt: pre-hook:

CREATE TABLE IF NOT EXISTS `hive_metastore`.`branch`.`model` SHALLOW CLONE prod.model

model:

merge into `hive_metastore`.`branch`.`model` as DBT_INTERNAL_DEST using `model__dbt_tmp` as DBT_INTERNAL_SOURCE on DBT_INTERNAL_SOURCE.id = DBT_INTERNAL_DEST.id
when matched then
update
set
  *
  when not matched then
insert
  *

The only difference between the two runs is that the dev version of the model (the clone) exists in the 2nd one at compile time.

Relevant log output

No response

Environment

- OS: MacOS 13.3.1 (a) 
- Python: 3.10.8
- dbt: 1.4.1

Which database adapter are you using with dbt?

spark, other (mention it in "Additional Context")

Additional Context

Databricks

dbeatty10 commented 1 year ago

Thanks for opening this @jelstongreen !

Before we get into the details of this bug report, you might be interested in subscribing to https://github.com/dbt-labs/dbt-core/issues/7256 since the big idea in that feature is:

Clone my production state into my development schema, please!

Reproducing this

Could you help me reproduce this by supplying example model .sql file(s) (and any related YAML configuration files)?

And also the commands to trigger the error?

i.e., something like this:

models/my_incremental_model.sql

{{ config(
    materialized='incremental',
    pre_hook="SQL-statement",
    ...
) }}

select ...

dbt_project.yml

# anything applicable from a dbt_project.yml file here

models/_models.yml

# anything applicable from a (model) properties file here

Commands:

dbt build --full-refresh
dbt run --select my_incremental_model
jelstongreen commented 1 year ago

Hi @dbeatty10. Here you go: models/my_incremental_model.sql

{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key="unique_id",
    pre_hook = [
        """
        {{ shallow_clone_production() }}
        """
    ],
    post_hook = [
        """
        {{ optimize_and_zorder(zorder_cols=['col_a','col_b']) }}
        """
    ] 
) }}

SELECT col_a, col_b...

shallow_clone_production.sql

{%- macro shallow_clone_production() %}
    {% if target.name in ('dev','testing') %}
        {% if var('clone_incremental_dev', 'True') == 'True' %}
            CREATE TABLE IF NOT EXISTS {{ this }}
            SHALLOW CLONE {{ var('prod_schema') }}.{{ this.name }}
        {% endif %}
    {% endif %}
{%- endmacro %}

command

dbt run -s my_incremental_model --project-dir projects/my_project
jelstongreen commented 1 year ago

Given that this is being looked at for 1.6 I imagine this isn't going to be a priority. I can create a workaround in the meantime and run the macro as an operation in a step for invoking dbt unless this is something you'd like to look at.

dbeatty10 commented 1 year ago

Thanks for that example @jelstongreen -- that helps me see exactly what is going on.

When you invoke dbt like dbt run -s my_model --target dev with the setup you described, here's what it does:

So you can see from the above how the error follows:

  1. The pre-hook creates the dev version of the table
  2. Then the materialization tries to create that same dev version of the table 💥

Running a macro as an operation is a good workaround until v1.6 comes out.

Another option you could try if you are so inclined:

github-actions[bot] commented 1 year ago

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

github-actions[bot] commented 1 year ago

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

amyfeng-klaviyo commented 11 months ago

can we check the status of this please?

dbeatty10 commented 11 months ago

@amyfeng-klaviyo we've closed this issue since we didn't find any bug within dbt-core to fix.

If you think you're experiencing a bug within dbt-core, could you open a new issue here and describe it in detail so we can try to reproduce it?

If you have a substantially similar use-case as jelstongreen described here, then I'd suggest one of the following:

amyfeng-klaviyo commented 11 months ago

@dbeatty10 thanks for the update! my question is really the same as the bug described above. We have set up a pre-hook macro to clone the production table for an incremental model (mainly for CI). while i m testing the pre-hook in local, The table:

  1. pre-hook: create or replace table dev.abc clone prod.abc
  2. model run: create or replace table dev.abc as ... (another full build instead of running it as incremental model) I am wondering y the 2nd step wound not run as the model already existed in the dev env. we are currently on v1.3.3 and would need some testing before we move on to v1.6. i can see dbt clone might be the option here.