dbt-labs / dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
https://getdbt.com
Apache License 2.0
9.72k stars 1.61k forks source link

[Bug] Inconsistent behaviour of 'defer' with the 'dbt compile' command #9991

Closed siljamardla closed 5 months ago

siljamardla commented 5 months ago

Is this a new bug in dbt-core?

Current Behavior

Defer does not work with simple compile. This will compile to SQL that uses the dev schema tables:

dbt compile --defer --state prod-run-artifacts

However, if you need to compile production SQL for a specific table, this will work:

dbt compile --defer --state prod-run-artifacts --select model_a

Expected Behavior

Any dbt command executed with --defershould have the same behaviour of combining live data with data that exist in the dev environment.

Steps To Reproduce

Relevant log output

No response

Environment

- OS: Mac OS
- Python: 3.11
- dbt: 1.7.9

Which database adapter are you using with dbt?

spark

Additional Context

https://github.com/dbt-labs/docs.getdbt.com/pull/5162

dbeatty10 commented 5 months ago

Thanks for reaching out @siljamardla -- indeed state selection can be tricky to think about!

I don't think this is a bug. Let me explain why.

Project setup

Suppose you have the following simple models in your dbt project and you start with a set of production artifacts:

Project files `models/model_a.sql` ```sql -- model_a: {{ this }} select 1 as id ``` `models/model_b.sql` ```sql -- model_b: {{ this }} select 2 as id ``` `models/model_c.sql` ```sql -- model_c: {{ this }} -- depends on: -- model_a: {{ ref("model_a") }} -- model_b: {{ ref("model_b") }} select 1 as id ```
Production artifacts `profiles.yml ```yaml duckdb: target: dev outputs: prod: type: duckdb path: 'db.db' schema: prod dev: type: duckdb path: 'db.db' schema: feature_456 ``` Build production and examine the database table names: ```shell dbt build --target prod dbt --quiet ls --output json --output-keys database schema alias --target prod | jq . ``` Save the production state: ```shell rm -rf prod-run-artifacts mv target prod-run-artifacts ```
image

💡 Insight 1

When you do a dbt compile --target dev, it will:

👉 Not specifying --select is equivalent to selecting all nodes.

💡 Insight 2

When you add --defer --favor-state --state prod-run-artifacts, then it will:

👉 --defer --state only comes into play for any un-selected nodes.

Example 1

These are all equivalent:

dbt compile --target dev --defer --favor-state --state prod-run-artifacts
dbt compile --target dev --defer --favor-state --state prod-run-artifacts --select "fqn:*"
dbt compile --target dev --defer --favor-state --state prod-run-artifacts --select model_c model_b model_a

And they all compile to this result (target/compiled/my_project/models/model_c.sql):

-- model_c: "db"."feature_456"."model_c"
-- depends on:
--   model_a: "db"."feature_456"."model_a"
--   model_b: "db"."feature_456"."model_b"

select 1 as id

Examples 2 and 3

Either of the examples below will compile with different output than above since some nodes are not selected.

Example 1

dbt compile --target dev --defer --favor-state --state prod-run-artifacts --select model_c model_b

is the same as:

dbt compile --target dev --defer --favor-state --state prod-run-artifacts --select model_c model_b model_a --exclude model_a

is the same as:

dbt compile --target dev --defer --favor-state --state prod-run-artifacts --exclude model_a

Compiled output:

-- model_c: "db"."feature_456"."model_c"
-- depends on:
--   model_a: "db"."prod"."model_a"
--   model_b: "db"."feature_456"."model_b"

select 1 as id

Example 2

dbt compile --target dev --defer --favor-state --state prod-run-artifacts --select model_c

Compiled output:

-- model_c: "db"."feature_456"."model_c"
-- depends on:
--   model_a: "db"."prod"."model_a"
--   model_b: "db"."prod"."model_b"

select 1 as id

Notes

Summary

I'm going to close this a not a bug, but please let me know I've missed something important and we can re-consider. 🙏

siljamardla commented 4 months ago

@dbeatty10 Thank you for the detailed examples, I get what you mean by defer never using production schema for those models that have been included in the --select clause.

I've actually come back to this issue because I was trying to use defer for testing. I guess I'm hitting the same concept here. I tried something like dbt test --select my_model --defer --state target/live. The model has a not null test defined for the order_key column. I was hoping I could compile and run the test for live data. Because, well, it would be good to test my test before I deploy it to live. And testing my test would be much more convenient and appropriate on live data, not on dev data. However, the compiled test reads:

select order_key
from dev_schema.my_model
where order_key is null

Is there a way to compile the test to run on live? If not, should I create a feature request? What could be a good way to achieve this functionality?

Edit: Out of curiosity I've also defined a more complex (dbt-expectations) test that has a ref to another table in it:

    tests:
      - dbt_expectations.expect_table_aggregation_to_equal_other_table:
          expression: count(*)
          compare_model: source("source_name","source_table_for_my_model")
          compare_expression: count(*)
          group_by: [created_date]
          compare_group_by: [created_date]
          row_condition: 
            created_date = (SELECT max(created_date) from {{ref('my_other_model')}})
          compare_row_condition: >
            created_date = (SELECT max(created_date) from {{ref('my_other_model')}})

I was thinking that maybe this reference to another table will compile for the live schema because it's not included in the --select when I run the test on my_model. But no, it will still compile to dev_schema.my_other_model in the test script.