dbt-labs / dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
https://getdbt.com
Apache License 2.0
9.73k stars 1.61k forks source link

[Bug] DBT unit tests don't work properly when source table `name` matches another source or model #10433

Open yengibar-manasyan-sp opened 2 months ago

yengibar-manasyan-sp commented 2 months ago

Is this a new bug in dbt-core?

Current Behavior

One of my models had the same name as the source Snowflake table. They live in different schemas. dbt run works fine and generates all the data as expected. However, DBT unit tests fail because unit tests can not differentiate mocked source tables from these tables. Here is an example:

unit_tests:
  - name: test_model_1
    description:
    model: MODEL_1
    overrides:
      macros:
        is_incremental: true

    given:
      - input: this
        rows:
          - { column_1: "1", column_2: "2" }

      - input: source('src', 'MODEL_1')
        rows:
          - { column_1: "1", column_2: "3" }

    expect:
      rows:
        - { column_1: "1", column_2: "3" }

When I checked the generated SQL query, I saw that it adds __dbt__cte__ prefixes to the CTE names and replaced raws for source('src', 'MODEL_1') with values from this.

Expected Behavior

Fix unit tests to support identical model and source table names in different schemas.

Steps To Reproduce

  1. Create a model that relies on a table with the table's exact name
  2. Create an incremental unit test and add data both for the source table and this (the model itself)
  3. Run the unit test and see its outcome

Relevant log output

No response

Environment

- OS:
- Python:
- dbt:

Which database adapter are you using with dbt?

snowflake

Additional Context

No response

dbeatty10 commented 2 months ago

Thanks for reporting this @yengibar-manasyan-sp !

Root cause

It appears that the root cause is that during unit testing, sources aren't guaranteed to have a CTE that is unique from refs or other sources.

Example

See below for a reproducible example ("reprex") of the issue you reported when a model and a source have the same name (case_1). It also includes a separate issue when two sources have the same name (case_2).

### Reprex `dbt_project.yml` ```yaml name: "my_project" version: "1.0.0" config-version: 2 ``` `models/model_a.sql` ```sql select -1 as id ``` `models/model_b.sql` ```sql select -2 as id ``` `models/case_0.sql` ```sql select * from {{ ref("model_a") }} union all select * from {{ ref("model_b") }} ``` `models/case_1.sql` ```sql select * from {{ ref("model_a") }} union all select * from {{ source("src_2", "model_a") }} ``` `models/case_2.sql` ```sql select * from {{ source("src_1", "model_a") }} union all select * from {{ source("src_2", "model_a") }} ``` Create some sources that just point to the database objects created by a couple of our models 😂 `models/_sources.yml` ```yaml sources: - name: src_1 schema: "{{ target.schema }}" tables: - name: model_a - name: src_2 schema: "{{ target.schema }}" tables: - name: model_a ``` `models/_unit_tests.yml` ```yaml unit_tests: - name: test_case_0 description: "Unit test SQL: target/run/my_project/models/_unit_tests.yml/models/test_case_0.sql" model: case_0 given: - input: ref("model_a") rows: - { id: 1 } - input: ref("model_b") rows: - { id: 2 } expect: rows: - { id: 1 } - { id: 2 } - name: test_case_1 description: "Unit test SQL: target/run/my_project/models/_unit_tests.yml/models/test_case_1.sql" model: case_1 given: - input: ref("model_a") rows: - { id: 1 } - input: source("src_2", "model_a") rows: - { id: 2 } expect: rows: - { id: 1 } - { id: 2 } - name: test_case_2 description: "Unit test SQL: target/run/my_project/models/_unit_tests.yml/models/test_case_2.sql" model: case_2 given: - input: source("src_1", "model_a") rows: - { id: 1 } - input: source("src_2", "model_a") rows: - { id: 2 } expect: rows: - { id: 1 } - { id: 2 } ``` Commands: ```shell dbt run -s model_a model_b dbt build -s case_0 dbt build -s case_1 dbt build -s case_2 ``` I get the following output: ```shell $ dbt build -s case_0 20:42:08 1 of 2 START unit_test case_0::test_case_0 ..................................... [RUN] 20:42:08 1 of 2 PASS case_0::test_case_0 ................................................ [PASS in 0.21s] 20:42:08 2 of 2 START sql view model feature_456.case_0 ................................. [RUN] 20:42:08 2 of 2 OK created sql view model feature_456.case_0 ............................ [OK in 0.12s] 20:42:08 20:42:08 Finished running 1 unit test, 1 view model in 0 hours 0 minutes and 0.54 seconds (0.54s). 20:42:08 20:42:08 Completed successfully ``` ```shell $ dbt build -s case_1 20:41:15 Failure in unit_test test_case_1 (models/_unit_tests.yml) 20:41:15 actual differs from expected: @@ ,id ---,1 ,2 +++,2 ``` ```shell $ dbt build -s case_2 20:42:57 1 of 2 START unit_test case_2::test_case_2 ..................................... [RUN] 20:42:57 1 of 2 ERROR case_2::test_case_2 ............................................... [ERROR in 0.03s] 20:42:57 2 of 2 SKIP relation feature_456.case_2 ........................................ [SKIP] 20:42:57 20:42:57 Finished running 1 unit test, 1 view model in 0 hours 0 minutes and 0.23 seconds (0.23s). 20:42:57 20:42:57 Completed with 1 error and 0 warnings: 20:42:57 20:42:57 Compilation Error in unit_test test_case_2 (models/_unit_tests.yml) Unit_Test 'unit_test.my_project.case_2.test_case_2' (models/_unit_tests.yml) depends on a source named 'src_1.model_a' which was not found ```

Looking at the output files like the following helps highlight what is going wrong:

dbeatty10 commented 2 months ago

Acceptance criteria

The key outcome we'd need is:

A possible building block

One thing that is already to be unique across the DAG is unique_id, so possibly that could be transformed into a suitable CTE name.

Related code

Some related code is described in https://github.com/dbt-labs/dbt-core/issues/5273#issuecomment-1131543844.

https://github.com/dbt-labs/dbt-adapters/pull/236 and https://github.com/dbt-labs/dbt-core/pull/10290 have proposed updates to the code that generates the CTE name, but I don't think either would fix this particular issue with unit tests.

dbeatty10 commented 2 months ago

Relevant code

https://github.com/dbt-labs/dbt-core/blob/c668846404c78ea7317dc0523b49703c61c9885f/core/dbt/context/providers.py#L1595-L1600

Specifically, we could generate some kind of globally_unique_identifier and then use it like this:

            return self.adapter.Relation.add_ephemeral_prefix(globally_unique_identifier)
dbeatty10 commented 2 months ago

Implementation ideas

Here are a couple ideas of how to create the CTE in unit tests so that they are unique:

  1. Use a quoted version of unique_id + quoted identifiers
  2. Use a hexadecimal formatted hash of the unique_id + quoted identifiers

1. Use a quoted version of unique_id + quoted identifiers

Example:

with "__dbt__cte__source.my_project.src_1.model_a" as (
...
),
"__dbt__cte__model.my_project.model_a" as (

Pros:

Cons:

2. Use a hexadecimal formatted hash of the unique_id + quoted identifiers

Examples:

with __dbt__cte__e83c5163316f89bfbde7d9ab23ca2e25604af290 as (

or

with __dbt__cte__dim_customers_e83c5163316f as (

Pros:

Cons:

Details

We could hash the unique_id similar to _create_sha1_hash:

https://github.com/dbt-labs/dbt-core/blob/c668846404c78ea7317dc0523b49703c61c9885f/core/dbt/task/deps.py#L56

If we don't care about readability of the CTE name, then we could just use the hexdigest as-is which for SHA1 would be a 40-character value like this:

e83c5163316f89bfbde7d9ab23ca2e25604af290

If there are readability concerns, then we could combine an abbreviated version of the hash along with the node name (or identifier), which might look like this:

dim_customers_e83c5163316f
dbeatty10 commented 2 months ago

Implementation ideas (continued)

Here's some other ideas:

  1. Allow configuration of an source alias that is only used during unit tests
  2. Do option 1 or 2, but only for sources

3. Allow configuration of an source alias that is only used during unit tests

Similar to https://github.com/dbt-labs/dbt-core/pull/10290

Example:

with "__dbt__cte__PROVIDED_SOURCE_ALIAS_HERE" as (
...
),
"__dbt__cte__model_a" as (

Pros:

Cons:

4. Do option 1 or 2, but only for sources

Pros:

Cons:

larsstromholm commented 3 weeks ago

We've just come across this issue ourselves. Are you planning contributing to this in the nearest future, @dbeatty10 ?