dbt-labs / dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
https://getdbt.com
Apache License 2.0

[CT-68] [Feature] Use tests on given models or sources to condition another model's execution #4591

Closed fabrice-etanchaud closed 2 years ago

fabrice-etanchaud commented 2 years ago

Is there an existing feature request for this?

Describe the Feature

While talking with Mateusz Klimek, the founder of re_data, a dbt-native data reliability framework, I found that it would be really helpful to be able to build a given model (and its downstream models) only if one of its direct upstream models or sources has a given reliability level (a little like data freshness thresholds):


   +-> B (reliability data about A) -> Reliability Test on B
A -+
   +-> C -> D...

There is a model B containing the reliability data (it references the real model A), and the test references B. But the real model A is also referenced by other models C and D on a distinct path of the DAG (not going through the reliability model). How could I do that, i.e. prevent the downstream models C and D of the real data model A from building, based on a test failure on the separate reliability model B?

This seems to be related to #4016 and #4050

Best regards from Absurdistan's west coast.

Describe alternatives you've considered

Use a generic test based on B, defined on A at the model level:

version: 2

models:
  - name: A
    tests:
      - re_data.is_reliability_over_threshold:
          threshold: 0.9

{% test is_reliability_over_threshold(model, threshold) %}
  {{ return(adapter.dispatch('is_reliability_over_threshold', 're_data')(model, threshold)) }}
{% endtest %}

{% macro default__is_reliability_over_threshold(model, threshold) %}

    select *
    from {{ ref('B') }}
    where model = '{{ model }}'
    and reliability < {{ threshold }}

{% endmacro %}

But I am not really sure the test would be based on a fresh build of B. Do generic tests depend (in the DAG sense) on referenced models other than the model they are defined on?

Who will this benefit?

No response

Are you interested in contributing this feature?

No response

Anything else?

No response

iknox-fa commented 2 years ago

Hi @fabrice-etanchaud, thanks for reaching out with this feature request.

I believe you can achieve what you're looking for by introducing a forced dependency between Model C and Test B. This should create the behavior necessary to ensure C only runs if B passes, regardless of where they live in the DAG.

This is probably the cleanest/easiest method available currently. Please follow up if that doesn't fit the bill; we're always interested in hearing about how dbt's execution model does or does not fit the community's use-cases.

fabrice-etanchaud commented 2 years ago

Hi @iknox-fa, thank you for following up! So in a model definition, I could add a -- depends_on comment referencing a singular test? I am not sure how I would write this; could you please give me an example of how to reference a singular test using -- depends_on? Regards

iknox-fa commented 2 years ago

@fabrice-etanchaud Hi! Sorry for the slow response, but here's an example that should be pretty close to your ASCII diagram above.

data/seed_data.csv (a table of country data which may or may not include Absurdistan):

"iso3","name","iso2","iso_numeric", (etc...)
"ABW","Aruba","AW","533", (etc...)

models/model_A.sql:

select * from {{ ref('seed_data') }}

models/model_B.sql:

select "name" from {{ ref('model_A') }}

models/model_C.sql:

-- depends_on: {{ ref('model_B') }}
select iso3 from {{ ref('model_A') }}

tests/test_for_absurdistan.sql:

select * from {{ ref('model_B') }} where "name"='Absurdistan'

Note: I may not have been clear about this in my first response: the dependency you force is between two models, not between a model and a test. However, since the upstream model in the defined dependency (model_B in this case) has a test, dbt build ensures that test has passed before the dependency is considered met, and skips the downstream model otherwise.

The output with "Absurdistan" in the seed data:

23:05:30  Running with dbt=1.0.1
23:05:30  Found 3 models, 1 test, 0 snapshots, 0 analyses, 165 macros, 0 operations, 1 seed file, 0 sources, 0 exposures, 0 metrics
23:05:30  
23:05:31  Concurrency: 1 threads (target='dev')
23:05:31  
23:05:31  1 of 5 START seed file local_testing.seed_data.................................. [RUN]
23:05:32  1 of 5 OK loaded seed file local_testing.seed_data.............................. [INSERT 257 in 1.46s]
23:05:32  2 of 5 START table model local_testing.model_A.................................. [RUN]
23:05:32  2 of 5 OK created table model local_testing.model_A............................. [SELECT 257 in 0.08s]
23:05:32  3 of 5 START table model local_testing.model_B.................................. [RUN]
23:05:32  3 of 5 OK created table model local_testing.model_B............................. [SELECT 257 in 0.05s]
23:05:32  4 of 5 START test test_for_absurdistan.......................................... [RUN]
23:05:32  4 of 5 FAIL 1 test_for_absurdistan.............................................. [FAIL 1 in 0.04s]
23:05:32  5 of 5 SKIP relation local_testing.model_C...................................... [SKIP]
23:05:32  
23:05:32  Finished running 1 seed, 3 table models, 1 test in 1.87s.
23:05:32  
23:05:32  Completed with 1 error and 0 warnings:
23:05:32  
23:05:32  Failure in test test_for_absurdistan (tests/test_for_absurdistan.sql)
23:05:32    Got 1 result, configured to fail if != 0
23:05:32  
23:05:32    compiled SQL at target/compiled/forced_deps_test/tests/test_for_absurdistan.sql
23:05:32  
23:05:32  Done. PASS=3 WARN=0 ERROR=1 SKIP=1 TOTAL=5

and the output without it:

23:06:32  Running with dbt=1.0.1
23:06:32  Found 3 models, 1 test, 0 snapshots, 0 analyses, 165 macros, 0 operations, 1 seed file, 0 sources, 0 exposures, 0 metrics
23:06:32  
23:06:32  Concurrency: 1 threads (target='dev')
23:06:32  
23:06:32  1 of 5 START seed file local_testing.seed_data.................................. [RUN]
23:06:34  1 of 5 OK loaded seed file local_testing.seed_data.............................. [INSERT 257 in 1.44s]
23:06:34  2 of 5 START table model local_testing.model_A.................................. [RUN]
23:06:34  2 of 5 OK created table model local_testing.model_A............................. [SELECT 257 in 0.09s]
23:06:34  3 of 5 START table model local_testing.model_B.................................. [RUN]
23:06:34  3 of 5 OK created table model local_testing.model_B............................. [SELECT 257 in 0.05s]
23:06:34  4 of 5 START test test_for_absurdistan.......................................... [RUN]
23:06:34  4 of 5 PASS test_for_absurdistan................................................ [PASS in 0.04s]
23:06:34  5 of 5 START table model local_testing.model_C.................................. [RUN]
23:06:34  5 of 5 OK created table model local_testing.model_C............................. [SELECT 257 in 0.04s]
23:06:34  
23:06:34  Finished running 1 seed, 3 table models, 1 test in 1.80s.
23:06:34  
23:06:34  Completed successfully
23:06:34  
23:06:34  Done. PASS=5 WARN=0 ERROR=0 SKIP=0 TOTAL=5

fabrice-etanchaud commented 2 years ago

Great, thank you @iknox-fa, I should have figured that out by myself, thanks! The only limitation of that solution is that I would have to create as many tested models as there are branching conditions (imagine I wanted to update different models based on different values of a reliability metric).
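For instance (just a sketch with made-up names), each threshold would need its own "gate" model, singular test, and forced dependency:

models/reliability_gate_090.sql (one gate model per threshold, all reading the same reliability data):

select * from {{ ref('B') }} where model = 'A'

tests/reliability_below_090.sql (fails when A's reliability is under 0.9, so dbt build would skip everything downstream of reliability_gate_090):

select * from {{ ref('reliability_gate_090') }} where reliability < 0.9

models/C_high.sql (only built when the 0.9 test passes):

-- depends_on: {{ ref('reliability_gate_090') }}
select * from {{ ref('A') }}

A second branch (say, a 0.5 threshold) would need its own reliability_gate_050 model and test.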

Couldn't that solution also be used to simulate source freshness thresholds?

Best regards, Fabrice

jtcohen6 commented 2 years ago

Couldn't that solution also be used to simulate source freshness thresholds?

I think so! Anything you can write SQL for :)
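For example, a singular test along these lines could stand in for a freshness threshold (just a sketch: the staging model, the loaded_at column, and the Postgres-style interval syntax are all assumptions):

models/stg_my_table.sql:

select * from {{ source('my_source', 'my_table') }}

tests/stg_my_table_is_fresh.sql (fails when the newest record is older than 24 hours, so dbt build skips models depending on stg_my_table, including forced -- depends_on dependencies):

select 1
from {{ ref('stg_my_table') }}
having max(loaded_at) < current_timestamp - interval '24 hours'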

I'm going to close this issue, since I believe we have (most of) the (mostly) right constructs already in place.