Automating Non Regression Test : How to get a ref() to the deferred version of the selected model ?

fabrice-etanchaud commented 3 years ago

Describe the feature

Based on the new --defer feature, one would like to automate non regression test (NRT), comparing local model data with the same deferred model data. How one could get the ref() of the deferred model to be given to a NRT macro comparing the common columns on the common primary key values ?

Describe alternatives you've considered

No alternative so far

Additional context

Who will this benefit?

All dbt users who want to automate non regression testing.

Are you interested in contributing this feature?

Yes !

jtcohen6 commented 3 years ago

Interesting idea, @fabrice-etanchaud! Based on how we've defined --defer to work in v0.18.0, a model can either be selected or deferred at runtime. I don't think the current behavior would support ref'ing the same node across multiple manifests.

That said, I agree that this would be really compelling. The foundational work we've done around artifact/state comparison—the surface of which we've just barely scratched—could totally make something like this possible in the future. Maybe we add a "defer" argument to ref, like "use the other one if it's available," such that we could code up a new audit-helper macro ???

fabrice-etanchaud commented 3 years ago

Hi @jtcohen6 ,

currently I am using the macro below to check for non regression. I use it in deferred mode, the unmanaged model is the dev model, and in order to ref the managed version of the same model (in production, for example), I have to mention the database, schema and identifier explicitly. Having a way to use ref() instead would be imho a terrific feature. Is there any plan ? Can I help ?

Best, Fabrice


{% macro non_regression_test(unmanaged_model, managed_model_database, managed_model_schema, managed_model_identifier, pk) %}
-- depends_on: {{ ref(unmanaged_model) }}
{% set unmanaged_model = ref(unmanaged_model) %}
{% set managed_model = api.Relation.create(database=managed_model_database, schema=managed_model_schema, identifier=managed_model_identifier) %}
{% set managed_cols = adapter.get_columns_in_relation(managed_model) %}
{% set managed_colnames = managed_cols | map(attribute='name') | list %}
{% set unmanaged_cols = adapter.get_columns_in_relation(unmanaged_model) %}
{% set unmanaged_colnames = unmanaged_cols | map(attribute='name') | list %}
{% set pk_cols = managed_cols | selectattr('name', 'in', pk) | list %}
{% set pk_colnames = pk_cols | map(attribute='name') | list %}
{% set nrt_cols = unmanaged_cols | rejectattr('name', 'in', pk) | selectattr('name', 'in', managed_colnames) | list %}
{% set nrt_colnames = nrt_cols | map(attribute='name') | list %}

select
  md5(concat({{ columns_csv('managed', pk_cols, alias=false) }})) as tnr_id
  ,{{ columns_csv('managed', pk_cols) }}
  ,'difference' AS change_type
  ,{{ columns_csv('unmanaged', nrt_cols, prefix='unmanaged') }}
  ,{{ columns_csv('managed', nrt_cols, prefix='managed') }}
from {{ managed_model }} as managed
join {{ unmanaged_model }} as unmanaged
  on ({{ join_on_columns('managed', 'unmanaged', pk_cols) }})
where {{ columns_inequality('managed', 'unmanaged', nrt_cols) }}

union all

select
  md5(concat({{ columns_csv('managed', pk_cols, alias=false) }}))
  ,{{ columns_csv('managed', pk_cols, alias=false) }}
  ,'missing'
  ,{{ columns_csv('unmanaged', nrt_cols, prefix='unmanaged', nullify=true, alias=false) }}
  ,{{ columns_csv('managed', nrt_cols, prefix='managed', alias=false) }}
from
  {{ managed_model }} as managed
where not exists (
  select 1
  from {{ unmanaged_model }} as unmanaged
  where ({{ join_on_columns('unmanaged', 'managed', pk_cols) }})
)

union all

select
  md5(concat({{ columns_csv('unmanaged', pk_cols, alias=false) }}))
  ,{{ columns_csv('unmanaged', pk_cols, alias=false) }}
  ,'unexpected'
  ,{{ columns_csv('unmanaged', nrt_cols, prefix='unmanaged', alias=false) }}
  ,{{ columns_csv('managed', nrt_cols, prefix='managed', nullify=true, alias=false) }}
from
  {{ unmanaged_model }} as unmanaged
where not exists (
  select 1
  from {{ managed_model }} as managed
  where ({{ join_on_columns('managed', 'unmanaged', pk_cols) }})
)

{% endmacro %}

{%- macro columns_csv(model, columns, alias=true, prefix=none, nullify=false) -%}
    {%- for column in columns -%}
        cast(
        {%- if not nullify -%}
          {{ model }}.{{ column.name }}
        {%- else -%}
          null
        {%- endif %} as {{ column.data_type }}){% if alias %} as
        {%- if prefix is not none %} {{ prefix }}_{%- else %} {% endif -%}{{ column.name }}{% endif %}
        {%- if not loop.last %}, {% endif -%}
    {% endfor -%}
{%- endmacro -%}

{%- macro join_on_columns(model1, model2, columns) -%}
    {%- for column in columns -%}
        {{ model1 }}.{{ column.name }} = {{ model2 }}.{{ column.name }}
        {%- if not loop.last %} and {% endif -%}
    {%- endfor -%}
{%- endmacro -%}

{%- macro columns_inequality(model1, model2, columns) -%}
    {%- for column in columns -%}
      {{ model1 }}.{{ column.name }} is distinct from {{ model2 }}.{{ column.name }}
      {% if not loop.last %} or {% endif -%}
    {%- endfor -%}
{%- endmacro -%}

jtcohen6 commented 3 years ago

Okay, here's what I'm thinking about:

A builtin defer_ref() macro. By default, it works exactly the same as ref(). In the event that you've passed --state, however, it will instead resolve to the namespace of the referenced relation provided by the state-comparison manifest.

It would be a close cousin of --defer, with two distinctions:

Finer-grained control. This macro says, "For one specific resource, I would like to defer its reference if possible." The --defer flag, by contrast, defers references for all unselected resources. In this world, we could think about the CLI flag as "switching" ref calls to defer_ref calls depending on the node selection criteria.
The ability to defer a self-reference. Using defer_ref(this.name), it would be possible to get this model from the other namespace. Then your regression check could look like:
```
{{ audit_helper.compare_relations(
a_relation=this,
b_relation=defer_ref(this.name),
exclude_columns=[],
primary_key="id"
) }}
```

What do you think @fabrice-etanchaud?

fabrice-etanchaud commented 3 years ago

Hello @jtcohen6 ,

this is why I really love dbt : it's not only efficient, but also elegant ! Non regression tests could not be written more simply.

Thinking out loud, couldn't this help implement model schema awareness ?

{{ audit_helper.compare_relation_columns(
    a_relation=this,
    b_relation=defer_ref(this.name)
) }}

boxysean commented 3 years ago

I'm also interested in non-regression tests (new term to me, thanks @fabrice-etanchaud!) and followed the rabbit-hole to here. :-)

During a dbt run in support of a non-regression test, I would also like a run-time option for my ref() to be switched with its normal parent node to a "test" node.

The strength of the --defer flag seems to be that ref() is able to be switch between namespaces, and makes it appealing for this use-case. However, the --defer flag appears to me to have been built with a different intention. So I question whether --defer provides the right mindset for non-regression testing.

Going a little off-topic from --defer... I was toying with this idea today in Jaffle Shop to "intercept" the ref() call and return a static dataset during dbt --run:

{% macro ref(model_name) %}
    {% if model_name == 'dim_customers' %}
        (
            select 1 as customer_id, 'mperez0@chronoengine.com' as email union all
            select 2 as customer_id, 'smccoy1@reddit.com' as email
        )
    {% elif model_name == 'raw_customers' %}
        (
            select 1 as id, 33 as customer_lifetime_value union all
            select 2 as id, 23 as customer_lifetime_value
        )
    {% else %}
        {% do return(builtins.ref(model_name)) %}
    {% endif %}
{% endmacro %}

And the data test would look like:

{% set test_dataset_query %}
select 'mperez0@chronoengine.com' as email, 33 as customer_lifetime_value union all
select 'smccoy1@reddit.com' as email, 23 as customer_lifetime_value
{% endset %}

{% set query_under_test %}
select
    email,
    customer_lifetime_value
from {{ ref('dim_customers') }} as dim_customers
left join {{ ref('raw_customers') }} as raw_customers
    on raw_customers.id = dim_customers.customer_id
{% endset %}

{{ audit_helper.compare_queries(
    a_query=test_dataset_query,
    b_query=query_under_test,
    primary_key="email"
) }}

The downside of this being that ref() macro will get quite unwieldy, so am thinking some other way is better. Lmk what you think!

jtcohen6 commented 3 years ago

@boxysean I really like the connections coming out of this issue—I'm finding this problem space very generative, even though I'm not yet sold on any one solution in particular (including my hypothetical defer_ref idea above).

I too was thinking about mock data today, having spent some time this morning writing out #2857 and thinking about improved UX for dbt test. You've just closed a mental loop for me.

Imagine that, for each important_model you want to test, you have a seed file named mock__important_model.csv. Here's something that is possible today by overriding the ref() macro:

{% macro ref(model_name) %}
  {% if var('mock', false) %}
    {% set rel = builtins.ref('mock__' ~ model_name) %}
  {% else %}
    {% set rel = builtins.ref(model_name) %}
  {% endif %}
  {% do return(rel) %}
{% endmacro %}

Then dbt test --vars 'mock: true' would execute your tests against the mock data instead. That feels a bit silly, since it's really just unit testing the SQL of your tests, but if you have a number of custom schema and data tests, it could be valuable. (The way I've written it above does presume that you'd have a mock__ seed for every model, which may be unrealistic, so it'd need a bit of tweaking.)

Even more interesting would be overriding the source() macro, with the opportunity to do fancier things for our incremental or snapshot logic:

{% macro source(source_name, table_name) %}

  {% if var('mock', false) %}
    {% set src = builtins.ref('mock__' ~ source_name ~ '__' ~ table_name) %}
  {% elif var('mock_update', false) %}
    {% set src = builtins.ref('mock_update__' ~ source_name ~ '__' ~ table_name) %}
  {% else %}
    {% set src = builtins.source(source_name, table_name) %}
  {% endif %}
  {% do return(src) %}

{% endmacro %}

You could then easily run an entire DAG of transformations and tests on top of fixture data, including equality tests against "expected" results. That's more or less how we run integration tests for open source dbt packages such as dbt_utils and snowplow.

In my personal view, I think that's more trouble than it's worth at most organizations—automatically generating realistic and up-to-date fake data is a massive undertaking—but I know that a rigorous and extensible unit testing framework is on the minds of several community members. Over in #2593, @bashyroger kindly linked me to a quite thoughtful post about what's missing from dbt testing today, and I've been mulling it over for the past month. In particular:

There is no tool to let us deterministically re-create a set of conditions, execute our code and validate the output. That is a basic definition of what tests are, yet we struggled to achieve the same with the so-called tests in dbt.

I think that piece would be possible with the custom ref/source override above. The requirement would be having mock data CSVs that are stable and stored in the data/ folder, or an external orchestration process that can generate on-the-fly CSVs (e.g. with faker) in advance of dbt seed. (Admittedly, this approach wouldn't do anything in the direction of a Gherkin / behave interface, but there are other ongoing conversations around improving the UX of defining custom schema and data tests.)

Speaking of interesting connections, @fabrice-etanchaud, I'd be curious to hear more about how you see this tying into schema awareness. Is it the ability to compare adapters.get_column_in_relation(this) and adapters.get_column_in_relation(defer_ref(this.name))? What about PRs when you're intentionally changing the schema of a model, compared to the version currently in production?

fabrice-etanchaud commented 3 years ago

@jtcohen6 , you are right, testing that schema is not evolving between prod and dev is not an model "invariant" test, but an ad hoc test. There are discussions elsewhere on how to build a schema repository, that's for sure the way to go. Best

jmriego commented 3 years ago

I have been playing around with a way to automate this and I have a working concept here: https://github.com/jmriego/dbt-bdd

The tests are run with behave which is a library for BDD testing which in my opinion is a great fit for DBT as it makes the tests easy to understand by analysts the same way it already does for ELT. Here's an example unit test for a model that counts number of days in a month:

  Scenario: run a sample unit test
     Given calendar is loaded with this data
         | dt         | first_day_of_month |
         | 2020-01-01 | 2020-01-01         |
         | 2020-01-02 | 2020-01-01         |
         | 2020-02-02 | 2020-02-01         |
      When we run the load for fom
      Then the results of the model are
         | first_day_of_month | count_days |
         | 2020-01-01         | 2          |
         | 2020-02-01         | 1          |

The main trick is that the phrase "calendar is loaded with this data" will generate a seed file with that info and with a name specific to this test that it generates automatically. It does that with a scenario id prefix (i.e. abcd124_calendar).

Then, it will replace all ref to calendar with abcd124_calendar. This is really the main trick and I didn't find a better solution, but it does so by passing to dbt a var with the following key and value: {calendar: abcd124}. The code that detects the reference is here: https://github.com/jmriego/dbt-bdd/blob/master/macros/ref.sql

Before spending too much time on this, what do you all think of this approach? Does it make sense to continue in this direction?

jtcohen6 commented 3 years ago

@jmriego This looks cool! We played around with using behave to sketch out a few very basic sequences over in dbt-integration-tests. For a time, that was the repo used for testing adapter plugins, until we built a dedicated one with pytest.

Out of curiosity, why the need for the scenario ID prefix? We do something very similar for dbt integration tests, so as to reuse the same set of models while avoiding database namespace collisions across scenarios or testing invocations. If that's your reason as well, could you simply pass {{ var('scenario') }} into the alias config of those models?

Taking a step back, it would be great to move this line of conversation over to #2354, or into discourse. I think there's a lot of folks who are interested in the broader concept of unit testing / fixed data testing with dbt, and currently we've got various thoughts spread across long threads, tucked under niche feature requests.

jmriego commented 3 years ago

thanks @jtcohen6 ! you're right, the idea about the scenario id prefix is to avoid collisions across scenarios. Passing a variable into the alias config sounds like would be simpler but I'm not sure how to do that. Can you give me a bit more info about how would that work? Where should that change be made(every model .sql file, changing the ref function)?

I'll continue the conversation in those two places. It would be great to have dbt have this kind of capability

jtcohen6 commented 3 years ago

@jmriego I think a custom generate_alias_name macro (docs) could be a big help to you here:

{% macro generate_alias_name(custom_alias_name=none, node=none) -%}
    {% set aliased = custom_alias_name | trim if custom_alias_name else node.name %}
    {% set prefix = var('scenario') ~ '_' if var('scenario') else '' %}
    {% set identifier = prefix ~ aliased %}
    {{ return(identifier) }}
{%- endmacro %}

github-actions[bot] commented 1 year ago

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

github-actions[bot] commented 1 year ago

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest; add a comment to notify the maintainers.

dbt-labs / dbt-core