calogica / dbt-expectations

Port(ish) of Great Expectations to dbt test macros
https://calogica.github.io/dbt-expectations/
Apache License 2.0

--store-failures with query context #185

Open artem110101 opened 2 years ago

artem110101 commented 2 years ago

Hi!

I have a question / suggestion regarding the behaviour of --store-failures when used together with dbt-expectations.

We are looking to adopt dbt-expectations, but it currently lacks some traceability features we need.

Let's look at this arbitrary column-level assertion / test (from https://github.com/dbt-labs/jaffle_shop):

  - name: orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
          - dbt_expectations.expect_column_values_to_be_between:
                min_value: 10
                max_value: 40

When dbt is run with --store-failures, a single boolean column is created in persistent storage. To make this data a source of truth and actionable, we want to add context or details to the query results. For example:

  - name: orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
          - dbt_expectations.expect_column_values_to_be_between:
                query_context: 'order_id, customer_id, order_date, status'
                min_value: 10
                max_value: 40

This can be achieved with simple changes to the macros. You can see an example here:

https://github.com/calogica/dbt-expectations/compare/main...badal-io:dbt-expectations:add_failure_context

This way, the result table can have more context and details (row-level, in some cases), which might be useful not just for us.
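
For illustration, the compiled failures query for the example above might then look roughly like this (hypothetical output; the table name and exact SQL depend on the macro implementation):

-- hypothetical compiled --store-failures query with query_context applied
select order_id, customer_id, order_date, status
from analytics.jaffle_shop.orders
where not (order_id >= 10 and order_id <= 40)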

[screenshot: example test results table with the added context columns]

Would love to get some feedback on the feasibility of this. We are happy to create a PR from our side.

This is possible to implement without breaking any of the existing tests for current users.

artem110101 commented 2 years ago

The query context could also be an array of strings; accepting raw SQL would be riskier here.
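
For example (hypothetical syntax for the list form):

  - name: orders
    columns:
      - name: order_id
        tests:
          - dbt_expectations.expect_column_values_to_be_between:
              # hypothetical list form of the proposed parameter
              query_context: ['order_id', 'customer_id', 'order_date', 'status']
              min_value: 10
              max_value: 40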

clausherther commented 2 years ago

Hi @artemboiko1, thanks for opening this issue! That's a very interesting proposal that I have to think about more deeply. My initial thoughts are that this a) would have to be extensively tested, since it potentially adds new grouping levels to tests that rely on group_by parameters, and b) as you point out, could lead to all kinds of unintended second-order effects from folks adding arbitrary query context strings. I'd prefer it if query context were natively handled within dbt-core and the store_failures feature, rather than sort of hacking it in here. Have you brought this up at all with the dbt-labs folks?

artem110101 commented 2 years ago

@clausherther

thanks a lot for the response!

Actually, dbt's native assertions do store more data when used with --store-failures, but dbt-expectations tests do not. I'm not sure what can be done at the dbt level, because those queries are compiled at the package/macro level. And we don't want to run the query multiple times.

artem110101 commented 2 years ago

@clausherther

dbt-core's built-in schema tests already select * as the context for the failure. For example:

select *
from `ld-snowplow`.`jaffle_shop`.`stg_orders`
where order_id is null

for this model

models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null

so the issue is dbt-expectations-specific
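
For reference, a dbt-expectations test built on expression_is_true compiles to roughly this shape (simplified sketch, not the exact macro output), so only the boolean expression column ever reaches the failures table:

-- simplified sketch of the compiled dbt-expectations test
with grouped_expression as (
    select
        order_id >= 10 and order_id <= 40 as expression
    from `ld-snowplow`.`jaffle_shop`.`orders`
),
validation_errors as (
    select *
    from grouped_expression
    where not(expression is true)
)
select *
from validation_errors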

artem110101 commented 2 years ago

@clausherther

please let me know if there is anything I can do from our side to make this happen. I would be happy to address your concerns.

clausherther commented 2 years ago

Hi @artemboiko1 - sorry, I haven't had a ton of time to look into this. However, one thought I had was that rather than adding a query_context parameter to every test, I'd prefer we follow something similar to the dbt-core model and simply 1) emit all columns from the model (select *) for non-grouped tests, or 2) emit all grouped columns from the model when the test is on aggregated data.
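
A rough sketch of option 1) for a non-grouped expression test (simplified and hypothetical, not a final implementation):

-- hypothetical change: emit all model columns alongside the expression
-- so that stored failures carry row-level context
with grouped_expression as (
    select
        *,
        order_id >= 10 and order_id <= 40 as expression
    from `ld-snowplow`.`jaffle_shop`.`orders`
),
validation_errors as (
    select *
    from grouped_expression
    where not(expression is true)
)
select *
from validation_errors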

This would still require going through all tests to make sure this works as intended. Also, I wouldn't know how to test this in the integration_tests suite. Any ideas?

artem110101 commented 2 years ago

@clausherther

I think that should work. Regarding integration_tests, I can't say at the moment; I'll have to dig deeper into it. Do you think this change should come in a single MR, or are you okay with multiple smaller ones? For example, starting with the simplest tests.

In a perfect world I would love to have the flexibility of selecting only specific columns. Selecting everything can potentially get expensive in some cases, since the number of columns can be huge. But in that case, this would have to be changed in dbt-core as well.

Just a small note: this will change current behaviour. Most likely no one cares about or depends on the current shape of the --store-failures results, but let me know if you think there are unintended consequences to this.

I would be happy to create MRs in this order:

  1. Covering non-aggregate tests (MR 1)
  2. Updating integration_tests (MR 1)
  3. Aggregate queries (MR 2)
  4. Updating integration_tests (MR 2) <- updated

What do you think?

clausherther commented 2 years ago

That order of pull requests works for me, although 3. should also be followed by a check of the integration_tests, right? In any case, I would want to release all of these as version 0.6.0, since it's a potentially breaking change.

artem110101 commented 2 years ago

"should also be followed by a check of the integration_tests, right?"

Yes, my bad, thanks! I will rework my MR to reflect the discussed changes.

Thanks a lot for taking the time to discuss this! ❤️

clausherther commented 2 years ago

Awesome, looking forward to this! 👍

James-DBA-Anderson commented 2 years ago

Wow, I have just spent the last two days trying to figure out how to do exactly this!

I ended up overriding the dbt_expectations tests with slightly modified versions that accept a contextual_column_list param. I then added a loop in the set expression to write these columns out with a max aggregation:

{% test expect_multicolumn_sum_to_equal_with_context(
    model,
    column_list,
    sum_total,
    group_by=None,
    row_condition=None,
    contextual_column_list=None
    ) %}

{% set expression %}

{# carry the context columns through the aggregation so they
   appear in the stored failures #}
{% for column in contextual_column_list %}
max({{ column }}) as {{ column }},
{% endfor %}

{# the original assertion: the summed columns must equal sum_total #}
{% for column in column_list %}
sum({{ column }}){% if not loop.last %} + {% endif %}
{% endfor %} = {{ sum_total }}
{% endset %}

{{ dbt_expectations.expression_is_true(model,
    expression=expression,
    group_by_columns=group_by,
    row_condition=row_condition
) }}

{% endtest %}
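
For reference, usage in a schema file then looks something like this (hypothetical model and column names):

models:
  - name: payments   # hypothetical model
    tests:
      - expect_multicolumn_sum_to_equal_with_context:
          column_list: [amount_gross, amount_tax]
          sum_total: 100
          group_by: [order_id]
          contextual_column_list: [customer_id, order_date]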

This allows me to trace much more about test runs.

I'd like to get the start time, end time and invocation id of the run stored so that I could see how tests have behaved over multiple runs.
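
A sketch of how that might work: dbt exposes invocation_id and run_started_at as Jinja context variables, so the same max() trick could emit run metadata into the expression (untested):

{# sketch only: invocation_id and run_started_at are standard dbt Jinja
   context variables, rendered here as string literals #}
max('{{ invocation_id }}') as invocation_id,
max('{{ run_started_at }}') as run_started_at,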

I'd also like to change the audit table naming convention so I don't have to check all the audit tables. Ideally the logs would all go in one table with the test name as a column.

I'll keep an eye on this ticket and may be able to help if needed.

reinososamanta commented 1 year ago

This would be great, thanks!

reinososamanta commented 1 year ago

Hi guys, just wondering if these changes are going forward. It would be really useful to have this :)

sambloom92 commented 1 year ago

I would also find this really useful if it can be done

YevgenyLevi commented 1 year ago

Hi @clausherther. Please let us know if this is going forward; I am considering building my own tests just to implement the same changes.

Thanks!

clausherther commented 1 year ago

Hi @YevgenyLevi - this is an open source project. If there's not enough interest in the community to implement this feature in the repo, it's unlikely to happen otherwise.

adamcunnington-mlg commented 1 year ago

@clausherther just an extra thought on this. I actually ran into the same thing. I think it's a shame that dbt-labs do not enforce that ALL tests created by anyone should do some standard {{ "*, " if should_store_failures() }} thing.

Regarding group by - the workaround is to use window functions and never group-by.
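
For instance, a grouped sum check could be rewritten with a window function so every failing row keeps its full context (hedged sketch with illustrative table/column names, not code from the package):

select *
from (
    select
        *,
        -- window function instead of group by: the per-order total
        -- is attached to every row, so no context is lost
        sum(amount) over (partition by order_id) as order_total
    from analytics.jaffle_shop.payments
) windowed
where order_total != 100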

I think it should be mandatory that every single test returns row-level information, without aggregation.

In the meantime, can you please advise how one would go about overriding a dbt_expectations test? This is perhaps more of a dbt question, but I know you'll have the answer! It's not as straightforward as customising a built-in test, because they are namespaced within the dbt_expectations package. Also, can you clarify which bits I should override - e.g. in the example of this test, there is a test block that calls a macro, which in turn calls the default__ version of that macro. Which bits should I implement when overriding?

adamcunnington-mlg commented 1 year ago

Ah, answer to my question: https://docs.getdbt.com/reference/dbt-jinja-functions/dispatch#overriding-package-macros
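
i.e. something like this in dbt_project.yml (a sketch per those docs; the project name is illustrative), plus a default__-prefixed implementation of the macro in your own project:

dispatch:
  - macro_namespace: dbt_expectations
    search_order: ['my_project', 'dbt_expectations']   # search your project first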

Although, I'm not sure if it's possible to override a namespaced test that does not use adapter.dispatch :( e.g. https://github.com/calogica/dbt-expectations/blob/main/macros/schema_tests/column_values_basic/expect_column_values_to_be_in_set.sql