calogica / dbt-expectations

Port(ish) of Great Expectations to dbt test macros
https://calogica.github.io/dbt-expectations/
Apache License 2.0
1.04k stars 126 forks source link

`expect_compound_columns_to_be_unique` has backward logic for `ignore_row_if` #200

Closed mcannamela closed 1 year ago

mcannamela commented 1 year ago

Caveat, it's possible I've just misunderstood the intended semantics. If so, sorry.

What

I noticed when using expect_compound_columns_to_be _unique that it passed when I expected it to fail, evidently because it was not including records with any nulls in its test.

The docs read:

ignore_row_if: "any_value_is_missing" # (Optional. Default is 'all_values_are_missing')

Which led me to believe all rows would be included so long as any field was non-null. I thought perhaps the default was wrong, but specifying "all_values_are_missing" explicitly gave the same result. When I changed to "any_value_is_missing", the test went from passing to failing.

Inspecting the rendered SQL showed me something like:

ignore_row_if: "all_values_are_missing"

with validation_errors as (

    select
        g1,g2,g3
    from dbt_cannamm_dev_macro_tests.test_dbt_expectations__expect_compound_columns_to_be_unique
    where
        1=1
        and
        (
            g1 is not null and
            g2 is not null and
            g3 is not null

        )

    group by
        g1,g2,g3
    having count(*) > 1

)
select * from validation_errors

You can see that it's selecting where "all fields are not null" for inclusion in the test; but the negation of "all fields are not null" (i.e. the "ignored" rows) is "any field is null", the opposite of what the arguments specified.

Looking a the code bears this out:

{%- set row_condition_ext -%}

{%- if ignore_row_if == "all_values_are_missing" %}
        (
            {% for column in columns -%}
            {{ column }} is not null{% if not loop.last %} and {% endif %}
            {% endfor %}
        )
{%- elif ignore_row_if == "any_value_is_missing" %}
        (
            {% for column in columns -%}
            {{ column }} is not null{% if not loop.last %} or {% endif %}
            {% endfor %}
        )
{%- endif -%}

...elided...

where
        1=1
    {%- if row_condition_ext %}
        and {{ row_condition_ext }}
    {% endif %}

Amusingly, since the ignore_row_if argument is not guarded against invalid values like the others, one can access a third behavior ("don't ignore any rows") by passing in any other string.

Reproduction

Here are (failing) "unit tests" in dbt that show the behavior:

test_dbt_expectations__expect_compound_columns_to_be_unique.sql

WITH setup AS (
    SELECT g1, g2, g3
    FROM VALUES
             (null, null, null),
             (null, null, null),
             (null, null, 1),
             (null, null, 1),
             (1, null, 1),
             (null, 'a', 2),
             (1, 'a', 3),
             (1, 'b', 4),
             (2, 'a', 5),
             (2, 'b', 6)
              AS t(g1, g2, g3)
)
select *
from setup

test_dbt_expectations__expect_compound_columns_to_be_unique.yml

version: 2

expect_to_fail: &exp_fail
    fail_calc: '(count(*)=0)::int'

models:
  - name: test_dbt_expectations__expect_compound_columns_to_be_unique
    description: >
      This test proves how the dbt_expectations test `expect_compound_columns_to_be_unique` works, since the behavior of
      `ignore_row_if` is unclear.

    tests:
      - dbt_expectations.expect_table_row_count_to_equal:
          value: 9

      - dbt_expectations.expect_compound_columns_to_be_unique:
          column_list: [ 'g1', 'g2' ]
          ignore_row_if: "any_value_is_missing"
          config:
            meta:
              comment: "Passes even in the presence of degeneracy if rows with any null are skipped"

      - dbt_expectations.expect_compound_columns_to_be_unique:
          column_list: [ 'g1', 'g2', 'g3' ]
          ignore_row_if: "all_values_are_missing"
          config:
            <<: *exp_fail
            meta:
              comment: "Fails when rows with some nulls are degenerate in non-null columns and we only skip if all values are null"

      - dbt_expectations.expect_compound_columns_to_be_unique:
          column_list: [ 'g1', 'g2', 'g3' ]
          ignore_row_if: ""
          config:
            <<: *exp_fail
            meta:
              comment: "Fails when rows with some nulls are degenerate in non-null columns and we only skip if all values are null"

      - dbt_expectations.expect_compound_columns_to_be_unique:
          column_list: [ 'g1', 'g2', 'g3' ]
          ignore_row_if: "any_value_is_missing"
          config:
            meta:
              comment: "Passes even in the presence of degeneracy if rows with any null are skipped"

      - dbt_expectations.expect_compound_columns_to_be_unique:
          column_list: [ 'g1', 'g2', 'g3' ]
          config:
            <<: *exp_fail
            meta:
              comment: "Should fail by default if `ignore_row_if` defaults to 'all_values_are_missing' as the docs claim"

Suggested fix

Reversing the bodies of the if branches should give the expected behavior.

It would probably also be wise to add an explicit "never_ignore" option and guard against users passing nonsense values for ignore_row_if that bypass the logic.

I guess it's worth pointing out that this will be very surprising for folks who have been using this macro and suddenly have a bunch of failing tests, but I don't imagine there is any avoiding that.

clausherther commented 1 year ago

Thanks @mcannamela for this issue and the extensive explanation. It's quite possible I mucked this up when I implemented this. The genesis of this whole any/all missing section comes from the original Great Expectations API.

However, I think the logic is correct the way I understood the intent of this parameter. Let's say we're testing the uniqueness of the combo of A and B, then we either want to test:

Does that make sense? Did I get those booleans right?

Happy to accept any PRs against this issue, and we can figure out how to deal with potentially breaking changes when we release a fix.

clausherther commented 1 year ago

Taking a second look, I think I see what you're saying. I think the fix might be to move the not outside the eval, so maybe

{%- if ignore_row_if == "all_values_are_missing" %}
        not (
            {% for column in columns -%}
            {{ column }} is null{% if not loop.last %} and {% endif %}
            {% endfor %}
        )
{%- elif ignore_row_if == "any_value_is_missing" %}
        not (
            {% for column in columns -%}
            {{ column }} is null{% if not loop.last %} or {% endif %}
            {% endfor %}
        )
{%- endif -%}

which I would probably shorten to just change the and to or conditional on ignore_row_if,

{%- set row_condition_ext -%}

{%- if row_condition  %}
    {{ row_condition }} and
{% endif -%}

{%- set op = "and" if ignore_row_if == "all_values_are_missing" else "or" -%}
    not (
        {% for column in columns -%}
        {{ column }} is null{% if not loop.last %} {{ op }} {% endif %}
        {% endfor %}
    )
{%- endset -%}
clausherther commented 1 year ago

I get the above to pass if I add these somewhat contrived tests to schema.yml

        - dbt_expectations.expect_compound_columns_to_be_unique:
            column_list: ["date_col", "col_null"]
            ignore_row_if: "any_value_is_missing"
        - dbt_expectations.expect_compound_columns_to_be_unique:
            column_list: ["col_null", "col_null"]
            ignore_row_if: "all_values_are_missing"
clausherther commented 1 year ago

Added a quick PR (#202) to capture the above, happy to colab on this if you think more checking of the inputs is warranted!

mcannamela commented 1 year ago

Thanks @clausherther, I think you are correct in your reformulation, I'll have a look at the PR today.