calogica / dbt-expectations

Port(ish) of Great Expectations to dbt test macros
https://calogica.github.io/dbt-expectations/
Apache License 2.0
1.01k stars 123 forks source link

[WIP] Add test for a percentage of rows to be null #241

Closed danhphan closed 11 months ago

danhphan commented 1 year ago

This [Work in Progress] PR addresses the issue #238 that will allow for percentage of rows to be null.

CC @clausherther Please let me know if it needs any further changes.

Thank you.

danhphan commented 1 year ago

FYI: I have added a column named col_null_percentage in the data_test.sql file for testing purpose. This column have 4 rows including one null row. So the null percentage will be 1/4 = 0.25. And in the integration_tests/models/schema_tests/schema.yml file, added the test for that column will less than 0.3. Thank you.

      - name: col_null_percentage
        tests:
          - dbt_expectations.expect_column_null_percentage_to_be_less_than:
              value: 0.3
danhphan commented 1 year ago

Hi @clausherther I am not sure why the CircleCI is still running. I run all dbt test on a local machine and it was quite fast (a couple of minutes).

Any suggestions on this. Thank you!

clausherther commented 1 year ago

@danhphan I have to approve it first, will do now.

clausherther commented 1 year ago

I'll take a look at the PR itself this weekend and give feedback then.

clausherther commented 1 year ago

@danhphan I think you'll need to rebase to get CI to pass (I remove SQLFluff support for now since their latest release broke things). I wasn't able to rebase your branch from my side.

clausherther commented 1 year ago

I was able to rebase your branch, but left comments re: the test. I think essentially, we'd need to add those tolerance params to expression_is_true, which is a more extensive change than maybe you had in mind. Let me know if you want to try and tackle that, or if you'd rather have me help with that.

danhphan commented 1 year ago

Hi @clausherther, yes, I think adding these tolerance params will allow better generalisation.

Let me work on that. If I face any issues, I will ask for your help :) Thank you.

danhphan commented 1 year ago

Hi @clausherther

I would like to have a further discussion to make sure that I will go to the right direction.

So we will extend the expression_is_true test with optional tolerance and tolerance_percent parameter to something like this, right?:

test expression_is_true(model,
                                 expression,
                                 test_condition="= true",
                                 group_by_columns=None,
                                 row_condition=None,
                                 tolerance=None,
                                 tolerance_percent=None
                                 )

And the tolerance here means an absolute value that count(expression = false) < tolerance, and the tolerance_percent is about a ratio with count(expression = false) / count(*) < tolerance_percent.

For example: if we expect that a column is not null with a tolerance_percent of 5%, which means less than 5% of values in that column will be null

so expression = column_name is not null and the test for tolerance_percent will be something like: count( not (column_name is not null) ) / count(*) < 5%

Thank you.

danhphan commented 1 year ago

I am thinking to change the expression_is_true test to:

with grouped_expression as (
    ...
),

tolerance_expression as (
    select ( sum (expression is false) * 1.0 / count(*) ) < {{ tolerance_percent }} as tolerated_expression
    from
        grouped_expression    
) ,

validation_errors as (
    select
        *
    from
        tolerance_expression
    where
        not(tolerated_expression {{ test_condition }})
)

But not sure if it will work for the case when we have a group_by_columns parameter.

In addition, is there a test in the codebase for the case of using group_by_columns parameter in the expression_is_true test?

Thank you!

jenna-jordan commented 9 months ago

Just curious why this PR got closed, and whether there are any plans to add this feature? (we commonly test that, for example, >90% of values are not null, using Great Expectations)