Closed danhphan closed 11 months ago
FYI: I have added a column named col_null_percentage
in the data_test.sql
file for testing purpose. This column have 4 rows including one null row. So the null percentage will be 1/4 = 0.25. And in the integration_tests/models/schema_tests/schema.yml
file, added the test for that column will less than 0.3. Thank you.
- name: col_null_percentage
tests:
- dbt_expectations.expect_column_null_percentage_to_be_less_than:
value: 0.3
Hi @clausherther I am not sure why the CircleCI is still running. I run all dbt test
on a local machine and it was quite fast (a couple of minutes).
Any suggestions on this. Thank you!
@danhphan I have to approve it first, will do now.
I'll take a look at the PR itself this weekend and give feedback then.
@danhphan I think you'll need to rebase to get CI to pass (I remove SQLFluff support for now since their latest release broke things). I wasn't able to rebase your branch from my side.
I was able to rebase your branch, but left comments re: the test.
I think essentially, we'd need to add those tolerance params to expression_is_true
, which is a more extensive change than maybe you had in mind. Let me know if you want to try and tackle that, or if you'd rather have me help with that.
Hi @clausherther, yes, I think adding these tolerance params will allow better generalisation.
Let me work on that. If I face any issues, I will ask for your help :) Thank you.
Hi @clausherther
I would like to have a further discussion to make sure that I will go to the right direction.
So we will extend the expression_is_true test with optional tolerance
and tolerance_percent
parameter to something like this, right?:
test expression_is_true(model,
expression,
test_condition="= true",
group_by_columns=None,
row_condition=None,
tolerance=None,
tolerance_percent=None
)
And the tolerance
here means an absolute value that count(expression = false) < tolerance
, and the tolerance_percent
is about a ratio with count(expression = false) / count(*) < tolerance_percent
.
For example: if we expect that a column is not null with a tolerance_percent
of 5%, which means less than 5% of values in that column will be null
so expression
= column_name is not null
and the test for tolerance_percent
will be something like: count( not (column_name is not null) ) / count(*) < 5%
Thank you.
I am thinking to change the expression_is_true test to:
with grouped_expression as (
...
),
tolerance_expression as (
select ( sum (expression is false) * 1.0 / count(*) ) < {{ tolerance_percent }} as tolerated_expression
from
grouped_expression
) ,
validation_errors as (
select
*
from
tolerance_expression
where
not(tolerated_expression {{ test_condition }})
)
But not sure if it will work for the case when we have a group_by_columns
parameter.
In addition, is there a test in the codebase for the case of using group_by_columns
parameter in the expression_is_true
test?
Thank you!
Just curious why this PR got closed, and whether there are any plans to add this feature? (we commonly test that, for example, >90% of values are not null, using Great Expectations)
This [Work in Progress] PR addresses the issue #238 that will allow for percentage of rows to be null.
CC @clausherther Please let me know if it needs any further changes.
Thank you.