calogica / dbt-expectations

Port(ish) of Great Expectations to dbt test macros
https://calogica.github.io/dbt-expectations/
Apache License 2.0

expect_column_distinct_count_to_equal fails if there is no data! #214

Closed: adamcunnington-mlg closed this issue 1 year ago

adamcunnington-mlg commented 1 year ago

We use expect_column_distinct_count_to_equal to assert there are x distinct values.

If there are no rows, there are 0 distinct values - but clearly this shouldn't fail when there are no rows.

Is it right for the test to be tweaked so this doesn't fail - or do you think I should include some where condition? What would my where condition be to say "where there are rows"? Maybe it's a test-specific arg that you should support which acts as a flag for the behaviour?

Keen for thoughts, thanks.

clausherther commented 1 year ago

Hi @adamcunnington-mlg, which DW platform are you using? On my Postgres, BigQuery, and Snowflake instances, this passes:

          - dbt_expectations.expect_column_distinct_count_to_equal:
              row_condition: 1=2 # to simulate an empty table
              value: 0

All this test does is check that count(distinct col) = 0, so as long as count(distinct col) doesn't return null in your case, this should pass.

Am I understanding your question correctly?

adamcunnington-mlg commented 1 year ago

Hi @clausherther, we are using BigQuery. I guess I could add row_condition: false - but semantically, I wondered whether this should be built-in functionality. Very happy to add that condition though for now.

clausherther commented 1 year ago

If your table really contains no rows, then this should work (I only added that row_condition to simulate an empty table, since I didn't have one handy):

          - dbt_expectations.expect_column_distinct_count_to_equal:
              value: 0

This essentially runs

select count(distinct col) = 0
from data

which returns true for an empty table.

clausherther commented 1 year ago

Still not sure I understand your question or use case correctly.

clausherther commented 1 year ago

Also confirmed that value: 0 passes if the table is not empty but the column contains only nulls.

adamcunnington-mlg commented 1 year ago

@clausherther ok, so my use case is very simple:

I have just realised that row_condition is effectively just a where clause, so I can't use it: a where operates on table rows, so it can't isolate the state of "table is empty".

I think my point of view is that when the table is empty, this test is not relevant and should not fail - and this should either be built-in behaviour of the test itself or some custom parameter that behaves like a flag, e.g. ignore_if_empty.
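To make the idea concrete, here is a hypothetical sketch of such a flag as a local custom generic test (the test name and semantics are illustrative, not part of dbt-expectations; dbt generic tests fail when they return rows):

    {% test distinct_count_to_equal_unless_empty(model, column_name, value) %}
    -- returns failing rows only when the table is non-empty
    -- and the distinct count differs from the expected value
    with counts as (
        select
            count(*) as row_count,
            count(distinct {{ column_name }}) as distinct_count
        from {{ model }}
    )
    select *
    from counts
    where row_count > 0
      and distinct_count != {{ value }}
    {% endtest %}

It could then be attached to a column in schema.yml like any other generic test, e.g. - distinct_count_to_equal_unless_empty: with value: 3.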

clausherther commented 1 year ago

@adamcunnington-mlg where I'm confused is that based on my tests this morning (see above), the test should not fail in your case. Can you provide more detail on your test config?

adamcunnington-mlg commented 1 year ago

@clausherther in your example, you are asserting there are 0 distinct values. I am not asserting there will be 0. I will not be changing my test based on whether my table contains data or not.

Or do you expect a distinct count of 1 test to also pass when the table is empty?

clausherther commented 1 year ago

Gotcha. So, I think if your table is empty and you're asserting that the column should contain 1 or more distinct values, then the test should fail. That is, this is working as intended. In a data warehouse, I don't see value in empty tables other than during development, where we can choose not to run tests, so I don't think we'd want to design test arguments around that use case.

adamcunnington-mlg commented 1 year ago

There can be a huge range of reasons why a table might be empty.

I need to be able to isolate "empty table" without changing my model code to avoid failure. How can I do that, or would you need to add a custom parameter?

clausherther commented 1 year ago

Could you expand a bit on those reasons? I'm curious why an empty production fact or dimension table would be valuable in a DW setting?

clausherther commented 1 year ago

I would probably recommend implementing expect_column_distinct_count_to_equal locally for your use case, with something like count(distinct col) = {{ value }} or count(distinct col) = 0.

adamcunnington-mlg commented 1 year ago

fact/dimension? It sounds like you are considering only a traditional data warehouse data model.

Big data & modern ELT patterns have given rise to many different data models. In our case, we have a concept of data feeds, of which we onboard many instances (as there is a degree of similarity in data needs across our client base). Between "onboarding" a feed and actually turning its schedule on, our tables will be empty. The tests for these tables come from a dynamically rendered template - we have an abstraction layer around dbt - and we don't expect data assertions to fail for an empty table before it is first populated.

I think you are imagining a typical traditional BI setup where the data model is a fairly known and fixed quantity. In our world, we are constantly adding new feeds and onboarding further instances of existing feeds.

For now, I could get around this by changing the equal check to a less-than check - but then I'm compromising on the assertion I'm making when the table is non-empty.
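For reference, that weaker workaround might look like this in schema.yml (a sketch using the package's existing upper-bound test; with an expected distinct count of, say, 3):

          - dbt_expectations.expect_column_distinct_count_to_be_less_than:
              value: 4 # expected distinct count + 1; also passes at 0 for an empty table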

Thanks for the consideration though - I appreciate it.

clausherther commented 1 year ago

Correct, the package is designed for the most common data warehouse use cases, which typically include populated fact and dimension tables that represent business processes. Running data quality checks on empty tables in production isn't really something I'd consider in scope.
I'd recommend implementing your own test for your use case, or set the severity level to warn.

adamcunnington-mlg commented 1 year ago

Ok, I think that's fair.

As an aside, I'd be surprised if Kimball/Inmon still represent the majority of DWH use cases - it will depend on the industry and type of use case. I'd agree that's likely still the majority for incumbent blue chips doing data warehousing on internal datasets, but I'd be surprised if it's representative of the majority of dbt use cases, which skew towards more future-oriented setups.