calogica / dbt-expectations

Port(ish) of Great Expectations to dbt test macros
https://calogica.github.io/dbt-expectations/
Apache License 2.0
1.06k stars 130 forks source link

Add excluded dates for `expect_row_values_to_have_data_for_every_n_datepart` #140

Closed gofford closed 2 years ago

gofford commented 2 years ago

expect_row_values_to_have_data_for_every_n_datepart is tremendously useful for validating whether every day has data.

However, in some situations, there may be days where there is a data gap and I'm completely OK with it. For example:

The test currently assumes full continuity and will always fail in these scenarios. It would be useful to have a provision to set 'accepted gaps' whereby an error won't be raised if a provided date (or matching date) is missing.

Implementing this could be a relatively simple argument that passes to a conditional where clause in the final query object, similar to how row_condition is applied to model_data

clausherther commented 2 years ago

Hi @gofford! Thanks for raising this issue, those use cases makes sense to me! Would you be interested in taking a stab at a PR? Happy to help! I could see passing in a list of exception dates? A more complex but more robust solution would be to pass in a table of exception dates as model ref, but I think the list approach might cover a lot of ground to start. What do you think?

gofford commented 2 years ago

Sounds good @clausherther. I can definitely see the utility of an external ref table for exclusions, particularly in the case where there are a large number of dates to be excluded. In the interim I think a good starting point would be to emulate the row_condition logic to allow for flexible exclusion.

Locally, I have amended the macro to allow for tests of the form:

tests:
  - dbt_expectations.expect_row_values_to_have_data_for_every_n_datepart:
      date_col: date_day
      date_part: day
      exclusion_condition: date_day not in ('2019-12-25', '2020-12-25', '2021-12-25')

# or equivalently

tests:
  - dbt_expectations.expect_row_values_to_have_data_for_every_n_datepart:
      date_col: date_day
      date_part: day
      exclusion_condition: not(date_part(month, date_day) = 12 and date_part(day, date_day) = 25)

exclusion_condition (default: None) is then applied conditionally as part of a WHERE clause on the final CTE.

What do you think of this? This is working fine for my immediate (admittedly limited) use-case, and can certainly be improved moving forward. If you're happy with this as a starting point I can set up a PR.

clausherther commented 2 years ago

@gofford I like the approach! Makes sense to allow users to configure this for their use case and platform. In theory, I think this could cover my example of an exclusion table, e.g.

exclusion_condition: date_day not in (select date_day from dates where is_holiday = true)

or even

exclusion_condition: date_day not in (select date_day from ref('dates') where is_holiday = true)

(this last one may need some refinement and testing)

clausherther commented 2 years ago

Closed via #141