Closed jochenchrist closed 1 week ago
It would be great if we could adopt https://github.com/bitol-io/tsc/blob/main/rfcs/0007-data-quality.md
models:
  orders:
    fields: ...
    quality:
      - expectation_type: expect_table_row_count_to_be_between
        kwargs:
          min_value: 1
          max_value: 1000000
      - expectation_type: expect_column_values_to_not_be_null
        kwargs:
          column: customer_id
      - expectation_type: expect_column_values_to_be_in_set
        kwargs:
          column: status
          value_set:
            - active
            - inactive
            - pending
            - deleted
      - expectation_type: expect_compound_columns_to_be_unique
        kwargs:
          column_list:
            - customer_id
            - account_number
        meta:
          notes: This expectation checks that each combination of customer_id and account_number
            is unique, ensuring no duplicate records for these fields.
And there are of course variants, such as flattening the kwargs.
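To make the flattened variant concrete, here is a minimal sketch. It assumes that "flattened" means the arguments sit next to `expectation_type` instead of under `kwargs`; that shape is an assumption, not something the RFC defines, and `normalize_check` and `RESERVED_KEYS` are hypothetical names. It only illustrates that a parser could accept both forms and normalize them to the nested one.

```python
from typing import Any

# Keys assumed to belong to the check envelope rather than to its arguments.
RESERVED_KEYS = {"expectation_type", "kwargs", "meta"}


def normalize_check(check: dict[str, Any]) -> dict[str, Any]:
    """Return the check in nested form, with all arguments under ``kwargs``."""
    # Start from explicitly nested kwargs, if there are any.
    kwargs = dict(check.get("kwargs", {}))
    # Treat every other non-reserved top-level key as a flattened argument.
    for key, value in check.items():
        if key not in RESERVED_KEYS:
            kwargs[key] = value
    normalized = {"expectation_type": check["expectation_type"], "kwargs": kwargs}
    if "meta" in check:
        normalized["meta"] = check["meta"]
    return normalized


# The same check in both forms (the flattened shape is hypothetical).
nested = {
    "expectation_type": "expect_column_values_to_not_be_null",
    "kwargs": {"column": "customer_id"},
}
flattened = {
    "expectation_type": "expect_column_values_to_not_be_null",
    "column": "customer_id",
}
```

With this approach both `normalize_check(nested)` and `normalize_check(flattened)` yield the nested form, so exporters would only need to handle one shape.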
@saugerDecathlon I'd be interested in your opinion.
PR: #65
Idea
The intent is to define quality checks directly at the model with a well-defined (yet extensible) set of quality checks.
Constraints
Quality checks should be exportable to and executable through major data quality tools (soda-core, Great Expectations, dbt-expectations, Monte Carlo, plain SQL, ...)
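As a rough illustration of the plain-SQL export path, the sketch below renders the four checks from the example above as queries. It is not an existing exporter: `check_to_sql` and `_quote` are hypothetical names, the convention that each query returns a `violations` count (0 means the check passes) is an assumption, and identifiers are interpolated without quoting or dialect handling.

```python
from typing import Any


def _quote(value: Any) -> str:
    """Render a Python value as a SQL literal (strings get single quotes)."""
    if isinstance(value, str):
        return "'" + value.replace("'", "''") + "'"
    return str(value)


def check_to_sql(table: str, check: dict[str, Any]) -> str:
    """Render one check as a query returning a ``violations`` count."""
    kind = check["expectation_type"]
    kw = check.get("kwargs", {})
    if kind == "expect_table_row_count_to_be_between":
        # Table-level check: 0 if the row count is in range, otherwise 1.
        return (
            f"SELECT CASE WHEN COUNT(*) BETWEEN {kw['min_value']} AND {kw['max_value']} "
            f"THEN 0 ELSE 1 END AS violations FROM {table}"
        )
    if kind == "expect_column_values_to_not_be_null":
        return f"SELECT COUNT(*) AS violations FROM {table} WHERE {kw['column']} IS NULL"
    if kind == "expect_column_values_to_be_in_set":
        values = ", ".join(_quote(v) for v in kw["value_set"])
        return (
            f"SELECT COUNT(*) AS violations FROM {table} "
            f"WHERE {kw['column']} NOT IN ({values})"
        )
    if kind == "expect_compound_columns_to_be_unique":
        # Count the column combinations that occur more than once.
        cols = ", ".join(kw["column_list"])
        return (
            "SELECT COUNT(*) AS violations FROM ("
            f"SELECT {cols} FROM {table} GROUP BY {cols} HAVING COUNT(*) > 1"
            ") AS duplicates"
        )
    raise ValueError(f"unsupported expectation_type: {kind}")
```

A well-defined set of check types keeps a mapping like this small; an unknown `expectation_type` raises an error rather than being skipped silently, which is one possible way to treat extensions.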
Option A:
The Great Expectations expectation gallery might become a reference for additional well-defined checks: https://greatexpectations.io/expectations?viewType=Datasource&filterType=Backend+support&showFilters=true&subFilterValues=bigquery%2C+postgresql%2C+redshift%2C+snowflake%2C+spark
dbt-expectations also has a nice library: https://hub.getdbt.com/calogica/dbt_expectations/latest/
Soda Data Contracts Reference: https://docs.soda.io/soda/data-contracts-checks.html