datacontract / datacontract-specification

The Data Contract Specification Repository
https://datacontract.com/
MIT License
278 stars 41 forks source link

Quality Checks #55

Closed jochenchrist closed 1 week ago

jochenchrist commented 7 months ago

Idea

The intent is to define quality checks directly at the model with a well-defined (yet extensible) set of quality checks.

Constraints

Quality Checks should be exportable to and executable through major data quality tools (soda-core, great-expectation, dbt-expectations, montecarlo, plain-SQL ...)

Option A:

models:
  orders:
    fields: ...
    quality: 
      # example for a well-defined check type `row_count`, that comes with a defined list of arguments
      - type: row_count
        must_be_greater_than: 10000
        description: It should always contains at least 10.000 orders.

      # Support for custom queries and metric functions: https://docs.snowflake.com/en/sql-reference/sql/create-data-metric-function. Query must return a number or boolean value.
      - type: sql
        description: There cannot be an order with negative order_total, except it contains a voucher code
        query: |
          SELECT
              COUNT(*)
            FROM orders
            WHERE
              order_total<0
              AND arg_c2>0
              AND arg_c3>0
        must_be_equal_to: 0
        # Problem: SQL is dialect-specific

      # support for plain-text checks that are not yet implemented?
      - type: business-rule   # alternatives: text, descriptive,  business, plain-text
        description: It should always contains at least 10.000 orders.

      # a more complex example
      - type: expect_row_values_to_have_data_for_every_day
        field: order_date

      # a great expectation quality check
     - type: gx
       expectation_type: expect_column_to_have_something
       kwargs:
         column: transaction_type
         value_set: ["purchase", "refund", "upgrade"]

Great Expectations expectation gallery that might become a reference for additional well-defined checks: https://greatexpectations.io/expectations?viewType=Datasource&filterType=Backend+support&showFilters=true&subFilterValues=bigquery%2C+postgresql%2C+redshift%2C+snowflake%2C+spark

dbt-expectations also has a nice library: https://hub.getdbt.com/calogica/dbt_expectations/latest/

Soda Data Contracts Reference: https://docs.soda.io/soda/data-contracts-checks.html

jochenchrist commented 7 months ago

It would be great if we could adopt https://github.com/bitol-io/tsc/blob/main/rfcs/0007-data-quality.md

jochenchrist commented 7 months ago

Option B: Adopt Great Expectation Syntax

models:
  orders:
    fields: ...
    quality:
    - expectation_type: expect_table_row_count_to_be_between
      kwargs:
        min_value: 1
        max_value: 1000000
    - expectation_type: expect_column_values_to_not_be_null
      kwargs:
        column: customer_id
    - expectation_type: expect_column_values_to_be_in_set
      kwargs:
        column: status
        value_set:
        - active
        - inactive
        - pending
        - deleted
    - expectation_type: expect_compound_columns_to_be_unique
      kwargs:
        column_list:
        - customer_id
        - account_number
      meta:
        notes: This expectation checks that each combination of customer_id and account_number
          is unique, ensuring no duplicate records for these fields.

And there are of course variants, such as flatten kwargs.

simonharrer commented 7 months ago

@saugerDecathlon I‘d be interested in your opinion.

jochenchrist commented 6 months ago

PR: #65