calogica / dbt-expectations

Port(ish) of Great Expectations to dbt test macros
https://calogica.github.io/dbt-expectations/
Apache License 2.0
1.06k stars 128 forks source link

Expect table column names and dtypes #254

Closed adammarples closed 1 year ago

adammarples commented 1 year ago

This PR creates a schema test which will directly compare the column name : data type mapping of the relation with a given python dictionary.

After comparing upper cased column names and data types, it will run an EXCEPT/UNION difference type query which will return any differences found between the two mappings and returns rows containing the difference and a found_in column explaining the error

image

This is intended to be a complete coverage for schema changes in place more verbose workarounds like this https://madisonmae.substack.com/p/tutorial-using-dbt-to-test-for-schema

clausherther commented 1 year ago

Hi, thanks for the PR! I think I understand the intent here (save time on typing a schema definition and tests), but I think the current dbt-ish way to accomplish this is to define each column in a schema.yml file and then add the appropriate type tests, which we already have with expect_column_values_to_be_of_type and expect_column_values_to_be_in_type_list.

clausherther commented 1 year ago

If we were to go down this route, I would prefer not to use a Python dict, but use YAML to define structure since that's the commonly used approach in this repo and other dbt package repos.

adammarples commented 1 year ago

Hi @clausherther, thanks for the reply. I have modified the mapping to be pure YAML, like this.

sources:
  - name: my_source
    tables:
      - name: my_table
        tests:
          - dbt_expectations.expect_table_column_names_and_dtypes:
              mapping:
                 name: VARCHAR
                 birthday: DATE
                 age_in_years: INTEGER

Appreciate what you are saying about the dbt way of doing things. The intent here is really to try and provide a full check on changing source schemas without the yaml getting too verbose. It's not so much to save time but to save long yaml blocks and mental overhead in reading and maintaining them. The current way seems to be a combination of the following:

sources:
  - name: my_source
    tables:
      - name: my_table
        tests:
          - dbt_expectations.expect_table_column_count_to_equal:
              value: 3
          - dbt_expectations.expect_table_columns_to_match_set:
              column_list: ["name", "birthday", "age_in_years"]
        columns:
          - name: name
            tests:
              - dbt_expectations.expect_column_to_exist
              - dbt_expectations.expect_column_values_to_be_of_type:
                    column_type: varchar
          - name: birthday
            tests:
              - dbt_expectations.expect_column_to_exist
              - dbt_expectations.expect_column_values_to_be_of_type:
                    column_type: date
          - name: age_in_years
            tests:
              - dbt_expectations.expect_column_to_exist
              - dbt_expectations.expect_column_values_to_be_of_type:
                    column_type: integer

Completely up to you, I will leave this here