calogica / dbt-expectations

Port(ish) of Great Expectations to dbt test macros
https://calogica.github.io/dbt-expectations/
Apache License 2.0
1.06k stars 128 forks source link

[BUG] regexp_instr does not validate regex in PostgreSQL #249

Closed lookslikeitsnot closed 1 year ago

lookslikeitsnot commented 1 year ago

Is this a new bug in dbt-expectations?

Current Behavior

Any call to the regexp_instr macro will be valid for PostgreSQL. This is due to the generated expression and the handling of nulls by Postgres.

Dissecting the generated expression:

  with grouped_expression as (
    select array_length(
        (
          select regexp_matches(postal_code_5_3, '^\d{5}-\d{3}', '')
        ),
        1
      ) > 0 as expression
    from <db_name>.dbt_expectations_integration_tests.data_text
  ),

  validation_errors as (
    select *
    from grouped_expression
    where not(expression = true)
  )

  select *
  from validation_errors
  1. select regexp_matches('<valid_source>', '<valid_regex>', '<any_flags>') will always return either no results or a text[].
  2. Getting the array_length of that will either be null or >0
  3. Comparing that to 0 will return null or true respectively
  4. Comparing that to true will once again return null or true respectively
  5. Negating that will return null or false which have the same full filter effect in a where clause

Expected Behavior

regexp_instr should fail when regex doesn't match.

Steps To Reproduce

Change any regex in "dbt-expectations/integration_tests/models/schema_tests/schema.yml" with anything else (e.g. replace the '^\d{5}-\d{3}' regex with 'definitely_not_a_valid_postal_code') , the tests will still succeed.

Environment

- OS: Manjaro Linux 22.0.2
- Python: 3.10.9
- dbt: 1.4.5
- dbt-expectations: 0.8.3

Which database adapter are you using with dbt?

Postgres

Suggested fix

Fix with the widest Postgres version compatibility range would probably be to coalesce the array_length i.e. to replace array_length((select regexp_matches({{ source_value }}, '{{ regexp }}', '{{ flags }}')), 1) with coalesce(array_length((select regexp_matches({{ source_value }}, '{{ regexp }}', '{{ flags }}')), 1), 0) in "dbt-expectations/macros/regex/regexp_instr.sql"

clausherther commented 1 year ago

Closed via #250