calogica / dbt-expectations

Port(ish) of Great Expectations to dbt test macros
https://calogica.github.io/dbt-expectations/
Apache License 2.0
1.06k stars 130 forks source link

Tests that expect regex patterns are painful to use in BigQuery #190

Closed adamcunnington-mlg closed 2 years ago

adamcunnington-mlg commented 2 years ago

Functions such as expect_column_values_to_match_regex are painful to use in BigQuery because of the intersection of YAML escape sequences, BigQuery escape sequences and implementation of the tests, by dbt-expectations (I'm not sure actually if dbt-core has some responsibility here to fix this so do say if the fix doesn't belong here).

Given some trivial regex expression, "\d{4}", in BigQuery, I would make use of raw strings to avoid illegal escape sequences, e.g. r"\d{4}" but dbt-expectations/dbt expects my yaml to be a string and so I can't use r otherwise it'll be interpreted as 'r"\d{4}"'.

Probably dbt-expectations should be smart enough to insert r in front of the string if the adapter supports it (such as BigQuery).

In the meantime, I probably have to double escape the \. Once to satisfy YAML escaping and twice to satisfy BigQuery escaping when a raw string is not used.

E.g. I probably need to do:

    regex: "\\\d{4}" # note that the encasing "" are irrelevant here.
clausherther commented 2 years ago

Thanks @adamcunnington-mlg for raising this issue! I don't have a ton experience using regex on BigQuery, but this is good to know. One thing to always keep in mind is that all we're doing in dbt is generate SQL strings, which sometimes results in these weird quoting and escape scenarios. In your example, what should the fully rendered SQL look like? Re: your suggestion, happy to make/merge changes to the BigQuery (and other) implementations if that eases the pain. Would you be interested in doing a PR for this?

adamcunnington-mlg commented 2 years ago

Thanks @clausherther.

Here is what I expected:

my dbt YAML

columns:
- name: time
   tests:
   - dbt_expectations.expect_column_values_to_match_regex:
       regex: "^\\d{4}-\\d{2}$"  # unfortunately, I'm not sure the double \ could be avoided here because we have to satisfy YAML escaping regardless

compiled SQL

...
regexp_instr(title, r'^\d{4}-\d{2}$', 1, 1)
...

Instead, the r is not present in final SQL and so I had to use this horrible escape sequence: YAML:

columns:
- name: time
   tests:
   - dbt_expectations.expect_column_values_to_match_regex:
       regex: "^\\\\d{4}-\\\\d{2}$"

which leads to compiled SQL:

...
regexp_instr(title, '^\\d{4}-\\d{2}$', 1, 1)
...
clausherther commented 2 years ago

Cool, thanks for example. So, what does the r actually do, if we were to pre-pend that? Is that maybe something that should be optional as a parameter?

Btw, how do you get from d{4} to w{3} in your examples?

adamcunnington-mlg commented 2 years ago

@clausherther With a magical typo! Sorry - corrected.

r"" is common in many syntaxes/languages to mean "raw string" - e.g. do not recognise escape sequences in following string.

See Raw string in the table of this page

clausherther commented 2 years ago

@adamcunnington-mlg would you have an example of data that would match regexp_instr(title, r'^\d{4}-\d{2}$', 1, 1)?My regex fu is non-existent, and I'll need to add a integration test for this. Thanks!

clausherther commented 2 years ago

I guess something like a 5+3 US zip code would match regexp_instr(postal_code, r'^\d{5}-\d{3}$', 1, 1)?

clausherther commented 2 years ago

Btw, doesn't look like either Postgres or Snowflake support r as a raw string indicator. Will have to do more research here before we can implement something like that.

Update: looks like Snowflake uses $$..$$, so in your pattern above, for both BigQuery and Snowflake, I think we can drop the final $.

adamcunnington-mlg commented 2 years ago

@clausherther sure! \d{4}-\d{2} is a typical YYYY-MM date, e.g. 2022-09.

Your guess post code example works too.

$ is a special character in regex that means "end of string" - not sure what implications are there for snowflake.

EDIT: I see what you mean about Snowflake - it's from the docs here. Snowflake is weird - this is a dumb way to handle regex strings. r"" is the standard in virtually every other language i've seen - but anyway, your code can have a concept of raw string wrapping characters - that should allow you to accommodate different dialects in a generalised way.

adamcunnington-mlg commented 2 years ago

@clausherther what are your thoughts on this? cheers

clausherther commented 2 years ago

@adamcunnington-mlg I had a PR here, but I'm not sure how generalizable this and if it would break a lot of folks code. The edge cases are tricky to test b/c it depends so much on user input. I'm leaning towards maybe making raw strings optional, i.e. if you pass in a raw string, you'll need to specify it as such via something like is_raw_string: True. What do you think?

adamcunnington-mlg commented 2 years ago

@clausherther I think that would be absolutely fine - could always be evolved later