Closed adamcunnington-mlg closed 2 years ago
Thanks @adamcunnington-mlg for raising this issue! I don't have a ton experience using regex on BigQuery, but this is good to know. One thing to always keep in mind is that all we're doing in dbt is generate SQL strings, which sometimes results in these weird quoting and escape scenarios. In your example, what should the fully rendered SQL look like? Re: your suggestion, happy to make/merge changes to the BigQuery (and other) implementations if that eases the pain. Would you be interested in doing a PR for this?
Thanks @clausherther.
Here is what I expected:
my dbt YAML
columns:
- name: time
tests:
- dbt_expectations.expect_column_values_to_match_regex:
regex: "^\\d{4}-\\d{2}$" # unfortunately, I'm not sure the double \ could be avoided here because we have to satisfy YAML escaping regardless
compiled SQL
...
regexp_instr(title, r'^\d{4}-\d{2}$', 1, 1)
...
Instead, the r
is not present in final SQL and so I had to use this horrible escape sequence:
YAML:
columns:
- name: time
tests:
- dbt_expectations.expect_column_values_to_match_regex:
regex: "^\\\\d{4}-\\\\d{2}$"
which leads to compiled SQL:
...
regexp_instr(title, '^\\d{4}-\\d{2}$', 1, 1)
...
Cool, thanks for example. So, what does the r
actually do, if we were to pre-pend that? Is that maybe something that should be optional as a parameter?
Btw, how do you get from d{4}
to w{3}
in your examples?
@clausherther With a magical typo! Sorry - corrected.
r""
is common in many syntaxes/languages to mean "raw string" - e.g. do not recognise escape sequences in following string.
@adamcunnington-mlg would you have an example of data that would match regexp_instr(title, r'^\d{4}-\d{2}$', 1, 1)
?My regex fu is non-existent, and I'll need to add a integration test for this. Thanks!
I guess something like a 5+3 US zip code would match regexp_instr(postal_code, r'^\d{5}-\d{3}$', 1, 1)
?
Btw, doesn't look like either Postgres or Snowflake support r
as a raw string indicator. Will have to do more research here before we can implement something like that.
Update: looks like Snowflake uses $$..$$
, so in your pattern above, for both BigQuery and Snowflake, I think we can drop the final $
.
@clausherther sure!
\d{4}-\d{2}
is a typical YYYY-MM
date, e.g. 2022-09
.
Your guess post code example works too.
$
is a special character in regex that means "end of string" - not sure what implications are there for snowflake.
EDIT: I see what you mean about Snowflake - it's from the docs here. Snowflake is weird - this is a dumb way to handle regex strings. r"" is the standard in virtually every other language i've seen - but anyway, your code can have a concept of raw string wrapping characters - that should allow you to accommodate different dialects in a generalised way.
@clausherther what are your thoughts on this? cheers
@adamcunnington-mlg I had a PR here, but I'm not sure how generalizable this and if it would break a lot of folks code. The edge cases are tricky to test b/c it depends so much on user input.
I'm leaning towards maybe making raw strings optional, i.e. if you pass in a raw string, you'll need to specify it as such via something like is_raw_string: True
. What do you think?
@clausherther I think that would be absolutely fine - could always be evolved later
Functions such as expect_column_values_to_match_regex are painful to use in BigQuery because of the intersection of YAML escape sequences, BigQuery escape sequences and implementation of the tests, by dbt-expectations (I'm not sure actually if dbt-core has some responsibility here to fix this so do say if the fix doesn't belong here).
Given some trivial regex expression,
"\d{4}"
, in BigQuery, I would make use of raw strings to avoid illegal escape sequences, e.g.r"\d{4}"
but dbt-expectations/dbt expects my yaml to be a string and so I can't user
otherwise it'll be interpreted as'r"\d{4}"'
.Probably dbt-expectations should be smart enough to insert r in front of the string if the adapter supports it (such as BigQuery).
In the meantime, I probably have to double escape the
\
. Once to satisfy YAML escaping and twice to satisfy BigQuery escaping when a raw string is not used.E.g. I probably need to do: