calogica / dbt-expectations

Port(ish) of Great Expectations to dbt test macros
https://calogica.github.io/dbt-expectations/
Apache License 2.0

Update regexp_instr.sql #226

Closed · fivetran-joemarkiewicz closed 1 year ago

fivetran-joemarkiewicz commented 1 year ago

The Error

The Spark regexp_instr function takes only two arguments, whereas the default version of the macro in this package passes four. As a result, I receive the following error when running in Databricks.

18:01:44  Completed with 1 error and 0 warnings:
18:01:44
18:01:44  Runtime Error in test dbt_expectations_expect_column_values_to_not_match_regex_list_stg_google_ads__ad_history_source_final_urls__any___ (models/stg_google_ads.yml)
18:01:44    Invalid number of arguments for function regexp_instr. Expected: one of 2 and 3; Found: 4; line 22 pos 0
18:01:44
18:01:44  Done. PASS=25 WARN=0 ERROR=1 SKIP=0 TOTAL=26
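For context, the default dispatch renders a four-argument call roughly along these lines (a simplified sketch; the actual macro in the package may have additional arguments and quoting logic):

```sql
{% macro default__regexp_instr(source_value, regexp, position, occurrence) %}
    {# renders a four-argument call, which Spark/Databricks rejects #}
    regexp_instr({{ source_value }}, '{{ regexp }}', {{ position }}, {{ occurrence }})
{% endmacro %}
```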

Source of the Error

When looking further into the behavior, I found the function and its accepted arguments for the 11.2 (and greater) Databricks runtime documented here. You can see in the docs that the only arguments it accepts are str and regexp. These map 1:1 to the source_value and regexp arguments of the regexp_instr macro within this package.
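For illustration, this is the two-argument form that Databricks accepts (made-up values):

```sql
-- The only form Spark SQL accepts, per the docs above
SELECT regexp_instr('hello world', 'o');
-- returns 5, the 1-based position of the first match
```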

The Proposed Solution

I noticed that in v0.6.0 the spark dispatch was removed from the macro. However, when upgrading to the latest version of the package, I hit the above error because the default dispatch passes four arguments, whereas Databricks accepts only two for this function.

As such, I am proposing adding the spark dispatch back into the macro to resolve this error. The modification is quite minor: it only removes the position and occurrence arguments from the function call (see the sketch after the run output below). When I made this change and ran locally, I saw the following result.

## My configured yml
      - name: source_final_urls
        description: The original list of final urls expressed as an array. Please be aware the test used on this field is intended to warn you if you have fields with multiple urls. If you do, the `final_url` field will filter down the urls within the array to just the first. Therefore, this package will only leverage one of possibly many urls within this field array.
        tests:
          - dbt_expectations.expect_column_values_to_not_match_regex_list:
              regex_list: ","
              match_on: any
              severity: warn
18:28:40  Completed with 1 warning:
18:28:40  
18:28:40  Warning in test dbt_expectations_expect_column_values_to_not_match_regex_list_stg_google_ads__ad_history_source_final_urls__any___ (models/stg_google_ads.yml)
18:28:40    Got 1 result, configured to warn if != 0
18:28:40  
18:28:40    compiled Code at target/compiled/google_ads_source/models/stg_google_ads.yml/dbt_expectations_expect_column_f02cb56d69a9df1e600b3958899ceaa5.sql
18:28:40  
18:28:40  Done. PASS=25 WARN=1 ERROR=0 SKIP=0 TOTAL=26

I can see it compiled with the warning as expected!
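For reference, the change I'm proposing is roughly this (a sketch; the signature here is assumed to mirror the default macro's, and the real diff is in this PR):

```sql
{% macro spark__regexp_instr(source_value, regexp, position, occurrence) %}
    {# position and occurrence are accepted for signature compatibility but
       dropped, since Spark/Databricks only takes the str and regexp arguments #}
    regexp_instr({{ source_value }}, '{{ regexp }}')
{% endmacro %}
```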

Additional Details

Let me know if I missed something that allows me to configure the arguments for spark. However, I did not see this at first inspection. Happy to collaborate more on this if needed. Thanks!

clausherther commented 1 year ago

Hi @fivetran-joemarkiewicz, thanks for this PR! We actually removed the spark dispatch a while back as part of a general revision to remove explicit spark and databricks support from the package. This is mainly because we currently have no CI for those platforms and thus can't reliably test those macros; we only support Postgres, BigQuery, and Snowflake for that reason. I would recommend adding this to the spark-utils package as a shim (similar to how that package provides shims for dbt-utils and snowplow package macros). In the past, folks have also created similar shims for dbt-expectations and dbt-date in the tsql-utils package. Hope that works for you!
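For example, the way these shims typically get wired up is via the dispatch config in a user's dbt_project.yml. A sketch, assuming the shim lands in spark_utils (spark-utils does the same thing for dbt_utils today):

```yaml
# dbt_project.yml (illustrative): point dispatch for the dbt_expectations
# namespace at the shim package first, assuming that package defines a
# spark__ version of the macro
dispatch:
  - macro_namespace: dbt_expectations
    search_order: ['spark_utils', 'dbt_expectations']
```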

clausherther commented 1 year ago

Btw, I realize there's a dangling redshift dispatch in that file that belies what I said above, but some of these exist because Redshift is inconsistent in its Postgres compatibility. We should clean that up, though.

fivetran-joemarkiewicz commented 1 year ago

Thanks so much for the suggestion @clausherther and my apologies for not noticing the spark/databricks support was removed a while ago 🤦.

I just opened an Issue on spark-utils (linked above) detailing my plans for moving this PR into the spark-utils project.

In the meantime, since I am already deep within this macro I can give the redshift removal a try and confirm if you are in the clear to remove that piece of code from this macro. If all looks good on my end I can edit my PR to remove the Redshift piece as opposed to adding the spark dispatch.

I'll plan to have an update for you by the end of the day!

clausherther commented 1 year ago

> Thanks so much for the suggestion @clausherther and my apologies for not noticing the spark/databricks support was removed a while ago 🤦.
>
> I just opened an Issue on spark-utils (linked above) detailing my plans for moving this PR into the spark-utils project.
>
> In the meantime, since I am already deep within this macro I can give the redshift removal a try and confirm if you are in the clear to remove that piece of code from this macro. If all looks good on my end I can edit my PR to remove the Redshift piece as opposed to adding the spark dispatch.
>
> I'll plan to have an update for you by the end of the day!

Hi Joe, great, thanks! Re: the Redshift dispatch, let's leave that in for now, since removing it would be a potentially breaking change. Unlike the spark dispatch macros, the redshift macros don't require anything special, since dbt falls back to the postgres implementation if there's no redshift version. Usually that works, which is why a couple of folks are able to use the package on Redshift despite the lack of official support. This redshift macro is a concession for a case where Redshift doesn't speak pure Postgres. Hope that makes sense.
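To sketch what I mean by the fallback (simplified signature, not the exact macro):

```sql
{% macro regexp_instr(source_value, regexp, position, occurrence) %}
    {# On Redshift, dbt searches for redshift__regexp_instr first, then falls
       back to postgres__regexp_instr (dbt-redshift builds on dbt-postgres),
       and finally default__regexp_instr #}
    {{ adapter.dispatch('regexp_instr', 'dbt_expectations')(source_value, regexp, position, occurrence) }}
{% endmacro %}
```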

fivetran-joemarkiewicz commented 1 year ago

Hey @clausherther, moments after I sent my message above I realized that removing this would indeed be quite a breaking change. I attempted to remove it in my testing environment and found the postgres dispatch was used, just like you mentioned, and it did not play nicely with Redshift.

I think this is an appropriate concession, and it also helps my case to leave it in! Thanks again for the detailed explanation and for guiding me toward shimming the spark/databricks dispatch version.

This all makes sense and we can close this PR. 😄