dbt-labs / dbt-audit-helper

Useful macros when performing data audits
https://hub.getdbt.com/dbt-labs/audit_helper/latest/
Apache License 2.0
329 stars 40 forks source link

Fix for dbt-audit-helper issue #84 #85

Closed igor-lobanov-maersk closed 6 months ago

igor-lobanov-maersk commented 1 year ago

Escaping "count" field using double quotes, which I believe is SQL standard.

Description & motivation

I believe this is all that's required to fix #84

Checklist

igor-lobanov-maersk commented 1 year ago

Ok, see that this didn't work for Snowflake. Will take another look at syntax options.

dbeatty10 commented 1 year ago

@igor-lobanov-maersk I suspect that what you're seeing in the CI failures for dbt-snowflake is a result of the default quoting rules for dbt-snowflake:

image

You're changing it from being an unquoted identifier into a quoted one, but dbt-snowflake is configured to treat identifiers as unquoted.

I don't immediately know how to make all the adapters happy here.

Crux of the issue

The crux seems to be that (by default) dbt-snowflake doesn't want quoted identifiers, but dbt-dremio has count as a reserved keyword for identifiers.

Some resources to help research and troubleshoot:

One idea

Maybe something like this (using quote config):

models:
  - name: compare_queries
    columns:
      - name: count
        quote: true
    tests:
      - dbt_utils.equality:
          compare_model: ref('expected_results__compare_relations_without_exclude')
igor-lobanov-maersk commented 1 year ago

Thank you for getting back to me @dbeatty10, this is a good pointer. I will explore this further. It's a bit difficult for me to follow as I don't have access to a Snowflake environment, but I can try a few more things against the CI build.

Another way this could be addressed is if count was actually renamed to something less contentious like row_count. This should work everywhere. However, this would change 'public' interface of this module. Not sure to what extent it's a problem. What do you think?

dbeatty10 commented 1 year ago

Another way this could be addressed is if count was actually renamed to something less contentious like row_count. This should work everywhere. However, this would change 'public' interface of this module. Not sure to what extent it's a problem. What do you think?

This is an insightful idea and you're spot-on on the affect it would have on the public interface.

Since this has been a stable public interface for 4+ years, we'd be hesitant to change it.

We'd probably want to check if even merely changing this from unquoted to quoted would affect the public interface that folks may be using.

Let's brainstorm some ideas

First, I'm going to just list out a bunch of ideas (and temporarily withhold judgement if they are "good" or not):

  1. Change the public interface of dbt-audit-helper to not use any SQL reserved keywords (like count) as column names
  2. Dremio allow the SQL reserved keyword count as a column name (similarly to how it is apparently allowed by Snowflake, etc).
  3. Put specific hard-coded logic within dbt-audit-helper that adds quotes only when the adapter is dremio
  4. Put specific hard-coded logic within dbt-audit-helper that adds quotes for all adapters except for snowflake
  5. Create an adapter-level interface that specifies which words are empirically not accepted as unquoted table or column name (like this), and use this interface to decide whether to quote or not
  6. Create an adapter-level interface that specifies if the unquoted identifiers are uppercased, lowercased, or samecased within their information_schema.columns-compatible metadata store.
  7. Make no changes and accept that dbt-dremio users are not able to use the audit_helper.compare_queries macro directly as-is

Considering trade-offs

I'd prefer not to change the public interface since it's been stable. Although count is a reserved keyword, it seems many databases allow it to be unquoted in practice (all the databases tested within this package as well as all the databases listed here):

image

I'm not guessing the Dremio maintainers would want to change they way they are handling count, but that would be easiest for everyone else, and it would put them more in line with how other databases are handling it. It would also be a non-breaking change for users of Dremio. But we can't assume or control if that will happen.

I'd also strongly prefer not to add adapter-specific logic within audit_helper related specifically to Dremio. It would create maintenance issues of untested logic, and dbt packages should really be agnostic of details related to adapters.

Creating an adapter-level interface is actually my favorite idea, but it would be a more involved and long-term project than the scope of this PR.

Using dispatch is the most "dbt-onic" approach. It's takes a little bit more explanation for end users to use it, and it is necessary within each dbt project.

Let's make a decision

After reviewing all of the options and their trade-offs, dispatch is the way to go here. There's two options from the perspective of dbt-dremio maintainers:

Long-term, I'd still be quite interested in an adapter-level interface. i.e. an interface that can identify words that empirically determined to be invalid without quoting and/or an interface that surfaces how the database stores the casing of unquoted columns within its metadata.

github-actions[bot] commented 7 months ago

This PR has been marked as Stale because it has been open with no activity as of late. If you would like the PR to remain open, please comment on the PR or else it will be closed in 7 days.

github-actions[bot] commented 6 months ago

Although we are closing this PR as stale, it can still be reopened to continue development. Just add a comment to notify the maintainers.