dbt-labs / dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
https://getdbt.com
Apache License 2.0
9.77k stars 1.62k forks source link

[CT-2137] [Feature] Cross-database implementation of is_distinct_from #6997

Open joellabes opened 1 year ago

joellabes commented 1 year ago

Is this your first time submitting a feature request?

Describe the feature

If I had a dollar for every time I had done

select a, b, 
  a = b or (a is null and b is null) as are_the_same
from table

or

select a, b, 
  coalesce(a, -1) = coalesce(b, -1) as are_the_same --very dangerous! do not do!
from table

, I would have a lot of money.

Turns out that there is a solution for this in the sql standard!

image

We should add this as a cross-db macro - a speedy Google looks like most of our key adapters support it natively except for Redshift

Describe alternatives you've considered

Doing nothing.

Who will this benefit?

This seems useful for two reasons:

Are you interested in contributing this feature?

Yep! But it'll be my first cross-adapter contribution so I'll need some handholding

Anything else?

No response

dbeatty10 commented 1 year ago

Thanks for opening @joellabes !

What kind of syntax are you imagining, something like this?

{{ dbt.is_distinct_from(a, b) }}

Would you imagine it being being paired with is_not_distinct_from too? Or just rely on the user to do the negation?

For the Redshift implementation (or any other non-conforming database), we'd want to be diligent with the implementation. As you noted as dangerous!, we wouldn't want to do this implementation, but one that works correctly for all possible input values. (I suspect the SQL in that link would be incorrect when one of width & height is 0 and the other is NULL.)

The truth table here would be handy for integration tests.

Side note: both of the SQL examples you gave would be awesome for a SQL linter to alert about!

joellabes commented 1 year ago

What kind of syntax are you imagining, something like this?

{{ dbt.is_distinct_from(a, b) }}

Yeah I think that sounds right to me!

Would you imagine it being being paired with is_not_distinct_from too? Or just rely on the user to do the negation?

I imagined that the user would BYON (bring your own negation)

Side note: both of the SQL examples you gave would be awesome for a SQL linter to alert about!

As in it would say "hey did you know there is a better way to do this"? That would be very cool.

dbeatty10 commented 1 year ago

Two follow-up questions:

  1. Are there two or three instances where you've seen a this type of logic expressed (like in the metrics package or other packages? This could help us refine if we're getting the syntax right or not, especially if users are writing their own negations.
  2. Do you know what the generic SQL expression would look like for databases that don't natively support is distinct from?
owenprough-sift commented 1 year ago
2. Do you know what the generic SQL expression would look like for databases that don't natively support `is distinct from`?

Probably something like what Joel had in the first post:

( ( a = b ) or ( ( a is null ) and ( b is null ) ) ) /* use as many/few parens as make you feel safe, I reckon */
joellabes commented 1 year ago
  1. Are there two or three instances where you've seen a this type of logic expressed (like in the metrics package or other packages? This could help us refine if we're getting the syntax right or not, especially if users are writing their own negations.

No, but that's a 100% fair question before putting it into Core. I desperately hope that my code is no longer in the metrics package 😂 (@callum-mcdata?)

I imagine negations would be

select a, b, 
not ({{ dbt.is_distinct_from('a', 'b') }}) as are_the_same
from {{ ref('the_table') }}

and joins would be

select *
from a
inner join b on not ({{ dbt.is_distinct_from('a.c', 'b.c') }})
and not ({{ dbt.is_distinct_from('a.d', 'b.d') }})
  1. Do you know what the generic SQL expression would look like for databases that don't natively support is distinct from?

Yeah agreed with Owen! mo' brackets le' problems, as they say

dbeatty10 commented 1 year ago

Names of the features within the standard

Part 2 of the SQL standard (ISO/IEC 9075-2) added these IDs and names in SQL:1999 and SQL:2003, respectively ¹:

Should we flip it?

All the examples we've looked at so far are actually for T152 rather than T151!

So I'm wondering, maybe we want to actually implement the equivalent of {{ dbt.is_not_distinct_from('a', 'b') }} rather than {{ dbt.is_distinct_from('a', 'b') }}? To know one way or the other, I think we'll want to see how null-safe comparisons are used most often.

Pondering names

There are only two hard things in Computer Science: cache invalidation, naming things, and three-valued logic.

I can't say I fully grok it all, but I think the source of this issue we're trying to solve is that a = b and a != b aren't intuitive to humans because of SQL's three-valued logic.

I'm guessing that referring to T152 with one of the alternatives below is borderline blasphemous for some reason. Can someone explain to me like I'm five why they are a bad idea?

Shortening the name is inspired by SQLite's is and MySQL/MariaDB's <=> spaceship operator.

Using is might(?) actually have some synergy with optional feature F571, "Truth value tests" from the SQL standard.

Let's find some examples

Let's examine some examples from the wild to confirm/deny if T151 or T152 semantics are most common. A good place to search would be the GitHub repos of the packages listed at https://hub.getdbt.com. Some (untested!) case-insensitive ideas for regular expressions to use with grep:

callum-mcdata commented 1 year ago

Oh dang - I would have loved this functionality about 6 weeks ago when I was thinking about how we could support null-safe joins. More information in this issue here..

The use case for the metrics package was specifically around joining the outputs of two unique metrics queries where the values within a particular column might be null. Users were running into issues when the null values of dimension_a from metric_a weren't able to join to the null values from dimension_a from metric_b. This led to "misreporting" of numbers because the metric associated with the null values wasn't included in the join.

dbeatty10 commented 1 year ago

It could be nice to implement this macro to make it easy to express the logic in cases like https://github.com/dbt-labs/dbt-adapters/issues/159, dbt snapshots, etc.

Refinement needed # 1

The biggest outstanding decision is naming.

Pros: equals(a, b) makes a lot more intuitive sense to me than is_not_distinct_from. It's also a lot easier to write than more explicit alternatives like null_safe_equals.

Cons: I'm unaware of the strongest arguments against the simple & intuitive naming, but would welcome that insight!

Summary: I'm attracted to something simple with clear intent like equals rather than is_not_distinct_from.

Refinement needed # 2

The other outstanding decision is whether to implement both the NULL-aware comparisons of equality and inequality or just equality.

(I'm thinking we wouldn't want to do only inequality as originally requested because the SQL is quite a bit more complicated and equality might be more common.)

If we support both, then the syntax and rendered SQL would be like the following: {{ dbt.equals(a, b) }}

( (a = b) or (a is null and b is null) )

{{ dbt.not_equals(a, b) }}

not ( (a = b) or (a is null and b is null) )

Versus if we only do equality, then inequality would need to be expressed like this:

not {{ dbt.equals(a, b) }}

not ( (a = b) or (a is null and b is null) )

Pros: Doing both might make them easier to discover and use. The implementation of not equals follows trivially from equals. Opportunity for adapters to optimize both implementations if needed.

Cons: Two different macros to maintain, document, and test. Two different ways for the user to express inequality.

Summary: As long as there is a macro for expressing equality, it doesn't seem necessary that we offer a macro for inequality.

joellabes commented 1 year ago

Question 1

I am in two minds:

The good news is that my friends at Snowflake have split the difference, with what I assume is a wrapper over the top of is distinct from: equal_null.

So I would vote for either that, or something like safe_equals in the same vein as safe_add, safe_divide...

Question 2

I agree with you, I don't think we need both - people can prefix a not if they need to.

dbeatty10 commented 1 year ago

Naming

Most frequently, you'll find me on "team standards" which would have me throwing my weight behind dbt.is_not_distinct_from as the name for null-safe equals comparison. But it's unintuitive enough that I don't think it will solve our problem here. Same goes for equal_null.

After thinking on this quite a bit, equals feels best. Intuitive. Simple. Easy.

I'd expect it to be rare for people to be surprised about its behavior. Rather, it's simple enough to be used without reading or consulting the documentation. Can't imagine a sweeter spot than that.

Implementation

Due to the trickiness of SQL's three-valued logic, the implementation needs to be like this:

{% macro equals(actual, expected) %}
{# -- actual is not distinct from expected #}
case when {{ actual }} = {{ expected }} or ({{ actual }} is null and {{ expected }} is null)
    then 0
    else 1
end = 0
{% endmacro %}

Since postgres supports is [not] distinct from, it can be used to show why the case-when is necessary to force either a true or a false value (rather than stubbornly propagating null):

Example ```sql with x as ( select cast(1 as integer) as i union all select cast(2 as integer) as i union all select cast(null as integer) as i ), permutations as ( select x1.i as x1_i, x2.i as x2_i from x as x1, x as x2 order by x1.i, x2.i ) select -- values to compare x1_i, x2_i, -- logic for equals macro x1_i = x2_i or (x1_i is null and x2_i is null) as equals_logic, -- compliant syntax for null-safe equality x1_i is not distinct from x2_i as is_not_distinct_from, -- case/when logic for equals macro case when x1_i = x2_i or (x1_i is null and x2_i is null) then 0 else 1 end = 0 as case_equals_macro, -- negation of logic for equals macro not (x1_i = x2_i or (x1_i is null and x2_i is null)) as not_equals_logic, -- compliant syntax for null-safe inequality x1_i is distinct from x2_i as is_distinct_from, -- negation of case/when logic for equals macro not case when x1_i = x2_i or (x1_i is null and x2_i is null) then 0 else 1 end = 0 as case_equals_macro from permutations ; ```
github-actions[bot] commented 10 months ago

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

dbeatty10 commented 10 months ago

Implementation

In dbt-labs/dbt-core#7776, we implemented a null-safe equals macro for the purposes of supporting test cases:

https://github.com/dbt-labs/dbt-core/blob/5488dfb9926c3f81e6a80bd0d8b01fe7f0ac7a60/tests/adapter/dbt/tests/adapter/utils/base_utils.py#L5-L10

Its logic is tested here (which uses these fixtures).

It is only used within functional tests in pytest; it is not currently available to the adapter outside of the pytest environment.

We could resolve this ticket by:

  1. Lifting-and-shift this macro definition into this folder.
  2. Changing the macro name to be {% macro default__equals(expr1, expr2) -%} (similar to this)
  3. Adding the appropriate dispatch definition (similar to this)