PRQL / prql

PRQL is a modern language for transforming data — a simple, powerful, pipelined SQL replacement
https://prql-lang.org
Apache License 2.0
9.65k stars 208 forks source link

Null safe equality inconsistency #4289

Open CertainLach opened 4 months ago

CertainLach commented 4 months ago

What happened?

PRQL is not consistent with null comparisons, it transforms == null and != null to IS NULL and IS NOT NULL, yet does not emit similar things for two column comparisons.

Given that == is already null safe in some contexts, it should also be null-safe in others, I.e using

PRQL input

prql target:sql.postgres
from test
derive {
  a == "a",
  a == b,
  a == null && b == null
}

SQL output

SELECT
  *,
  a = 'a', -- Ok: nothing wrong will happen if a is NULL
  a = b, -- Error: Wrong result if a and b is NULL
  a IS NULL
  AND b IS NULL
FROM
  test

Expected SQL output

SELECT
  *,
  a = 'a',
  a IS NOT DISTINCT FROM b, -- Ok: We don't know if a or b might be NULL, so we need to be strict here
  a IS NULL
  AND b IS NULL
FROM
  test

MVCE confirmation

Anything else?

No response

max-sixty commented 4 months ago

Yes, agree this is unfortunate.

Worse — IS NOT DISTINCT FROM isn't supported by DBs such as BigQuery & ClickHouse.

So I think we need to consider that either columns are compared differently to scalars, or we have to go back to three-valued logic like SQL does...

CertainLach commented 4 months ago

BigQuery supports it: https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#is_distinct And ClickHouse support is partially implemented, in the same way as in MySQL: https://github.com/ClickHouse/ClickHouse/issues/54499

So I think we need to consider that either columns are compared differently to scalars

It is not only the columns, it is any expression which differs from literal NULL:

let b = null + 1
let c = null
from test
derive {
  a != null,
  a != b,
  a != c,
}
SELECT
  *,
  a IS NOT NULL,
  a <> NULL + 1,
  a IS NOT NULL
FROM
  test

Special-casing (_, Eq | Ne, Literal::Null) just feels wrong here.

I agree implementing this feature for every database might be complicated, and the universally-supported syntax is not good, and might feel clunky in many cases, but maybe this comparison should not be the default behavior?

Maybe the compiler should display error when it sees obviously wrong comparison (Which is currently being translated to IS NULL/IS NOT NULL), and suggest to use other method for NULL checking?

Maybe even some different operator, I.e ===, which will in turn translate to IS NULL/IS NOT NULL (like the current ==, in cases when one side of comparison is NULL), and to IS NOT DISTINCT FROM or very verbose universal syntax? Such operator with similar purposes exists in JS, PHP, Dart and Julia. I have taken this route in my postgres migration generator pet-project, because it feels like it makes most sense in case of SQL with default three-valued-logic: https://raw.githubusercontent.com/CertainLach/immigrant/master/crates/generator-postgres/examples/kitchen_sink.schema

max-sixty commented 4 months ago

BigQuery supports it: https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#is_distinct

Thanks, my mistake

It is not only the columns, it is any expression which differs from literal NULL:

Yes, great examples.

I agree implementing this feature for every database might be complicated, and the universally-supported syntax is not good, and might feel clunky in many cases, but maybe this comparison should not be the default behavior?

Maybe the compiler should display error when it sees obviously wrong comparison (Which is currently being translated to IS NULL/IS NOT NULL), and suggest to use other method for NULL checking?

Though how would be tell if a user is doing something obviously wrong?

Maybe even some different operator, I.e ===, which will in turn translate to IS NULL/IS NOT NULL (like the current ==, in cases when one side of comparison is NULL), and to IS NOT DISTINCT FROM or very verbose universal syntax? Such operator with similar purposes exists in JS, PHP, Dart and Julia.

Yup. the exact operator doesn't matter so much — I think this is just a tradeoff of "inherit small annoyances of SQL" vs. "try and create a better world, but then try and plug leaky abstractions"...

CertainLach commented 4 months ago

Though how would be tell if a user is doing something obviously wrong?

Well, If == will become completly not null-safe (Which it already is, it only handles null-safety in one case), then the code which is currently being translated to IS NULL / IS NOT NULL should turn into errors, because in SQL both = NULL and <> NULL comparisons are useless

Error: 
   ╭─[:3:8]
   │
 3 │   b == null
   │     ─┬ 
   │      ╰─── equality operator is not null-safe, using it to compare with null is incorrect.
   │ 
   │ Help: did you mean to use strict equality `===` operator?
───╯
max-sixty commented 4 months ago

then the code which is currently being translated to IS NULL / IS NOT NULL should turn into errors, because in SQL both = NULL and <> NULL comparisons are useless

Oh yes totally! Agree if we change to another operator.


So I think we're left with the decisions between:

Any thoughts @PRQL/prql-team ?

eitsupi commented 4 months ago

Here is one of the past discussions I remember: #905

On a related note, does the PRQL need to consider the treatment of NaN? https://en.wikipedia.org/wiki/IEEE_754

snth commented 4 months ago

As I tried to argue in https://github.com/PRQL/prql/issues/905#issuecomment-1866809983, I think having a representation of "missing" is key to data analysis and the default behaviour should be that null == null does NOT evaluate to True. I did previously argue that special-casing (_, Eq | Ne, Literal::Null) was worth keeping but that would go away with the introduction of a new operator like ===. Therefore I think I largely agree with @CertainLach .


My preferred solution would be:

Introduce ===

Have a == null generate an error

This would be a breaking change and the error should say that a === null is now available for this purpose.

Keep existing 3VL for ==

The reason being that this is key for data analysis and should be the default. For example I want to know when I've accidentally introduced NULLs in a LEFT JOIN and I don't want this silently eliminated unless I explicitly asked for it.

eitsupi commented 4 months ago

I too think that null == null evaluated as true would be confusing. The following occurs:

let x = 1
let y = 2
let z = null

x == y # false
z == z # true
(x + z) == (y + z) # true!?
vanillajonathan commented 4 months ago

An except about null handling from page 246 in The Third Manifesto.

The result of the Tutorial D expression

T WHERE I ≠ SQL_INTEGER ('20')

certainly does include those tuples of T for which the I value is SQL_INTEGER ('NULL'), unlike its SQL counterpart:

SELECT * FROM T WHERE I <> 20

(The symbol “<>” here is SQL syntax for “≠”.) We venture to suggest that—at least in some cases—D expressions involving tables derived from SQL databases with nulls will give results that are more intuitively acceptable than those of SQL expressions.

Which suggest that the PRQL query:

from Cats
filter Name == "Garfield"

should translate into:

SELECT * FROM Cats WHERE Name = 'Garfield' OR Name IS NULL;
snth commented 4 months ago

@vanillajonathan I think you have a typo in your example. I don't think what you've written should be true but I think that one could reasonably expect that

from Cats
filter Name != "Garfield"

should translate into:

SELECT * FROM Cats WHERE Name != 'Garfield' OR Name IS NULL;

For someone only familiar with boolean logic where the Law of the Excluded Middle holds, that would certainly be the expectation.

Stated differently, in such a world you would expect the following query to give you back the original dataset data:

let data = [{a=0}, {a=1}, {a=null}]

let where_true = (from data | filter a==1)
let where_false = (from data | filter a!=1)

from where_true
append where_false

However if you run this in the playground you will find that that's not the case.

The introduction of "missingness" into data analysis takes the view that when an observation is missing, we can't say much about it. So if we have a cat for which we don't know the name, we can't say whether the name is "Garfield" or not, but likewise we can't guarantee that it is not "Garfield". Therefore we need to introduce a 3 Value Logic where propositions are either TRUE, FALSE, or NULL.

So in order to recover our orginal dataset data we have to do the following:

let where_null = (from data | filter a==null)

from where_true
append where_false
append where_null

Another way to think about it is that if some of the missing data is filled in over time, you might expect your where_true set to grow in that case but you don't want your where_false set to shrink. That is because as your available information grows, you can expect additional propositions to become true, but previously true propositions shouldn't change their status and become false.

For example, say you have a table of mushroom species with an attribute is_poisonous which is either TRUE, FALSE, or NULL in cases where it hasn't been determined yet.

let good_to_eat = (
  from mushrooms
  filter is_poisonous != true
  )

You don't want your good_to_eat set to shrink at a later time after you've already consumed some mushrooms from that set. Therefore it's better to err on the side of caution, both in NULL comparisons, and when consuming unknown mushrooms. 🍄

With the introduction of the === operator as proposed, you would recover a binary logic, and that is fine, since that's what you've explicitly asked for in that case. So there

from data
filter a===1
append (from data | filter !== 1)

should recover the original set data.

vanillajonathan commented 4 months ago

I like the mushroom example. 👍️ Nullable booleans are really mean and perhaps shouldn't exist as they're better modeled as an enum with the variants Yes, No, and Undetermined.