Open CertainLach opened 4 months ago
Yes, agree this is unfortunate.
Worse — IS NOT DISTINCT FROM
isn't supported by DBs such as BigQuery & ClickHouse.
So I think we need to consider that either columns are compared differently to scalars, or we have to go back to three-valued logic like SQL does...
BigQuery supports it: https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#is_distinct And ClickHouse support is partially implemented, in the same way as in MySQL: https://github.com/ClickHouse/ClickHouse/issues/54499
So I think we need to consider that either columns are compared differently to scalars
It is not only the columns, it is any expression which differs from literal NULL:
let b = null + 1
let c = null
from test
derive {
a != null,
a != b,
a != c,
}
SELECT
*,
a IS NOT NULL,
a <> NULL + 1,
a IS NOT NULL
FROM
test
Special-casing (_, Eq | Ne, Literal::Null)
just feels wrong here.
I agree implementing this feature for every database might be complicated, and the universally-supported syntax is not good, and might feel clunky in many cases, but maybe this comparison should not be the default behavior?
Maybe the compiler should display error when it sees obviously wrong comparison (Which is currently being translated to IS NULL/IS NOT NULL), and suggest to use other method for NULL checking?
Maybe even some different operator, I.e ===
, which will in turn translate to IS NULL
/IS NOT NULL
(like the current ==, in cases when one side of comparison is NULL), and to IS NOT DISTINCT FROM
or very verbose universal syntax? Such operator with similar purposes exists in JS, PHP, Dart and Julia.
I have taken this route in my postgres migration generator pet-project, because it feels like it makes most sense in case of SQL with default three-valued-logic: https://raw.githubusercontent.com/CertainLach/immigrant/master/crates/generator-postgres/examples/kitchen_sink.schema
BigQuery supports it: https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#is_distinct
Thanks, my mistake
It is not only the columns, it is any expression which differs from literal NULL:
Yes, great examples.
I agree implementing this feature for every database might be complicated, and the universally-supported syntax is not good, and might feel clunky in many cases, but maybe this comparison should not be the default behavior?
Maybe the compiler should display error when it sees obviously wrong comparison (Which is currently being translated to IS NULL/IS NOT NULL), and suggest to use other method for NULL checking?
Though how would be tell if a user is doing something obviously wrong?
Maybe even some different operator, I.e
===
, which will in turn translate toIS NULL
/IS NOT NULL
(like the current ==, in cases when one side of comparison is NULL), and toIS NOT DISTINCT FROM
or very verbose universal syntax? Such operator with similar purposes exists in JS, PHP, Dart and Julia.
Yup. the exact operator doesn't matter so much — I think this is just a tradeoff of "inherit small annoyances of SQL" vs. "try and create a better world, but then try and plug leaky abstractions"...
Though how would be tell if a user is doing something obviously wrong?
Well, If == will become completly not null-safe (Which it already is, it only handles null-safety in one case), then the code which is currently being translated to IS NULL
/ IS NOT NULL
should turn into errors, because in SQL both = NULL
and <> NULL
comparisons are useless
Error:
╭─[:3:8]
│
3 │ b == null
│ ─┬
│ ╰─── equality operator is not null-safe, using it to compare with null is incorrect.
│
│ Help: did you mean to use strict equality `===` operator?
───╯
then the code which is currently being translated to
IS NULL
/IS NOT NULL
should turn into errors, because in SQL both= NULL
and<> NULL
comparisons are useless
Oh yes totally! Agree if we change to another operator.
So I think we're left with the decisions between:
===
for identity and ==
for 3VL equality==
for both, but recognize that ==
will only work with nulls when there is a literal null
on one sideAny thoughts @PRQL/prql-team ?
Here is one of the past discussions I remember: #905
On a related note, does the PRQL need to consider the treatment of NaN? https://en.wikipedia.org/wiki/IEEE_754
As I tried to argue in https://github.com/PRQL/prql/issues/905#issuecomment-1866809983, I think having a representation of "missing" is key to data analysis and the default behaviour should be that null == null
does NOT evaluate to True. I did previously argue that special-casing (_, Eq | Ne, Literal::Null)
was worth keeping but that would go away with the introduction of a new operator like ===
. Therefore I think I largely agree with @CertainLach .
My preferred solution would be:
===
a === b <==> a IS NOT DISTINCT FROM b
a === b <==> CASE WHEN (a = b) or (a IS NULL AND b IS NULL) THEN 0 ELSE 1 END = 0
a == null
generate an errorThis would be a breaking change and the error should say that a === null
is now available for this purpose.
==
The reason being that this is key for data analysis and should be the default. For example I want to know when I've accidentally introduced NULLs in a LEFT JOIN and I don't want this silently eliminated unless I explicitly asked for it.
I too think that null == null
evaluated as true
would be confusing.
The following occurs:
let x = 1
let y = 2
let z = null
x == y # false
z == z # true
(x + z) == (y + z) # true!?
An except about null handling from page 246 in The Third Manifesto.
The result of the Tutorial D expression
T WHERE I ≠ SQL_INTEGER ('20')
certainly does include those tuples of T for which the I value is SQL_INTEGER ('NULL'), unlike its SQL counterpart:
SELECT * FROM T WHERE I <> 20
(The symbol “<>” here is SQL syntax for “≠”.) We venture to suggest that—at least in some cases—D expressions involving tables derived from SQL databases with nulls will give results that are more intuitively acceptable than those of SQL expressions.
Which suggest that the PRQL query:
from Cats
filter Name == "Garfield"
should translate into:
SELECT * FROM Cats WHERE Name = 'Garfield' OR Name IS NULL;
@vanillajonathan I think you have a typo in your example. I don't think what you've written should be true but I think that one could reasonably expect that
from Cats
filter Name != "Garfield"
should translate into:
SELECT * FROM Cats WHERE Name != 'Garfield' OR Name IS NULL;
For someone only familiar with boolean logic where the Law of the Excluded Middle holds, that would certainly be the expectation.
Stated differently, in such a world you would expect the following query to give you back the original dataset data
:
let data = [{a=0}, {a=1}, {a=null}]
let where_true = (from data | filter a==1)
let where_false = (from data | filter a!=1)
from where_true
append where_false
However if you run this in the playground you will find that that's not the case.
The introduction of "missingness" into data analysis takes the view that when an observation is missing, we can't say much about it. So if we have a cat for which we don't know the name, we can't say whether the name is "Garfield" or not, but likewise we can't guarantee that it is not "Garfield". Therefore we need to introduce a 3 Value Logic where propositions are either TRUE, FALSE, or NULL.
So in order to recover our orginal dataset data
we have to do the following:
let where_null = (from data | filter a==null)
from where_true
append where_false
append where_null
Another way to think about it is that if some of the missing data is filled in over time, you might expect your where_true
set to grow in that case but you don't want your where_false
set to shrink. That is because as your available information grows, you can expect additional propositions to become true, but previously true propositions shouldn't change their status and become false.
For example, say you have a table of mushroom species with an attribute is_poisonous
which is either TRUE, FALSE, or NULL in cases where it hasn't been determined yet.
let good_to_eat = (
from mushrooms
filter is_poisonous != true
)
You don't want your good_to_eat
set to shrink at a later time after you've already consumed some mushrooms from that set. Therefore it's better to err on the side of caution, both in NULL comparisons, and when consuming unknown mushrooms. 🍄
With the introduction of the ===
operator as proposed, you would recover a binary logic, and that is fine, since that's what you've explicitly asked for in that case. So there
from data
filter a===1
append (from data | filter !== 1)
should recover the original set data
.
I like the mushroom example. 👍️
Nullable booleans are really mean and perhaps shouldn't exist as they're better modeled as an enum with the variants Yes
, No
, and Undetermined
.
What happened?
PRQL is not consistent with null comparisons, it transforms
== null
and!= null
toIS NULL
andIS NOT NULL
, yet does not emit similar things for two column comparisons.Given that == is already null safe in some contexts, it should also be null-safe in others, I.e using
Sqlite's IS (https://www.sqlite.org/lang_expr.html#isisnot)it also supports standardIS NOT DISTINCT FROM
PRQL input
SQL output
Expected SQL output
MVCE confirmation
Anything else?
No response