apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
5.47k stars 1.01k forks source link

Make modulos with negative float zero compat with other engines #11051

Open comphead opened 1 week ago

comphead commented 1 week ago

Describe the bug

DF returns NaN in modulo query below and is not compliant with major engines:

> select 1 % -0.0;
+------------------------+
| Int64(1) % Float64(-0) |
+------------------------+
| NaN                    |
+------------------------+

PG, Trino fails on such query, DuckDB, Spark returns NULL

@alamb @viirya @andygrove what would be expected behavior from DataFusion? We can also introduce a config to be PG compliant or PG/Trino compliant?

To Reproduce

No response

Expected behavior

No response

Additional context

No response

parthchandra commented 1 week ago

Surprising that Postgres fails. According to the docs, Infinity, -Infinity, and NaN are legitimate floating point values. This is also the expected behavior per the ANSI SQL standard, afaik. NULL is not the right thing to return here.

alamb commented 1 week ago

Postgres fails:

andrewlamb@Andrews-MacBook-Pro-2:~/Software/arrow-rs$ psql -h localhost -U postgres
psql (14.12 (Homebrew), server 16.1 (Debian 16.1-1.pgdg120+1))
WARNING: psql major version 14, server major version 16.
         Some psql features might not work.
Type "help" for help.

postgres=# select 1 % -0.0;
ERROR:  division by zero
alamb commented 1 week ago

I don't have a strong preference one way or the other, personally

comphead commented 1 week ago

I don't have a strong preference one way or the other, personally

I would go with Spark/DuckDB way :) if there is no other strong opinions, I'll make a fix soon

parthchandra commented 1 week ago
ERROR:  division by zero

I wasn't doubting it. :) But I am surprised.

@comphead I'll take back my 'NULL is not the right thing to return here' comment. It's better than failing for most users. Just one thing to keep in mind - Spark sql has a isnan function which should be consistent with this.

comphead commented 6 days ago

Thanks @parthchandra for the suggestion, I just checked we should be fine.

scala> spark.sql("select 1 % -0.0").show(false)
+---------+
|(1 % 0.0)|
+---------+
|NULL     |
+---------+

scala> spark.sql("select isnan(1 % -0.0)").show(false)
+----------------+
|isnan((1 % 0.0))|
+----------------+
|false           |
+----------------+
jonahgao commented 5 days ago

arrow-rs follows the rules of IEEE 754. If we intend to be compatible with other engines, many cases will also need to be modified, such as division by zero.

DataFusion CLI v39.0.0
> select 1.0/0.0;
+-------------------------+
| Float64(1) / Float64(0) |
+-------------------------+
| inf                     |
+-------------------------+

> select 0.0/0.0;
+-------------------------+
| Float64(0) / Float64(0) |
+-------------------------+
| NaN                     |
+-------------------------+

In PostgreSQL:

postgres=# select 1.0/0.0;
ERROR:  division by zero
postgres=# select 0.0/0.0;
ERROR:  division by zero
comphead commented 5 days ago

This is a good point @jonahgao so this gives me even more confidence to introduce a new DF param which controls should DF go with IEEE 754