Open revans2 opened 4 years ago
I filed https://issues.apache.org/jira/browse/SPARK-32110 to document what I have found in Spark.
Some findings when compared against Apache Hive 3.x:

- Both treat `0.0` and `-0.0` as equivalent, i.e. `0.0 = -0.0` is `TRUE`.
- In both, `SELECT 0.0 as a, -0.0 as b` selects `0.0` and `0.0`.
- Both can write `-0.0` into a file. E.g. `Seq((-0.0, 0.0)).toDF.write.orc()` writes distinct values.
- In Hive, an equijoin on `0.0` and `-0.0` from ORC-file sources does not match rows. Spark normalizes, and thus matches.
- `-0.0` < `0.0` in both. This is because neither normalizes on non-equijoins.

So in this regard, the only material difference between Hive and SparkSQL is that on equijoins, Hive does not normalize, and treats `-0.0` as distinct from `0.0`. It is consistent(ly wrong?) within itself. Spark normalizes, but only for equijoin.
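To make the equijoin difference concrete, here is a rough plain-Java sketch using a hash map as a stand-in for the build side of a hash join. `Double.equals` and `Double.hashCode` distinguish `-0.0` from `0.0`, so unnormalized keys do not match, while normalizing the key first makes them match. The class `ZeroJoinSketch` and its methods are hypothetical names for illustration only; this is not how Hive or Spark actually implement joins.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not Spark's or Hive's join code.
class ZeroJoinSketch {
    // Collapse -0.0 to 0.0; every other value passes through unchanged.
    static double normalize(double d) {
        return d == 0.0d ? 0.0d : d;
    }

    // Build a one-row "build side" keyed on buildKey, then probe it with probeKey.
    static boolean matches(double buildKey, double probeKey, boolean normalizeKeys) {
        Map<Double, String> buildSide = new HashMap<>();
        buildSide.put(normalizeKeys ? normalize(buildKey) : buildKey, "row");
        return buildSide.containsKey(normalizeKeys ? normalize(probeKey) : probeKey);
    }

    public static void main(String[] args) {
        // Without normalization the two zeros are different keys.
        System.out.println(matches(0.0, -0.0, false)); // false
        // With normalization they collapse to the same key.
        System.out.println(matches(0.0, -0.0, true));  // true
    }
}
```

The first call is the analogue of the non-normalizing equijoin, the second of the normalizing one.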
I filed https://github.com/rapidsai/cudf/issues/6834 in cudf so we can work around things with bit-wise operations if possible. I believe that we should be able to make comparisons and sort match Spark exactly. Joins are going to be much harder, but we still might be able to do it. We need to be very careful with this, though. -0.0 and the various NaN values are rather rare in real life. I am not sure it is worth the added performance cost to do this for sort, and I am especially concerned about what it would take to make the join work.
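As one possible shape of such a bit-wise workaround, the sketch below maps a double to a `long` whose signed integer order matches the order of Java's `Double.compare` (`-0.0` before `0.0`, NaN last). The class and method names are hypothetical, and this is not taken from the cudf issue or from Spark; it only shows that the ordering can be reproduced with integer operations.

```java
// Hypothetical helper; an assumption for illustration, not cudf or Spark code.
class SortableDoubleKey {
    // Returns a long whose signed order matches Double.compare's order.
    static long sortableBits(double d) {
        // doubleToLongBits collapses every NaN to one canonical bit pattern.
        long bits = Double.doubleToLongBits(d);
        // For negative values (sign bit set) flip the lower 63 bits so that
        // larger magnitudes sort first; positive values are left unchanged.
        return bits ^ ((bits >> 63) & Long.MAX_VALUE);
    }

    public static void main(String[] args) {
        double[] values = {Double.NaN, 0.0, -0.0, -1.0, Double.NEGATIVE_INFINITY, 2.5};
        for (double a : values) {
            for (double b : values) {
                int expected = Integer.signum(Double.compare(a, b));
                int actual = Integer.signum(Long.compare(sortableBits(a), sortableBits(b)));
                if (expected != actual) {
                    throw new AssertionError(a + " vs " + b);
                }
            }
        }
        System.out.println("integer order matches Double.compare for the sample values");
    }
}
```

Whether the extra transformation is worth its cost in a sort is exactly the open question raised above.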
This is related to #84 and is a superset of it.
Spark is a bit of a hot mess with support for floating point -0.0
Most SQL implementations normalize `-0.0` to `0.0`. Spark does this for the SQL parser, but not for the DataFrame API. Spark also violates the IEEE spec, in that it treats `-0.0` != `0.0`. This is because Java's `Double.compare` and `Float.compare` treat `-0.0` as < `0.0`.
This is true everywhere except for a few cases: equi-joins and hash aggregate keys. Hive does not make these exceptions; it always assumes that they are different.
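The mismatch between the IEEE `==` operator and Java's `compare` methods can be checked directly. The class name `ZeroCompareDemo` and its helper methods are made up for this sketch; the `Double` and `Float` calls are standard JDK methods.

```java
// Small check of how Java's primitives treat the two zeros.
class ZeroCompareDemo {
    // IEEE-style equality on primitives: the two zeros are equal.
    static boolean ieeeEqual(double a, double b) {
        return a == b;
    }

    // Total ordering used by Double.compare: -0.0 sorts before 0.0.
    static int javaCompare(double a, double b) {
        return Double.compare(a, b);
    }

    public static void main(String[] args) {
        System.out.println(ieeeEqual(-0.0, 0.0));                // true
        System.out.println(javaCompare(-0.0, 0.0) < 0);          // true
        System.out.println(Float.compare(-0.0f, 0.0f) < 0);      // true
        // Boxed equality follows the compare-style semantics as well.
        System.out.println(Double.valueOf(-0.0).equals(0.0));    // false
    }
}
```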
cudf follows IEEE, where they always end up being the same. This causes issues in sort, comparison operators, and joins that are not equijoins.
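A small Java sketch of how the two conventions can produce different sort output for the same input. The `IEEE` comparator below is only a stand-in for an IEEE-style comparison (it is not cudf code and does not handle NaN), and `ZeroSortDemo` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration; not cudf or Spark code.
class ZeroSortDemo {
    // IEEE-style ordering: -0.0 and 0.0 compare as equal. NaN is not handled.
    static final Comparator<Double> IEEE = (a, b) -> a < b ? -1 : (a > b ? 1 : 0);

    // Java ordering: -0.0 is strictly less than 0.0.
    static final Comparator<Double> JAVA = Double::compare;

    // Returns a sorted copy; List.sort is stable, so ties keep input order.
    static List<Double> sorted(List<Double> input, Comparator<Double> cmp) {
        List<Double> out = new ArrayList<>(input);
        out.sort(cmp);
        return out;
    }

    public static void main(String[] args) {
        List<Double> input = Arrays.asList(0.0, -0.0, -1.0);
        System.out.println(sorted(input, IEEE)); // [-1.0, 0.0, -0.0]
        System.out.println(sorted(input, JAVA)); // [-1.0, -0.0, 0.0]
    }
}
```

With the IEEE-style comparator the relative order of the two zeros depends on the input order, which is why the results can disagree row for row with a sort based on `Double.compare`.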
I will file something against Spark, but I don't have high hopes that anything will be fixed.