NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

[BUG] -0.0 vs 0.0 is a hot mess #294

Open revans2 opened 4 years ago

revans2 commented 4 years ago

This is related to #84 and is a superset of it.

Spark is a bit of a hot mess with support for floating point -0.0

Most SQL implementations normalize -0.0 to 0.0. Spark does this in the SQL parser, but not in the DataFrame API. Spark also violates the IEEE 754 spec in that -0.0 != 0.0, because Java's Double.compare and Float.compare treat -0.0 as < 0.0.
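To make the inconsistency concrete, here is a small sketch showing the split behavior in Java itself: the primitive `==` follows IEEE 754, while the boxed comparators impose a total order that distinguishes the two zeros (and the bit patterns differ, which is what hash-based operations see):

```java
public class NegativeZeroDemo {
    public static void main(String[] args) {
        // Primitive comparison follows IEEE 754: -0.0 and 0.0 are equal.
        System.out.println(-0.0 == 0.0);                 // true

        // Double.compare / Float.compare impose a total order in which
        // -0.0 sorts strictly before 0.0.
        System.out.println(Double.compare(-0.0, 0.0));   // -1
        System.out.println(Float.compare(-0.0f, 0.0f));  // -1

        // The two values have different raw bit patterns, which is why
        // hash- and bit-based operations can tell them apart.
        System.out.println(Double.doubleToRawLongBits(-0.0)
                == Double.doubleToRawLongBits(0.0));     // false
    }
}
```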

This is true everywhere except for a few cases: equi-join keys and hash aggregate keys, where Spark does normalize. Hive does not normalize in those cases; it always treats the two values as different.

cudf follows IEEE 754, so -0.0 and 0.0 always compare as equal. This causes issues in sorts, comparison operators, and joins that are not equi-joins.

I will file something against Spark, but I don't have high hopes that anything will be fixed.

revans2 commented 4 years ago

I filed https://issues.apache.org/jira/browse/SPARK-32110 to document what I have found in spark.

mythrocks commented 4 years ago

Some findings when compared against Apache Hive 3.x:

  1. Literals: Both the Hive CLI and Spark SQL treat the literals 0.0 and -0.0 as equivalent, i.e. 0.0 = -0.0 is TRUE. SELECT 0.0 AS a, -0.0 AS b selects 0.0 and 0.0.
  2. From data sources/files: The Spark REPL (and Scala generally, I'm guessing) treats the same literals as distinct. We can use this to write -0.0 into a file. E.g. Seq((-0.0, 0.0)).toDF.write.orc() writes distinct values.
  3. Equi-join: Hive 3 does not normalize float/double. Joining 0.0 and -0.0 from ORC-file sources does not match rows. Spark normalizes, and thus matches.
  4. Inequality joins: Both Hive 3 and Spark SQL 3 match on -0.0 < 0.0, because neither normalizes on non-equi-joins.

So in this regard, the only material difference between Hive and Spark SQL is that on equi-joins, Hive does not normalize and treats -0.0 as distinct from 0.0. It is consistent(ly wrong?) within itself. Spark normalizes, but only for equi-joins.
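The equi-join difference comes down to whether the key is normalized before it is hashed. The sketch below is illustrative only (it is not Spark's actual implementation; the `normalize` helper is hypothetical), but it shows the mechanism: without normalization the two zeros hash differently and a hash join cannot match them, and after mapping -0.0 to 0.0 (and, typically, all NaNs to one canonical NaN) the keys collide as expected:

```java
public class JoinKeyNormalization {
    // Hypothetical key-normalization step, assumed to run before a key is
    // hashed for an equi-join or hash aggregate. Not Spark's actual code.
    static double normalize(double d) {
        if (Double.isNaN(d)) return Double.NaN; // one canonical NaN
        if (d == 0.0) return 0.0;               // maps -0.0 to +0.0
        return d;
    }

    public static void main(String[] args) {
        // Without normalization the keys hash differently,
        // so a hash-based equi-join would not match them...
        System.out.println(Double.hashCode(-0.0) == Double.hashCode(0.0)); // false

        // ...after normalization the keys collide, so the join matches.
        System.out.println(Double.hashCode(normalize(-0.0))
                == Double.hashCode(normalize(0.0)));                       // true
    }
}
```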

revans2 commented 4 years ago

I filed https://github.com/rapidsai/cudf/issues/6834 in cudf so we can work around things with bit-wise operations if possible. I believe that we should be able to make comparisons and sort match Spark exactly. Joins are going to be much harder, but we still might be able to do them. We need to be very careful with this, though: -0.0 and the various NaN values are rather rare in real life. I am not sure the added performance cost for sort is worth it, and for joins I am especially concerned about what it would take to make them work.
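One bit-wise trick along these lines (a sketch of the general technique, not the cudf API, and the `orderedBits` name is made up): remap each double's raw bits to a long whose signed ordering matches Java's total order, by flipping the magnitude bits of negative values. Under this mapping -0.0 lands strictly before 0.0, as Double.compare requires; non-canonical NaN bit patterns would still need separate handling:

```java
public class TotalOrderBits {
    // Map a double's raw bits to a long whose *signed* ordering matches the
    // Double.compare total order for ordinary values and the two zeros:
    //   - non-negative bit patterns are kept as-is;
    //   - negative bit patterns get their magnitude bits flipped, so larger
    //     negative doubles map to larger (less negative) longs.
    // Non-canonical NaN payloads are not canonicalized here.
    static long orderedBits(double d) {
        long bits = Double.doubleToRawLongBits(d);
        return bits >= 0 ? bits : bits ^ 0x7FFFFFFFFFFFFFFFL;
    }

    public static void main(String[] args) {
        // -0.0 now sorts strictly before 0.0, matching Double.compare.
        System.out.println(Long.compare(orderedBits(-0.0), orderedBits(0.0)) < 0);  // true
        // Ordinary values keep their expected order.
        System.out.println(Long.compare(orderedBits(-1.5), orderedBits(1.5)) < 0);  // true
    }
}
```

The appeal of this approach on a GPU is that the remap is a branch-light integer transform per element, after which plain integer comparisons drive the sort or comparison kernel.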