NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

[BUG] [Spark 4] Invalid results from Casting timestamps to integral types #11555

Open mythrocks opened 1 month ago

Description

With ANSI off, when a TIMESTAMP column is cast to BYTE, the output from the spark-rapids plugin differs from that of Apache Spark 4.

Repro

Consider the following single-row dataframe containing a single timestamp. When it is read back on Spark 4 and cast to BYTE, the expected result is a null row:

sql(" select timestamp('4106-11-27 08:07:45.336457') as t").write.mode("overwrite").parquet("/tmp/myth/repro")

spark.conf.set("spark.sql.ansi.enabled", false)

spark.read.parquet("/tmp/myth/repro").selectExpr("CAST(t AS BYTE)").show

On Apache Spark 4, this results in a null row:

+----+
|   t|
+----+
|NULL|
+----+

With the RAPIDS plugin, the result is non-null:

+---+
|  t|
+---+
| 81|
+---+
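The value 81 is consistent with a wrap-around narrowing of the timestamp's epoch seconds. The following is a minimal sketch of that arithmetic, not code taken from Spark or the plugin. It assumes the session time zone is UTC (the issue does not state it), and the object name `WrapAroundSketch` is hypothetical.

```scala
import java.time.{LocalDateTime, ZoneOffset}

object WrapAroundSketch {
  // Assumption: the session time zone is UTC; the issue does not say.
  // The fractional seconds (.336457) are dropped, as they would be by
  // flooring a positive timestamp to whole seconds.
  val epochSeconds: Long =
    LocalDateTime.of(4106, 11, 27, 8, 7, 45).toEpochSecond(ZoneOffset.UTC)

  // Narrowing a Long to a Byte keeps only the low 8 bits (wrap-around).
  val wrapped: Byte = epochSeconds.toByte

  def main(args: Array[String]): Unit = {
    println(s"epoch seconds = $epochSeconds")
    println(s"low 8 bits    = $wrapped")
  }
}
```

Under the UTC assumption the epoch-seconds value is 67434192465, far outside the BYTE range of -128 to 127, and its low 8 bits are 81, which matches the plugin's output above.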

Expected behaviour

Note that the plugin's result matches the result from Spark 3.x. Spark 4's behaviour seems to be a departure from Spark 3.x. Ideally, the plugin's behaviour would match that of the Spark version with which it's running.
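To make the two behaviours concrete, here is a hedged sketch of the semantics described in this issue, written as plain Scala functions. All names (`CastSemanticsSketch`, `castWrapping`, `castNullOnOverflow`, `castForSparkMajor`) are hypothetical and are not the plugin's or Spark's actual API. The input value assumes the repro timestamp is interpreted in UTC.

```scala
object CastSemanticsSketch {
  // Wrap-around narrowing: the result this issue attributes to
  // Spark 3.x and to the plugin.
  def castWrapping(seconds: Long): Option[Byte] =
    Some(seconds.toByte)

  // Null on overflow: the result this issue reports for Spark 4.
  // None stands in for SQL NULL.
  def castNullOnOverflow(seconds: Long): Option[Byte] =
    if (seconds >= Byte.MinValue && seconds <= Byte.MaxValue) Some(seconds.toByte)
    else None

  // Hypothetical dispatch on the Spark major version in use.
  def castForSparkMajor(major: Int, seconds: Long): Option[Byte] =
    if (major >= 4) castNullOnOverflow(seconds) else castWrapping(seconds)

  def main(args: Array[String]): Unit = {
    // Epoch seconds of 4106-11-27 08:07:45, assuming UTC.
    val seconds = 67434192465L
    println(castForSparkMajor(3, seconds)) // wrap-around result
    println(castForSparkMajor(4, seconds)) // null-on-overflow result
  }
}
```

Selecting the semantics by Spark version is one way to read "match the Spark version with which it's running"; how the plugin would actually implement this is not specified in the issue.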