Description
With ANSI off, when a TIMESTAMP column is cast to BYTE, the output from the spark-rapids plugin differs from that of Apache Spark 4.
Repro
Consider the following single-row dataframe containing a single timestamp. When read back through the plugin on Spark 4, we would expect a null row:
sql(" select timestamp('4106-11-27 08:07:45.336457') as t").write.mode("overwrite").parquet("/tmp/myth/repro")
spark.conf.set("spark.sql.ansi.enabled", false)
spark.read.parquet("/tmp/myth/repro").selectExpr("CAST(t AS BYTE)").show
On Apache Spark 4, this results in a null row:
+----+
|   t|
+----+
|NULL|
+----+
With the RAPIDS plugin, the result is non-null:
+---+
|  t|
+---+
| 81|
+---+
Expected behaviour
Note that the plugin's result matches the result from Spark 3.x. Spark 4's behaviour seems to be a departure from Spark 3.x.
Ideally, the plugin's behaviour would match that of the Spark version with which it's running.
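For reference, the value 81 is consistent with plain integer narrowing of the timestamp's epoch seconds. The sketch below is only an illustration of that arithmetic, not Spark's or the plugin's actual code. It assumes the session time zone is UTC (the report does not state it), and the names `wrapped` and `checked` are hypothetical, with `None` standing in for SQL NULL.

```scala
import java.time.{LocalDateTime, ZoneOffset}

// Assumption: the session time zone is UTC; a different zone shifts the epoch value.
val ts = LocalDateTime.parse("4106-11-27T08:07:45.336457")
val epochSeconds: Long = ts.toEpochSecond(ZoneOffset.UTC)

// Narrowing that keeps only the low 8 bits, which matches the 81 seen
// with the plugin (and, per the report, with Spark 3.x).
val wrapped: Byte = epochSeconds.toByte

// Overflow-checked narrowing: no value when the seconds do not fit in a byte.
// None stands in for the NULL that Spark 4 returns in the repro.
val checked: Option[Byte] =
  if (epochSeconds >= Byte.MinValue && epochSeconds <= Byte.MaxValue) Some(epochSeconds.toByte)
  else None

println(s"epochSeconds=$epochSeconds wrapped=$wrapped checked=$checked")
// prints: epochSeconds=67434192465 wrapped=81 checked=None
```

If this reading is right, the difference comes down to whether an out-of-range value is wrapped or turned into NULL when ANSI mode is off.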