Velox evaluates date_format(timestamp'12345-01-01 01:01:01', 'yyyy-MM') to '12345-01', whereas vanilla Spark evaluates the same expression to '+12345-01'. This is a problem because unix_timestamp in vanilla Spark only accepts '+12345-01' for this pattern. If date_format is executed in Velox and its result is passed to unix_timestamp in vanilla Spark, the query fails.
// Somehow CREATE TABLE doesn't work with five-digit year timestamps
spark.sql("select timestamp'12345-01-01 01:01:01' c").write.mode("overwrite").save("x")
spark.read.load("x").createOrReplaceTempView("t")
// date_format is run in Velox
spark.sql("select date_format(c, 'yyyy-MM') from t").explain()
// == Physical Plan ==
// VeloxColumnarToRowExec
// +- ^(14) ProjectExecTransformer [date_format(c#83, yyyy-MM, Some(Etc/UTC)) AS date_format(c, yyyy-MM)#85]
// +- ^(14) NativeFileScan parquet [c#83] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/ssd/chungmin/repos/spark34/x], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<c:timestamp>
// Use collect() instead of show(), as show() makes the function run in vanilla Spark in Spark 3.5 due to the inserted ToPrettyString function.
spark.sql("select date_format(c, 'yyyy-MM') from t").collect()
// Array([12345-01])
spark.sql("create table t2 as select date_format(c, 'yyyy-MM') c from t")
spark.sql("set spark.gluten.enabled = false")
spark.sql("select unix_timestamp(c, 'yyyy-MM') from t2").collect()
// 24/04/25 02:01:01 ERROR TaskResources: Task 8 failed by error:
// org.apache.spark.SparkUpgradeException: [INCONSISTENT_BEHAVIOR_CROSS_VERSION.PARSE_DATETIME_BY_NEW_PARSER] You may get a different result due to the upgrading to Spark >= 3.0:
// Fail to parse '12345-01' in the new parser. You can set "spark.sql.legacy.timeParserPolicy" to "LEGACY" to restore the behavior before Spark 3.0, or set to "CORRECTED" and treat it as an invalid datetime string.
// ...
Spark uses java.time.format.DateTimeFormatter for date_format.
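The sign prefix can be reproduced with the JDK alone, without Spark or Velox. The snippet below is a minimal sketch that assumes a plain DateTimeFormatter built with ofPattern; Spark constructs its formatter with additional settings, so this only illustrates the underlying JDK behavior. The value names are chosen for illustration.

```scala
import java.time.{LocalDateTime, YearMonth}
import java.time.format.DateTimeFormatter
import scala.util.Try

// Same pattern string as in the reproduction above (plain JDK formatter, an assumption).
val fmt = DateTimeFormatter.ofPattern("yyyy-MM")

// Formatting: a year that needs more than four digits gets a '+' prefix.
val fiveDigitYear = LocalDateTime.of(12345, 1, 1, 1, 1, 1).format(fmt)
val fourDigitYear = LocalDateTime.of(2024, 1, 1, 1, 1, 1).format(fmt)

// Parsing with the same formatter: the signed form is accepted,
// the unsigned five-digit form is rejected with a DateTimeParseException.
val signedParse = Try(YearMonth.parse("+12345-01", fmt))
val unsignedParse = Try(YearMonth.parse("12345-01", fmt))

println(fiveDigitYear)           // +12345-01
println(fourDigitYear)           // 2024-01
println(signedParse.isSuccess)   // true
println(unsignedParse.isSuccess) // false
```

The unsigned string is the shape Velox produces, which is consistent with the parse failure reported by unix_timestamp above.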
Backend
VL (Velox)
Bug description
java.time.format.DateTimeFormatter adds this '+' prefix in OpenJDK 1.8.0_402, 11.0.22, and 21.0.2 alike. The prefix does not appear to be documented for the class in general, but the documentation of some constants mentions that years outside 0000-9999 are prefixed with a positive or negative symbol.
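One constant whose documentation mentions the sign prefix is DateTimeFormatter.ISO_LOCAL_DATE. The following sketch shows its output for a year inside the 0000-9999 range and for years on either side of it; the value names are illustrative only.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Predefined ISO formatter; years outside 0000-9999 are written with a sign.
val iso = DateTimeFormatter.ISO_LOCAL_DATE

val inRangeDate = LocalDate.of(2024, 1, 1).format(iso)
val aboveRangeDate = LocalDate.of(12345, 1, 1).format(iso)
val belowRangeDate = LocalDate.of(-1, 1, 1).format(iso)

println(inRangeDate)    // 2024-01-01
println(aboveRangeDate) // +12345-01-01
println(belowRangeDate) // -0001-01-01
```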
Five-digit years should be extremely rare in real-world applications, but this discrepancy breaks Delta unit tests.
The issue occurs with Spark 3.4.2 and 3.5.1. I did not check older versions.
Spark version
None
Spark configurations
spark.plugins=org.apache.gluten.GlutenPlugin
spark.gluten.enabled=true
spark.gluten.sql.columnar.backend.lib=velox
spark.memory.offHeap.enabled=true
spark.memory.offHeap.size=28g
System information
Velox System Info v0.0.2
Commit: 45dc46a9dd8a4197876da4c661d856f73d31673f
CMake Version: 3.28.3
System: Linux-6.5.0-1018-azure
Arch: x86_64
C++ Compiler: /usr/bin/c++
C++ Compiler Version: 11.4.0
C Compiler: /usr/bin/cc
C Compiler Version: 11.4.0
CMake Prefix Path: /usr/local;/usr;/;/ssd/linuxbrew/.linuxbrew/Cellar/cmake/3.28.3;/usr/local;/usr/X11R6;/usr/pkg;/opt
Relevant logs
No response