[VL] date_format returns wrong results

Backend

VL (Velox)

Bug description

Velox evaluates date_format(timestamp'12345-01-01 01:01:01', 'yyyy-MM') to '12345-01', whereas vanilla Spark evaluates the same expression to '+12345-01'. This matters because unix_timestamp in vanilla Spark only accepts the signed form '+12345-01' for this pattern. If date_format runs in Velox and its result is later passed to unix_timestamp in vanilla Spark, the parse fails.

// Somehow CREATE TABLE doesn't work with five-digit year timestamps
spark.sql("select timestamp'12345-01-01 01:01:01' c").write.mode("overwrite").save("x")
spark.read.load("x").createOrReplaceTempView("t")

// date_format is run in Velox
spark.sql("select date_format(c, 'yyyy-MM') from t").explain()
// == Physical Plan ==
// VeloxColumnarToRowExec
// +- ^(14) ProjectExecTransformer [date_format(c#83, yyyy-MM, Some(Etc/UTC)) AS date_format(c, yyyy-MM)#85]
//    +- ^(14) NativeFileScan parquet [c#83] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/ssd/chungmin/repos/spark34/x], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<c:timestamp>

// Use collect() instead of show(), as show() makes the function run in vanilla Spark in Spark 3.5 due to the inserted ToPrettyString function.
spark.sql("select date_format(c, 'yyyy-MM') from t").collect()
// Array([12345-01])

spark.sql("create table t2 as select date_format(c, 'yyyy-MM') c from t")
spark.sql("set spark.gluten.enabled = false")
spark.sql("select unix_timestamp(c, 'yyyy-MM') from t2").collect()
// 24/04/25 02:01:01 ERROR TaskResources: Task 8 failed by error:
// org.apache.spark.SparkUpgradeException: [INCONSISTENT_BEHAVIOR_CROSS_VERSION.PARSE_DATETIME_BY_NEW_PARSER] You may get a different result due to the upgrading to Spark >= 3.0:
// Fail to parse '12345-01' in the new parser. You can set "spark.sql.legacy.timeParserPolicy" to "LEGACY" to restore the behavior before Spark 3.0, or set to "CORRECTED" and treat it as an invalid datetime string.
// ...
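
The parse failure can be reproduced outside Spark with plain java.time. This is only a sketch: it assumes unix_timestamp's new parser is built on java.time.format.DateTimeFormatter, and it uses a bare ofPattern formatter rather than the formatter Spark actually constructs. The object name FiveDigitYearParse and the helper parses are made up for illustration.

```scala
import java.time.format.DateTimeFormatter
import scala.util.Try

// Hypothetical helper, not part of Spark or Gluten.
object FiveDigitYearParse {
  // Plain java.time formatter; Spark builds its own with extra settings.
  private val fmt: DateTimeFormatter = DateTimeFormatter.ofPattern("yyyy-MM")

  // True if the string parses with the "yyyy-MM" pattern.
  def parses(s: String): Boolean = Try(fmt.parse(s)).isSuccess
}

// Unsigned five-digit year (what Velox produces) is rejected.
println(FiveDigitYearParse.parses("12345-01"))   // false
// Signed five-digit year (what vanilla Spark produces) is accepted.
println(FiveDigitYearParse.parses("+12345-01"))  // true
// Four-digit years must not carry a sign.
println(FiveDigitYearParse.parses("2024-01"))    // true
println(FiveDigitYearParse.parses("+2024-01"))   // false
```

With "yyyy", the parser requires the '+' exactly when the year has more than four digits, so the unsigned Velox output cannot round-trip.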

Spark uses java.time.format.DateTimeFormatter for date_format.

import java.time.{LocalDate, ZoneId}
import java.time.format.DateTimeFormatter

DateTimeFormatter.ofPattern("yyyy").withZone(ZoneId.of("Z")).format(LocalDate.of(12345, 1, 1))
// "+12345"

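OpenJDK 1.8.0_402, 11.0.22, and 21.0.2 all behave the same. The sign prefix is easy to miss in the documentation: the "Year" note in the DateTimeFormatter pattern description says that with four or more pattern letters the sign is output once the pad width is exceeded (SignStyle.EXCEEDS_PAD), and some predefined constants mention that years outside 0000-9999 get a prefixed positive or negative symbol.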
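
To make the sign rule concrete, the following sketch compares the "yyyy" pattern with two explicitly built formatters. The object name YearSignStyles and its fields are illustrative only; the java.time calls are standard JDK API. It shows what a matching native implementation would need to emit, not how Velox or Spark implement date_format.

```scala
import java.time.LocalDate
import java.time.format.{DateTimeFormatter, DateTimeFormatterBuilder, SignStyle}
import java.time.temporal.ChronoField

// Hypothetical demo object, not part of Spark or Gluten.
object YearSignStyles {
  val date: LocalDate = LocalDate.of(12345, 1, 1)

  // The pattern form used in the issue.
  val viaPattern: DateTimeFormatter = DateTimeFormatter.ofPattern("yyyy")

  // Explicit form: min width 4, sign printed once the width is exceeded.
  val exceedsPad: DateTimeFormatter = new DateTimeFormatterBuilder()
    .appendValue(ChronoField.YEAR_OF_ERA, 4, 19, SignStyle.EXCEEDS_PAD)
    .toFormatter()

  // Same widths, but a sign is printed only for negative values.
  val normal: DateTimeFormatter = new DateTimeFormatterBuilder()
    .appendValue(ChronoField.YEAR_OF_ERA, 4, 19, SignStyle.NORMAL)
    .toFormatter()
}

println(YearSignStyles.viaPattern.format(YearSignStyles.date)) // +12345
println(YearSignStyles.exceedsPad.format(YearSignStyles.date)) // +12345
println(YearSignStyles.normal.format(YearSignStyles.date))     // 12345
// Four-digit years fit the pad width, so no sign is added.
println(YearSignStyles.viaPattern.format(LocalDate.of(2024, 1, 1))) // 2024
```

The SignStyle.NORMAL output matches what Velox currently returns, which suggests the difference comes down to the sign rule for years of five or more digits.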

Five-digit years should be extremely rare in real-world applications, but this mismatch is breaking Delta unit tests.

The issue occurs with Spark 3.4.2 and 3.5.1. I did not check older versions.

Spark version

None

Spark configurations

spark.plugins=org.apache.gluten.GlutenPlugin
spark.gluten.enabled=true
spark.gluten.sql.columnar.backend.lib=velox
spark.memory.offHeap.enabled=true
spark.memory.offHeap.size=28g

System information

Velox System Info v0.0.2
Commit: 45dc46a9dd8a4197876da4c661d856f73d31673f
CMake Version: 3.28.3
System: Linux-6.5.0-1018-azure
Arch: x86_64
C++ Compiler: /usr/bin/c++
C++ Compiler Version: 11.4.0
C Compiler: /usr/bin/cc
C Compiler Version: 11.4.0
CMake Prefix Path: /usr/local;/usr;/;/ssd/linuxbrew/.linuxbrew/Cellar/cmake/3.28.3;/usr/local;/usr/X11R6;/usr/pkg;/opt

Relevant logs

No response
