nartal1 opened 5 days ago
select unix_timestamp('18961001', 'yyyyMMdd')
with config:
'spark.sql.legacy.timeParserPolicy': 'LEGACY',
'spark.rapids.sql.incompatibleDateFormats.enabled': True
with timezone:
America/Punta_Arenas
CPU: -2311528634
GPU: -2311528635
The diff is one second. Note: other timezones, like Asia/Shanghai and Iran, do not have this issue.
scala> import java.time._
import java.time._
scala> import org.apache.spark.sql.catalyst.util.DateTimeUtils
import org.apache.spark.sql.catalyst.util.DateTimeUtils
scala> val epochSeconds = LocalDateTime.of(1896,10,1,0,0,0).toInstant(ZoneOffset.UTC).getEpochSecond()
epochSeconds: Long = -2311545600
scala> val micros = epochSeconds * 1000000
micros: Long = -2311545600000000
scala> val expected = DateTimeUtils.convertTz(micros, ZoneId.of("America/Punta_Arenas"), ZoneId.of("UTC"))/1000000L
expected: Long = -2311528635 // this matches the GPU output
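As a cross-check outside Spark, the same expected value can be derived with plain `java.time`. A minimal sketch (the class and helper names below are invented for illustration, not plugin code): in 1896 tzdata records a sub-minute offset for America/Punta_Arenas (Santiago Mean Time, -04:42:46 in current tzdata), which is why discrepancies in this zone show up at the one-second level.

```java
import java.time.LocalDateTime;
import java.time.ZoneId;

public class PuntaArenasCheck {
    // UTC offset (in seconds) that the zone rules apply at a given
    // local wall-clock date at midnight.
    static int offsetSecondsAt(String zone, int y, int m, int d) {
        LocalDateTime local = LocalDateTime.of(y, m, d, 0, 0, 0);
        return ZoneId.of(zone).getRules().getOffset(local).getTotalSeconds();
    }

    // Epoch seconds for local midnight of the given date in the zone,
    // resolved directly through the zone rules.
    static long epochSecondsAt(String zone, int y, int m, int d) {
        LocalDateTime local = LocalDateTime.of(y, m, d, 0, 0, 0);
        return local.atZone(ZoneId.of(zone)).toEpochSecond();
    }

    public static void main(String[] args) {
        // Sub-minute historical offset: not a whole number of minutes.
        System.out.println(offsetSecondsAt("America/Punta_Arenas", 1896, 10, 1));
        // Resolving local midnight through the zone rules lands on the
        // CPU value from the repro above, one second away from the
        // convertTz result that the GPU matches.
        System.out.println(epochSecondsAt("America/Punta_Arenas", 1896, 10, 1));
    }
}
```

So `java.time`'s own local-to-instant conversion and Spark's `DateTimeUtils.convertTz` disagree by exactly one second for this zone and date, which is the size of the CPU/GPU diff.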
Save the value "1896-10-01" into a Parquet table, then run:
select unix_timestamp(col, 'yyyy-MM-dd') from tab
The results are correct:
CPU: -2311528635
GPU: -2311528635
This is a corner case in LEGACY mode; non-LEGACY mode does not have this problem.
Other timezones, like Asia/Shanghai and Iran, do not have this issue.
Debugged into Spark to see what happens in LEGACY mode.
Spark has different behavior between LEGACY and non-LEGACY mode:
Spark 3.3.0:
scala> spark.conf.set("spark.sql.session.timeZone", "America/Punta_Arenas")
scala> spark.conf.set("spark.sql.legacy.timeParserPolicy", "CORRECTED")
scala> spark.sql("select unix_timestamp('18961001', 'yyyyMMdd')").show()
+----------------------------------+
|unix_timestamp(18961001, yyyyMMdd)|
+----------------------------------+
| -2311528635|
+----------------------------------+
scala> spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
scala> spark.sql("select unix_timestamp('18961001', 'yyyyMMdd')").show()
+----------------------------------+
|unix_timestamp(18961001, yyyyMMdd)|
+----------------------------------+
| -2311535143|
+----------------------------------+
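Spark's LEGACY path is implemented on top of the old `java.text`/`java.util` date machinery (`SimpleDateFormat` and friends), while CORRECTED mode uses `java.time` formatters. A standalone sketch contrasting the two parse paths (an approximation of what Spark does, not Spark's actual code; the class and method names are invented here):

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.time.LocalDate;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;
import java.util.TimeZone;

public class LegacyVsCorrected {
    // Roughly the CORRECTED path: java.time parsing plus zone rules.
    static long parseJavaTime(String s, String zone) {
        return LocalDate.parse(s, DateTimeFormatter.ofPattern("yyyyMMdd"))
                .atStartOfDay(ZoneId.of(zone)).toEpochSecond();
    }

    // Roughly the LEGACY path: SimpleDateFormat with the old
    // java.util.TimeZone/Calendar machinery.
    static long parseLegacy(String s, String zone) {
        try {
            SimpleDateFormat sdf = new SimpleDateFormat("yyyyMMdd");
            sdf.setTimeZone(TimeZone.getTimeZone(zone));
            return sdf.parse(s).getTime() / 1000L;
        } catch (ParseException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // Both land within a few hours of 1896-10-01T00:00 UTC, but the
        // two APIs may resolve the historical offset differently, which
        // is the kind of divergence seen between the two modes above.
        System.out.println(parseJavaTime("18961001", "America/Punta_Arenas"));
        System.out.println(parseLegacy("18961001", "America/Punta_Arenas"));
    }
}
```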
We already documented that LEGACY mode has several limitations:
LEGACY timeParserPolicy support has the following limitations when running on the GPU:
Only 4-digit years are supported
The proleptic Gregorian calendar is used instead of the hybrid Julian+Gregorian calendar that Spark uses in legacy mode
When the format is yyyyMMdd, the GPU only supports 8-digit strings, while Spark also accepts 7-digit strings like 2024101.
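The yyyyMMdd width difference can be seen directly by comparing the two Java parse APIs (again a standalone sketch with invented names, not plugin code): `SimpleDateFormat`, which LEGACY mode builds on, is lenient about the width of the last field, while a `java.time` formatter requires fixed-width month and day fields.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

public class DigitWidth {
    // Legacy machinery: SimpleDateFormat, lenient about field widths.
    static boolean legacyAccepts(String s) {
        try {
            new SimpleDateFormat("yyyyMMdd").parse(s);
            return true;
        } catch (ParseException e) {
            return false;
        }
    }

    // Non-legacy machinery: java.time, fixed-width adjacent fields.
    static boolean javaTimeAccepts(String s) {
        try {
            LocalDate.parse(s, DateTimeFormatter.ofPattern("yyyyMMdd"));
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(legacyAccepts("20241001"));   // 8 digits
        System.out.println(legacyAccepts("2024101"));    // 7 digits
        System.out.println(javaTimeAccepts("20241001")); // 8 digits
        System.out.println(javaTimeAccepts("2024101"));  // 7 digits
    }
}
```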
The following nightly integration tests are failing:
Additional info on the failing tests: