NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

[FEA] Implement kernel to support non-UTC time zone for LEGACY mode. #11562

Open pxLi opened 4 days ago

pxLi commented 4 days ago

Describe the bug

This is a failing non-UTC test in 24.10, first seen in rapids_it-non-utc-pre_release, run:123

FAILED ../../src/main/python/date_time_test.py::test_yyyyMMdd_format_for_legacy_mode[DATAGEN_SEED=1728130368, TZ=Africa/Casablanca] - AssertionError: GPU and CPU int values are different at [14, 'unix_timestamp(a, yyyyMMdd)']

cpu = 231388876800, gpu = 231388873200

@pytest.mark.skipif(not is_supported_time_zone(), reason="not all time zones are supported now, refer to https://github.com/NVIDIA/spark-rapids/issues/6839, please update after all time zones are supported")
    # Test years after 1900, refer to issues: https://github.com/NVIDIA/spark-rapids/issues/11543, https://github.com/NVIDIA/spark-rapids/issues/11539
    def test_yyyyMMdd_format_for_legacy_mode():
        gen = StringGen('(19[0-9]{2}|[2-9][0-9]{3})([0-9]{4})')
>       assert_gpu_and_cpu_are_equal_sql(
            lambda spark : unary_op_df(spark, gen),
            "tab",
            '''select unix_timestamp(a, 'yyyyMMdd'),
                      from_unixtime(unix_timestamp(a, 'yyyyMMdd'), 'yyyyMMdd'),
                      date_format(to_timestamp(a, 'yyyyMMdd'), 'yyyyMMdd')
               from tab
            ''',
            {'spark.sql.legacy.timeParserPolicy': 'LEGACY',
             'spark.rapids.sql.incompatibleDateFormats.enabled': True})

../../src/main/python/date_time_test.py:466: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../../src/main/python/asserts.py:641: in assert_gpu_and_cpu_are_equal_sql
    assert_gpu_and_cpu_are_equal_collect(do_it_all, conf, is_cpu_first=is_cpu_first)
../../src/main/python/asserts.py:599: in assert_gpu_and_cpu_are_equal_collect
    _assert_gpu_and_cpu_are_equal(func, 'COLLECT', conf=conf, is_cpu_first=is_cpu_first, result_canonicalize_func_before_compare=result_canonicalize_func_before_compare)
../../src/main/python/asserts.py:521: in _assert_gpu_and_cpu_are_equal
    assert_equal(from_cpu, from_gpu)
../../src/main/python/asserts.py:111: in assert_equal
    _assert_equal(cpu, gpu, float_check=get_float_check(), path=[])
../../src/main/python/asserts.py:43: in _assert_equal
    _assert_equal(cpu[index], gpu[index], float_check, path + [index])
../../src/main/python/asserts.py:36: in _assert_equal
    _assert_equal(cpu[field], gpu[field], float_check, path + [field])
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

cpu = 231388876800, gpu = 231388873200
float_check = <function get_float_check.<locals>.<lambda> at 0x7f7b98cebbe0>
path = [14, 'unix_timestamp(a, yyyyMMdd)']
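
As a quick sanity check on the values above: the CPU and GPU results differ by exactly one hour, which points at a time-zone offset discrepancy rather than a digit-parsing difference.

// Values copied from the assertion failure above
val cpu = 231388876800L
val gpu = 231388873200L
println(cpu - gpu) // 3600 seconds == 1 hour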


Expected behavior

Pass or ignore the case.



pxLi commented 4 days ago

Also cc @res-life to help; this case is still unstable in the non-UTC environment.

res-life commented 2 days ago

Spark behavior:

scala> spark.conf.set("spark.sql.session.timeZone", "Africa/Casablanca")

scala> spark.conf.set("spark.sql.legacy.timeParserPolicy", "CORRECTED")

scala> spark.sql("select unix_timestamp('42481005', 'yyyyMMdd')").show()
+----------------------------------+
|unix_timestamp(42481005, yyyyMMdd)|
+----------------------------------+
|                       71910716400|
+----------------------------------+

scala> 

scala> spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")

scala> spark.sql("select unix_timestamp('42481005', 'yyyyMMdd')").show()
+----------------------------------+
|unix_timestamp(42481005, yyyyMMdd)|
+----------------------------------+
|                       71910720000|
+----------------------------------+
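
An observation that follows from the numbers above: the LEGACY and CORRECTED results also differ by exactly one hour, the same gap as between the CPU and CPU-vs-GPU values in the failing test, which is consistent with the GPU producing the CORRECTED-style value.

// Values copied from the shell session above
val corrected = 71910716400L
val legacy    = 71910720000L
println(legacy - corrected) // 3600 seconds == 1 hour, same as the CPU/GPU gap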

GPU limitation

The GPU kernel is consistent with CORRECTED mode; it does not fully support LEGACY mode.

Follow-up

For LEGACY mode, we need to implement a kernel matching Spark's LEGACY-mode code (Link). That code uses SimpleDateFormat:

class LegacySimpleTimestampFormatter(
    pattern: String,
    zoneId: ZoneId,
    locale: Locale,
    lenient: Boolean = true) extends TimestampFormatter {
  @transient private lazy val sdf = {
    val formatter = new SimpleDateFormat(pattern, locale)
    formatter.setTimeZone(TimeZone.getTimeZone(zoneId))
    formatter.setLenient(lenient)
    formatter
  }

  override def parse(s: String): Long = {
    fromJavaTimestamp(new Timestamp(sdf.parse(s).getTime))
  }
  // ...
}
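
For reference, here is a minimal standalone sketch (an illustration only, not the plugin's implementation) of the LEGACY-style parsing the new kernel would need to reproduce; it uses the same SimpleDateFormat setup as the formatter above, with Locale.US assumed in place of the session locale:

import java.text.SimpleDateFormat
import java.util.{Locale, TimeZone}

// Parse a timestamp string the way LEGACY mode does: a lenient SimpleDateFormat
// bound to the session time zone, returning epoch seconds. The real Spark path
// additionally goes through fromJavaTimestamp / rebaseJulianToGregorianMicros,
// which matters for pre-Gregorian dates.
def legacyParseSeconds(s: String, pattern: String, zone: String): Long = {
  val sdf = new SimpleDateFormat(pattern, Locale.US) // Locale.US is an assumption for this sketch
  sdf.setTimeZone(TimeZone.getTimeZone(zone))
  sdf.setLenient(true)
  sdf.parse(s).getTime / 1000L
}

// Example with the value from the shell session above:
// legacyParseSeconds("42481005", "yyyyMMdd", "Africa/Casablanca")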

Workaround

Disable this case for branch 24.10 when TZ is not UTC or Asia/Shanghai. Update the documentation to clarify that not all non-DST (daylight saving time) time zones are supported; only the Asia/Shanghai time zone has been tested.

res-life commented 2 days ago

Let's use this issue to track support for the Africa/Casablanca time zone in LEGACY mode.

pxLi commented 2 days ago

Thanks! Retargeted the issue to 24.12.

revans2 commented 13 hours ago

We do not want to implement a kernel just for Africa/Casablanca unless we implement it for all time zones!!!

https://github.com/NVIDIA/spark-rapids/issues/6839

We picked Africa/Casablanca as an arbitrary time zone. It is one of the more complicated ones in terms of rules, but it is not there because a customer needs it. If we are going to develop a solution to a problem like this, we want it to be a general-purpose solution.

res-life commented 2 hours ago

Issue https://github.com/NVIDIA/spark-rapids/issues/6839 is for non-LEGACY mode, while this requirement is for LEGACY mode, so it is a separate requirement.

For LEGACY mode, from the code, it invokes LegacySimpleTimestampFormatter and gets a java.sql.Timestamp, then invokes fromJavaTimestamp and rebaseJulianToGregorianMicros. So this will need to be a separate kernel.

Spark has different behaviors between LEGACY mode and non-LEGACY mode.

res-life commented 2 hours ago

I think the priority for this issue is low.