NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

[FEA] Implement kernel to support non-UTC time zone for LEGACY mode. #11562

Open pxLi opened 4 days ago

pxLi commented 4 days ago

Describe the bug

This is a failing non-UTC test in 24.10, first seen in rapids_it-non-utc-pre_release, run:123

FAILED ../../src/main/python/date_time_test.py::test_yyyyMMdd_format_for_legacy_mode[DATAGEN_SEED=1728130368, TZ=Africa/Casablanca] - AssertionError: GPU and CPU int values are different at [14, 'unix_timestamp(a, yyyyMMdd)']

cpu = 231388876800, gpu = 231388873200

@pytest.mark.skipif(not is_supported_time_zone(), reason="not all time zones are supported now, refer to https://github.com/NVIDIA/spark-rapids/issues/6839, please update after all time zones are supported")
    # Test years after 1900, refer to issues: https://github.com/NVIDIA/spark-rapids/issues/11543, https://github.com/NVIDIA/spark-rapids/issues/11539
    def test_yyyyMMdd_format_for_legacy_mode():
        gen = StringGen('(19[0-9]{2}|[2-9][0-9]{3})([0-9]{4})')
>       assert_gpu_and_cpu_are_equal_sql(
            lambda spark : unary_op_df(spark, gen),
            "tab",
            '''select unix_timestamp(a, 'yyyyMMdd'),
                      from_unixtime(unix_timestamp(a, 'yyyyMMdd'), 'yyyyMMdd'),
                      date_format(to_timestamp(a, 'yyyyMMdd'), 'yyyyMMdd')
               from tab
            ''',
            {'spark.sql.legacy.timeParserPolicy': 'LEGACY',
             'spark.rapids.sql.incompatibleDateFormats.enabled': True})

../../src/main/python/date_time_test.py:466: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../../src/main/python/asserts.py:641: in assert_gpu_and_cpu_are_equal_sql
    assert_gpu_and_cpu_are_equal_collect(do_it_all, conf, is_cpu_first=is_cpu_first)
../../src/main/python/asserts.py:599: in assert_gpu_and_cpu_are_equal_collect
    _assert_gpu_and_cpu_are_equal(func, 'COLLECT', conf=conf, is_cpu_first=is_cpu_first, result_canonicalize_func_before_compare=result_canonicalize_func_before_compare)
../../src/main/python/asserts.py:521: in _assert_gpu_and_cpu_are_equal
    assert_equal(from_cpu, from_gpu)
../../src/main/python/asserts.py:111: in assert_equal
    _assert_equal(cpu, gpu, float_check=get_float_check(), path=[])
../../src/main/python/asserts.py:43: in _assert_equal
    _assert_equal(cpu[index], gpu[index], float_check, path + [index])
../../src/main/python/asserts.py:36: in _assert_equal
    _assert_equal(cpu[field], gpu[field], float_check, path + [field])
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

cpu = 231388876800, gpu = 231388873200
float_check = <function get_float_check.<locals>.<lambda> at 0x7f7b98cebbe0>
path = [14, 'unix_timestamp(a, yyyyMMdd)']
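
As a quick sanity check on the values above: the CPU and GPU results differ by exactly one hour, which points at a time-zone offset discrepancy rather than a digit-parsing difference.

// Values copied from the assertion failure above
val cpu = 231388876800L
val gpu = 231388873200L
println(cpu - gpu) // 3600 seconds == 1 hour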


Expected behavior

Pass or ignore the case.



pxLi commented 4 days ago

Also cc @res-life to help; this case is still unstable in the non-UTC environment.

res-life commented 2 days ago

Spark behavior:

scala> spark.conf.set("spark.sql.session.timeZone", "Africa/Casablanca")

scala> spark.conf.set("spark.sql.legacy.timeParserPolicy", "CORRECTED")

scala> spark.sql("select unix_timestamp('42481005', 'yyyyMMdd')").show()
+----------------------------------+
|unix_timestamp(42481005, yyyyMMdd)|
+----------------------------------+
|                       71910716400|
+----------------------------------+

scala> 

scala> spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")

scala> spark.sql("select unix_timestamp('42481005', 'yyyyMMdd')").show()
+----------------------------------+
|unix_timestamp(42481005, yyyyMMdd)|
+----------------------------------+
|                       71910720000|
+----------------------------------+
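
An observation that follows from the numbers above: the LEGACY and CORRECTED results also differ by exactly one hour, the same gap as between the CPU and CPU-vs-GPU values in the failing test, which is consistent with the GPU producing the CORRECTED-style value.

// Values copied from the shell session above
val corrected = 71910716400L
val legacy    = 71910720000L
println(legacy - corrected) // 3600 seconds == 1 hour, same as the CPU/GPU gap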

GPU limitation

The GPU kernel is consistent with CORRECTED mode; it does not fully support LEGACY mode.

Follow-up

For LEGACY mode, we need to implement a kernel matching Spark's LEGACY-mode code (Link). That code uses SimpleDateFormat:

class LegacySimpleTimestampFormatter(
    pattern: String,
    zoneId: ZoneId,
    locale: Locale,
    lenient: Boolean = true) extends TimestampFormatter {
  @transient private lazy val sdf = {
    val formatter = new SimpleDateFormat(pattern, locale)
    formatter.setTimeZone(TimeZone.getTimeZone(zoneId))
    formatter.setLenient(lenient)
    formatter
  }

  override def parse(s: String): Long = {
    fromJavaTimestamp(new Timestamp(sdf.parse(s).getTime))
  }
  // ...
}
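
For reference, here is a minimal standalone sketch (an illustration only, not the plugin's implementation) of the LEGACY-style parsing the new kernel would need to reproduce; it uses the same SimpleDateFormat setup as the formatter above, with Locale.US assumed in place of the session locale:

import java.text.SimpleDateFormat
import java.util.{Locale, TimeZone}

// Parse a timestamp string the way LEGACY mode does: a lenient SimpleDateFormat
// bound to the session time zone, returning epoch seconds. The real Spark path
// additionally goes through fromJavaTimestamp / rebaseJulianToGregorianMicros,
// which matters for pre-Gregorian dates.
def legacyParseSeconds(s: String, pattern: String, zone: String): Long = {
  val sdf = new SimpleDateFormat(pattern, Locale.US) // Locale.US is an assumption for this sketch
  sdf.setTimeZone(TimeZone.getTimeZone(zone))
  sdf.setLenient(true)
  sdf.parse(s).getTime / 1000L
}

// Example with the value from the shell session above:
// legacyParseSeconds("42481005", "yyyyMMdd", "Africa/Casablanca")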

Workaround

Disable this case for branch 24.10 when TZ is not UTC or Asia/Shanghai. Update the documentation to clarify that not all non-DST (daylight saving time) time zones are supported; only the Asia/Shanghai time zone has been tested.

res-life commented 2 days ago

Let's use this issue to track support for the Africa/Casablanca time zone in LEGACY mode.

pxLi commented 2 days ago

Thanks! Retargeted the issue to 24.12.

revans2 commented 13 hours ago

We do not want to implement a kernel just for Africa/Casablanca unless we implement it for all time zones!!!

https://github.com/NVIDIA/spark-rapids/issues/6839

We picked Africa/Casablanca as an arbitrary time zone. It is one of the more complicated ones in terms of rules, but it is not there because a customer needs it. If we are going to develop a solution to a problem like this, we want it to be a general-purpose solution.

res-life commented 2 hours ago

Issue https://github.com/NVIDIA/spark-rapids/issues/6839 is for non-LEGACY mode, while this requirement is for LEGACY mode, so it is a separate requirement.

For LEGACY mode, from the code, it invokes LegacySimpleTimestampFormatter and gets a java.sql.Timestamp, then invokes fromJavaTimestamp and rebaseJulianToGregorianMicros. So this will need to be a separate kernel.

Spark has different behaviors between LEGACY mode and non-LEGACY mode.

res-life commented 2 hours ago

I think the priority for this issue is low.