NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

Integration tests failing for non-UTC timestamp in date_time_test.py #11539

Open nartal1 opened 5 days ago

nartal1 commented 5 days ago

The following nightly integration tests are failing:


FAILED ../../src/main/python/date_time_test.py::test_formats_for_legacy_mode[([0-9]{3}[1-9])([0-5][0-9])([0-3][0-9])-yyyyMMdd][DATAGEN_SEED=1727466731, TZ=America/Punta_Arenas] - AssertionError: GPU and CPU int values are different at [356, 'unix_timestamp(a, yyyyMMdd)']

FAILED ../../src/main/python/date_time_test.py::test_formats_for_legacy_mode[([0-9]{3}[1-9])([0-5][0-9])([0-3][0-9])-yyyymmdd][DATAGEN_SEED=1727466731, TZ=America/Punta_Arenas, INJECT_OOM] - AssertionError: GPU and CPU int values are different at [356, 'unix_timestamp(a, yyyymmdd)']


Additional information on the failing tests:
[2024-09-27T22:01:34.718Z] =================================== FAILURES ===================================

[2024-09-27T22:01:34.718Z] _ test_formats_for_legacy_mode[([0-9]{3}[1-9])([0-5][0-9])([0-3][0-9])-yyyyMMdd] _

[2024-09-27T22:01:34.718Z] [gw4] linux -- Python 3.10.15 /opt/conda/bin/python

[2024-09-27T22:01:34.718Z] 

[2024-09-27T22:01:34.718Z] format = 'yyyyMMdd', data_gen_regexp = '([0-9]{3}[1-9])([0-5][0-9])([0-3][0-9])'

[2024-09-27T22:01:34.718Z] 

[2024-09-27T22:01:34.718Z]     @pytest.mark.skipif(not is_supported_time_zone(), reason="not all time zones are supported now, refer to https://github.com/NVIDIA/spark-rapids/issues/6839, please update after all time zones are supported")

[2024-09-27T22:01:34.718Z]     @pytest.mark.parametrize("format", ['yyyyMMdd', 'yyyymmdd'], ids=idfn)

[2024-09-27T22:01:34.718Z]     # these regexps exclude zero year, python does not like zero year

[2024-09-27T22:01:34.718Z]     @pytest.mark.parametrize("data_gen_regexp", ['([0-9]{3}[1-9])([0-5][0-9])([0-3][0-9])', '([0-9]{3}[1-9])([0-9]{4})'], ids=idfn)

[2024-09-27T22:01:34.718Z]     def test_formats_for_legacy_mode(format, data_gen_regexp):

[2024-09-27T22:01:34.719Z]         gen = StringGen(data_gen_regexp)

[2024-09-27T22:01:34.719Z] >       assert_gpu_and_cpu_are_equal_sql(

[2024-09-27T22:01:34.719Z]             lambda spark : unary_op_df(spark, gen),

[2024-09-27T22:01:34.719Z]             "tab",

[2024-09-27T22:01:34.719Z]             '''select unix_timestamp(a, '{}'),

[2024-09-27T22:01:34.719Z]                       from_unixtime(unix_timestamp(a, '{}'), '{}'),

[2024-09-27T22:01:34.719Z]                       date_format(to_timestamp(a, '{}'), '{}')

[2024-09-27T22:01:34.719Z]                from tab

[2024-09-27T22:01:34.719Z]             '''.format(format, format, format, format, format),

[2024-09-27T22:01:34.719Z]             {  'spark.sql.legacy.timeParserPolicy': 'LEGACY',

[2024-09-27T22:01:34.719Z]                'spark.rapids.sql.incompatibleDateFormats.enabled': True})

[2024-09-27T22:01:34.719Z] 

[2024-09-27T22:01:34.719Z] ../../src/main/python/date_time_test.py:469: 

[2024-09-27T22:01:34.719Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

[2024-09-27T22:01:34.719Z] ../../src/main/python/asserts.py:641: in assert_gpu_and_cpu_are_equal_sql

[2024-09-27T22:01:34.719Z]     assert_gpu_and_cpu_are_equal_collect(do_it_all, conf, is_cpu_first=is_cpu_first)

[2024-09-27T22:01:34.719Z] ../../src/main/python/asserts.py:599: in assert_gpu_and_cpu_are_equal_collect

[2024-09-27T22:01:34.719Z]     _assert_gpu_and_cpu_are_equal(func, 'COLLECT', conf=conf, is_cpu_first=is_cpu_first, result_canonicalize_func_before_compare=result_canonicalize_func_before_compare)

[2024-09-27T22:01:34.719Z] ../../src/main/python/asserts.py:521: in _assert_gpu_and_cpu_are_equal

[2024-09-27T22:01:34.719Z]     assert_equal(from_cpu, from_gpu)

[2024-09-27T22:01:34.719Z] ../../src/main/python/asserts.py:111: in assert_equal

[2024-09-27T22:01:34.719Z]     _assert_equal(cpu, gpu, float_check=get_float_check(), path=[])

[2024-09-27T22:01:34.719Z] ../../src/main/python/asserts.py:43: in _assert_equal

[2024-09-27T22:01:34.719Z]     _assert_equal(cpu[index], gpu[index], float_check, path + [index])

[2024-09-27T22:01:34.719Z] ../../src/main/python/asserts.py:36: in _assert_equal

[2024-09-27T22:01:34.719Z]     _assert_equal(cpu[field], gpu[field], float_check, path + [field])

[2024-09-27T22:01:34.719Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

[2024-09-27T22:01:34.719Z] 

[2024-09-27T22:01:34.719Z] cpu = -2311528634, gpu = -2311528635

[2024-09-27T22:01:34.719Z] float_check = <function get_float_check.<locals>.<lambda> at 0x7f94e28a83a0>

[2024-09-27T22:01:34.719Z] path = [356, 'unix_timestamp(a, yyyyMMdd)']

[2024-09-27T22:01:34.719Z] 

[2024-09-27T22:01:34.719Z]     def _assert_equal(cpu, gpu, float_check, path):

[2024-09-27T22:01:34.719Z]         t = type(cpu)

[2024-09-27T22:01:34.719Z]         if (t is Row):

[2024-09-27T22:01:34.719Z]             assert len(cpu) == len(gpu), "CPU and GPU row have different lengths at {} CPU: {} GPU: {}".format(path, len(cpu), len(gpu))

[2024-09-27T22:01:34.719Z]             if hasattr(cpu, "__fields__") and hasattr(gpu, "__fields__"):

[2024-09-27T22:01:34.719Z]                 assert cpu.__fields__ == gpu.__fields__, "CPU and GPU row have different fields at {} CPU: {} GPU: {}".format(path, cpu.__fields__, gpu.__fields__)

[2024-09-27T22:01:34.719Z]                 for field in cpu.__fields__:

[2024-09-27T22:01:34.719Z]                     _assert_equal(cpu[field], gpu[field], float_check, path + [field])

[2024-09-27T22:01:34.719Z]             else:

[2024-09-27T22:01:34.719Z]                 for index in range(len(cpu)):

[2024-09-27T22:01:34.719Z]                     _assert_equal(cpu[index], gpu[index], float_check, path + [index])

[2024-09-27T22:01:34.719Z]         elif (t is list):

[2024-09-27T22:01:34.719Z]             assert len(cpu) == len(gpu), "CPU and GPU list have different lengths at {} CPU: {} GPU: {}".format(path, len(cpu), len(gpu))

[2024-09-27T22:01:34.719Z]             for index in range(len(cpu)):

[2024-09-27T22:01:34.719Z]                 _assert_equal(cpu[index], gpu[index], float_check, path + [index])

[2024-09-27T22:01:34.719Z]         elif (t is tuple):

[2024-09-27T22:01:34.719Z]             assert len(cpu) == len(gpu), "CPU and GPU list have different lengths at {} CPU: {} GPU: {}".format(path, len(cpu), len(gpu))

[2024-09-27T22:01:34.719Z]             for index in range(len(cpu)):

[2024-09-27T22:01:34.719Z]                 _assert_equal(cpu[index], gpu[index], float_check, path + [index])

[2024-09-27T22:01:34.719Z]         elif (t is pytypes.GeneratorType):

[2024-09-27T22:01:34.719Z]             index = 0

[2024-09-27T22:01:34.719Z]             # generator has no zip :( so we have to do this the hard way

[2024-09-27T22:01:34.719Z]             done = False

[2024-09-27T22:01:34.719Z]             while not done:

[2024-09-27T22:01:34.719Z]                 sub_cpu = None

[2024-09-27T22:01:34.719Z]                 sub_gpu = None

[2024-09-27T22:01:34.719Z]                 try:

[2024-09-27T22:01:34.719Z]                     sub_cpu = next(cpu)

[2024-09-27T22:01:34.719Z]                 except StopIteration:

[2024-09-27T22:01:34.719Z]                     done = True

[2024-09-27T22:01:34.719Z]     

[2024-09-27T22:01:34.719Z]                 try:

[2024-09-27T22:01:34.719Z]                     sub_gpu = next(gpu)

[2024-09-27T22:01:34.719Z]                 except StopIteration:

[2024-09-27T22:01:34.720Z]                     done = True

[2024-09-27T22:01:34.720Z]     

[2024-09-27T22:01:34.720Z]                 if done:

[2024-09-27T22:01:34.720Z]                     assert sub_cpu == sub_gpu and sub_cpu == None, "CPU and GPU generators have different lengths at {}".format(path)

[2024-09-27T22:01:34.720Z]                 else:

[2024-09-27T22:01:34.720Z]                     _assert_equal(sub_cpu, sub_gpu, float_check, path + [index])

[2024-09-27T22:01:34.720Z]     

[2024-09-27T22:01:34.720Z]                 index = index + 1

[2024-09-27T22:01:34.720Z]         elif (t is dict):

[2024-09-27T22:01:34.720Z]             # The order of key/values is not guaranteed in python dicts, nor are they guaranteed by Spark

[2024-09-27T22:01:34.720Z]             # so sort the items to do our best with ignoring the order of dicts

[2024-09-27T22:01:34.720Z]             cpu_items = list(cpu.items()).sort(key=_RowCmp)

[2024-09-27T22:01:34.720Z]             gpu_items = list(gpu.items()).sort(key=_RowCmp)

[2024-09-27T22:01:34.720Z]             _assert_equal(cpu_items, gpu_items, float_check, path + ["map"])

[2024-09-27T22:01:34.720Z]         elif (t is int):

[2024-09-27T22:01:34.720Z] >           assert cpu == gpu, "GPU and CPU int values are different at {}".format(path)

[2024-09-27T22:01:34.720Z] E           AssertionError: GPU and CPU int values are different at [356, 'unix_timestamp(a, yyyyMMdd)']

[2024-09-27T22:01:34.720Z] 

[2024-09-27T22:01:34.720Z] ../../src/main/python/asserts.py:78: AssertionError

[2024-09-27T22:01:34.720Z] ----------------------------- Captured stdout call -----------------------------

[2024-09-27T22:01:34.720Z] ### CPU RUN ###

[2024-09-27T22:01:34.720Z] ### GPU RUN ###

[2024-09-27T22:01:34.720Z] ### COLLECT: GPU TOOK 0.20502257347106934 CPU TOOK 0.25730133056640625 ###

[2024-09-27T22:01:34.720Z] --- CPU OUTPUT

[2024-09-27T22:01:34.720Z] +++ GPU OUTPUT

[2024-09-27T22:01:34.720Z] @@ -354,7 +354,7 @@

[2024-09-27T22:01:34.720Z]  Row(unix_timestamp(a, yyyyMMdd)=1851217200, from_unixtime(unix_timestamp(a, yyyyMMdd), yyyyMMdd)='20280830', date_format(to_timestamp(a, yyyyMMdd), yyyyMMdd)='20280830')

[2024-09-27T22:01:34.720Z]  Row(unix_timestamp(a, yyyyMMdd)=None, from_unixtime(unix_timestamp(a, yyyyMMdd), yyyyMMdd)=None, date_format(to_timestamp(a, yyyyMMdd), yyyyMMdd)=None)

[2024-09-27T22:01:34.720Z]  Row(unix_timestamp(a, yyyyMMdd)=None, from_unixtime(unix_timestamp(a, yyyyMMdd), yyyyMMdd)=None, date_format(to_timestamp(a, yyyyMMdd), yyyyMMdd)=None)

[2024-09-27T22:01:34.720Z] -Row(unix_timestamp(a, yyyyMMdd)=-2311528634, from_unixtime(unix_timestamp(a, yyyyMMdd), yyyyMMdd)='18961001', date_format(to_timestamp(a, yyyyMMdd), yyyyMMdd)='18961001')

[2024-09-27T22:01:34.720Z] +Row(unix_timestamp(a, yyyyMMdd)=-2311528635, from_unixtime(unix_timestamp(a, yyyyMMdd), yyyyMMdd)='18961001', date_format(to_timestamp(a, yyyyMMdd), yyyyMMdd)='18961001')

[2024-09-27T22:01:34.720Z]  Row(unix_timestamp(a, yyyyMMdd)=None, from_unixtime(unix_timestamp(a, yyyyMMdd), yyyyMMdd)=None, date_format(to_timestamp(a, yyyyMMdd), yyyyMMdd)=None)

[2024-09-27T22:01:34.720Z]  Row(unix_timestamp(a, yyyyMMdd)=None, from_unixtime(unix_timestamp(a, yyyyMMdd), yyyyMMdd)=None, date_format(to_timestamp(a, yyyyMMdd), yyyyMMdd)=None)

[2024-09-27T22:01:34.720Z]  Row(unix_timestamp(a, yyyyMMdd)=None, from_unixtime(unix_timestamp(a, yyyyMMdd), yyyyMMdd)=None, date_format(to_timestamp(a, yyyyMMdd), yyyyMMdd)=None)

[2024-09-27T22:01:34.720Z] _ test_formats_for_legacy_mode[([0-9]{3}[1-9])([0-5][0-9])([0-3][0-9])-yyyymmdd] _

[2024-09-27T22:01:34.720Z] [gw4] linux -- Python 3.10.15 /opt/conda/bin/python

[2024-09-27T22:01:34.720Z] 

[2024-09-27T22:01:34.720Z] format = 'yyyymmdd', data_gen_regexp = '([0-9]{3}[1-9])([0-5][0-9])([0-3][0-9])'

[2024-09-27T22:01:34.720Z] 

[2024-09-27T22:01:34.720Z]     @pytest.mark.skipif(not is_supported_time_zone(), reason="not all time zones are supported now, refer to https://github.com/NVIDIA/spark-rapids/issues/6839, please update after all time zones are supported")

[2024-09-27T22:01:34.720Z]     @pytest.mark.parametrize("format", ['yyyyMMdd', 'yyyymmdd'], ids=idfn)

[2024-09-27T22:01:34.720Z]     # these regexps exclude zero year, python does not like zero year

[2024-09-27T22:01:34.720Z]     @pytest.mark.parametrize("data_gen_regexp", ['([0-9]{3}[1-9])([0-5][0-9])([0-3][0-9])', '([0-9]{3}[1-9])([0-9]{4})'], ids=idfn)

[2024-09-27T22:01:34.720Z]     def test_formats_for_legacy_mode(format, data_gen_regexp):

[2024-09-27T22:01:34.720Z]         gen = StringGen(data_gen_regexp)

[2024-09-27T22:01:34.720Z] >       assert_gpu_and_cpu_are_equal_sql(

[2024-09-27T22:01:34.720Z]             lambda spark : unary_op_df(spark, gen),

[2024-09-27T22:01:34.720Z]             "tab",

[2024-09-27T22:01:34.720Z]             '''select unix_timestamp(a, '{}'),

[2024-09-27T22:01:34.720Z]                       from_unixtime(unix_timestamp(a, '{}'), '{}'),

[2024-09-27T22:01:34.720Z]                       date_format(to_timestamp(a, '{}'), '{}')

[2024-09-27T22:01:34.720Z]                from tab

[2024-09-27T22:01:34.720Z]             '''.format(format, format, format, format, format),

[2024-09-27T22:01:34.720Z]             {  'spark.sql.legacy.timeParserPolicy': 'LEGACY',

[2024-09-27T22:01:34.720Z]                'spark.rapids.sql.incompatibleDateFormats.enabled': True})

[2024-09-27T22:01:34.720Z] 

[2024-09-27T22:01:34.720Z] ../../src/main/python/date_time_test.py:469: 

[2024-09-27T22:01:34.720Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

[2024-09-27T22:01:34.720Z] ../../src/main/python/asserts.py:641: in assert_gpu_and_cpu_are_equal_sql

[2024-09-27T22:01:34.720Z]     assert_gpu_and_cpu_are_equal_collect(do_it_all, conf, is_cpu_first=is_cpu_first)

[2024-09-27T22:01:34.720Z] ../../src/main/python/asserts.py:599: in assert_gpu_and_cpu_are_equal_collect

[2024-09-27T22:01:34.720Z]     _assert_gpu_and_cpu_are_equal(func, 'COLLECT', conf=conf, is_cpu_first=is_cpu_first, result_canonicalize_func_before_compare=result_canonicalize_func_before_compare)

[2024-09-27T22:01:34.720Z] ../../src/main/python/asserts.py:521: in _assert_gpu_and_cpu_are_equal

[2024-09-27T22:01:34.720Z]     assert_equal(from_cpu, from_gpu)

[2024-09-27T22:01:34.720Z] ../../src/main/python/asserts.py:111: in assert_equal

[2024-09-27T22:01:34.720Z]     _assert_equal(cpu, gpu, float_check=get_float_check(), path=[])

[2024-09-27T22:01:34.721Z] ../../src/main/python/asserts.py:43: in _assert_equal

[2024-09-27T22:01:34.721Z]     _assert_equal(cpu[index], gpu[index], float_check, path + [index])

[2024-09-27T22:01:34.721Z] ../../src/main/python/asserts.py:36: in _assert_equal

[2024-09-27T22:01:34.721Z]     _assert_equal(cpu[field], gpu[field], float_check, path + [field])

[2024-09-27T22:01:34.721Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

[2024-09-27T22:01:34.721Z] 

[2024-09-27T22:01:34.721Z] cpu = -2335201634, gpu = -2335201635

[2024-09-27T22:01:34.721Z] float_check = <function get_float_check.<locals>.<lambda> at 0x7f94e22bd870>

[2024-09-27T22:01:34.721Z] path = [356, 'unix_timestamp(a, yyyymmdd)']

[2024-09-27T22:01:34.721Z] 

[2024-09-27T22:01:34.721Z]     def _assert_equal(cpu, gpu, float_check, path):

[2024-09-27T22:01:34.721Z]         t = type(cpu)

[2024-09-27T22:01:34.721Z]         if (t is Row):

[2024-09-27T22:01:34.721Z]             assert len(cpu) == len(gpu), "CPU and GPU row have different lengths at {} CPU: {} GPU: {}".format(path, len(cpu), len(gpu))

[2024-09-27T22:01:34.721Z]             if hasattr(cpu, "__fields__") and hasattr(gpu, "__fields__"):

[2024-09-27T22:01:34.721Z]                 assert cpu.__fields__ == gpu.__fields__, "CPU and GPU row have different fields at {} CPU: {} GPU: {}".format(path, cpu.__fields__, gpu.__fields__)

[2024-09-27T22:01:34.721Z]                 for field in cpu.__fields__:

[2024-09-27T22:01:34.721Z]                     _assert_equal(cpu[field], gpu[field], float_check, path + [field])

[2024-09-27T22:01:34.721Z]             else:

[2024-09-27T22:01:34.721Z]                 for index in range(len(cpu)):

[2024-09-27T22:01:34.721Z]                     _assert_equal(cpu[index], gpu[index], float_check, path + [index])

[2024-09-27T22:01:34.721Z]         elif (t is list):

[2024-09-27T22:01:34.721Z]             assert len(cpu) == len(gpu), "CPU and GPU list have different lengths at {} CPU: {} GPU: {}".format(path, len(cpu), len(gpu))

[2024-09-27T22:01:34.721Z]             for index in range(len(cpu)):

[2024-09-27T22:01:34.721Z]                 _assert_equal(cpu[index], gpu[index], float_check, path + [index])

[2024-09-27T22:01:34.721Z]         elif (t is tuple):

[2024-09-27T22:01:34.721Z]             assert len(cpu) == len(gpu), "CPU and GPU list have different lengths at {} CPU: {} GPU: {}".format(path, len(cpu), len(gpu))

[2024-09-27T22:01:34.721Z]             for index in range(len(cpu)):

[2024-09-27T22:01:34.721Z]                 _assert_equal(cpu[index], gpu[index], float_check, path + [index])

[2024-09-27T22:01:34.721Z]         elif (t is pytypes.GeneratorType):

[2024-09-27T22:01:34.721Z]             index = 0

[2024-09-27T22:01:34.721Z]             # generator has no zip :( so we have to do this the hard way

[2024-09-27T22:01:34.721Z]             done = False

[2024-09-27T22:01:34.721Z]             while not done:

[2024-09-27T22:01:34.721Z]                 sub_cpu = None

[2024-09-27T22:01:34.721Z]                 sub_gpu = None

[2024-09-27T22:01:34.721Z]                 try:

[2024-09-27T22:01:34.721Z]                     sub_cpu = next(cpu)

[2024-09-27T22:01:34.721Z]                 except StopIteration:

[2024-09-27T22:01:34.721Z]                     done = True

[2024-09-27T22:01:34.721Z]     

[2024-09-27T22:01:34.721Z]                 try:

[2024-09-27T22:01:34.721Z]                     sub_gpu = next(gpu)

[2024-09-27T22:01:34.721Z]                 except StopIteration:

[2024-09-27T22:01:34.721Z]                     done = True

[2024-09-27T22:01:34.721Z]     

[2024-09-27T22:01:34.721Z]                 if done:

[2024-09-27T22:01:34.721Z]                     assert sub_cpu == sub_gpu and sub_cpu == None, "CPU and GPU generators have different lengths at {}".format(path)

[2024-09-27T22:01:34.721Z]                 else:

[2024-09-27T22:01:34.721Z]                     _assert_equal(sub_cpu, sub_gpu, float_check, path + [index])

[2024-09-27T22:01:34.721Z]     

[2024-09-27T22:01:34.721Z]                 index = index + 1

[2024-09-27T22:01:34.721Z]         elif (t is dict):

[2024-09-27T22:01:34.721Z]             # The order of key/values is not guaranteed in python dicts, nor are they guaranteed by Spark

[2024-09-27T22:01:34.721Z]             # so sort the items to do our best with ignoring the order of dicts

[2024-09-27T22:01:34.721Z]             cpu_items = list(cpu.items()).sort(key=_RowCmp)

[2024-09-27T22:01:34.721Z]             gpu_items = list(gpu.items()).sort(key=_RowCmp)

[2024-09-27T22:01:34.721Z]             _assert_equal(cpu_items, gpu_items, float_check, path + ["map"])

[2024-09-27T22:01:34.721Z]         elif (t is int):

[2024-09-27T22:01:34.721Z] >           assert cpu == gpu, "GPU and CPU int values are different at {}".format(path)

[2024-09-27T22:01:34.721Z] E           AssertionError: GPU and CPU int values are different at [356, 'unix_timestamp(a, yyyymmdd)']

[2024-09-27T22:01:34.721Z] 

[2024-09-27T22:01:34.721Z] ../../src/main/python/asserts.py:78: AssertionError

[2024-09-27T22:01:34.721Z] ----------------------------- Captured stdout call -----------------------------

[2024-09-27T22:01:34.721Z] ### CPU RUN ###

[2024-09-27T22:01:34.721Z] ### GPU RUN ###

[2024-09-27T22:01:34.721Z] ### COLLECT: GPU TOOK 0.20046424865722656 CPU TOOK 0.2261199951171875 ###

[2024-09-27T22:01:34.721Z] --- CPU OUTPUT

[2024-09-27T22:01:34.721Z] +++ GPU OUTPUT

[2024-09-27T22:01:34.721Z] @@ -354,7 +354,7 @@

[2024-09-27T22:01:34.721Z]  Row(unix_timestamp(a, yyyymmdd)=1832814480, from_unixtime(unix_timestamp(a, yyyymmdd), yyyymmdd)='20280830', date_format(to_timestamp(a, yyyymmdd), yyyymmdd)='20280830')

[2024-09-27T22:01:34.721Z]  Row(unix_timestamp(a, yyyymmdd)=188302419960, from_unixtime(unix_timestamp(a, yyyymmdd), yyyymmdd)='79374625', date_format(to_timestamp(a, yyyymmdd), yyyymmdd)='79374625')

[2024-09-27T22:01:34.722Z]  Row(unix_timestamp(a, yyyymmdd)=106473843300, from_unixtime(unix_timestamp(a, yyyymmdd), yyyymmdd)='53443509', date_format(to_timestamp(a, yyyymmdd), yyyymmdd)='53443509')

[2024-09-27T22:01:34.722Z] -Row(unix_timestamp(a, yyyymmdd)=-2335201634, from_unixtime(unix_timestamp(a, yyyymmdd), yyyymmdd)='18961001', date_format(to_timestamp(a, yyyymmdd), yyyymmdd)='18961001')

[2024-09-27T22:01:34.722Z] +Row(unix_timestamp(a, yyyymmdd)=-2335201635, from_unixtime(unix_timestamp(a, yyyymmdd), yyyymmdd)='18961001', date_format(to_timestamp(a, yyyymmdd), yyyymmdd)='18961001')

[2024-09-27T22:01:34.722Z]  Row(unix_timestamp(a, yyyymmdd)=54974432640, from_unixtime(unix_timestamp(a, yyyymmdd), yyyymmdd)='37124427', date_format(to_timestamp(a, yyyymmdd), yyyymmdd)='37124427')

[2024-09-27T22:01:34.722Z]  Row(unix_timestamp(a, yyyymmdd)=138914363880, from_unixtime(unix_timestamp(a, yyyymmdd), yyyymmdd)='63721809', date_format(to_timestamp(a, yyyymmdd), yyyymmdd)='63721809')

[2024-09-27T22:01:34.722Z]  Row(unix_timestamp(a, yyyymmdd)=252931605180, from_unixtime(unix_timestamp(a, yyyymmdd), yyyymmdd)='99851331', date_format(to_timestamp(a, yyyymmdd), yyyymmdd)='99851331')
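
The two failures report different epoch values for the same input string, 18961001. A likely reading of the log is that the second pattern, yyyymmdd, uses lowercase mm, which in Java-style datetime patterns means minute-of-hour rather than month, so the month falls back to January. The sketch below checks that reading with plain Python arithmetic on the values in the log above. It does not use Spark or any time zone database, and the variable names are illustrative only.

```python
from datetime import datetime

# CPU results copied from the two failures in the log above.
cpu_yyyyMMdd = -2311528634  # unix_timestamp('18961001', 'yyyyMMdd')
cpu_yyyymmdd = -2335201634  # unix_timestamp('18961001', 'yyyymmdd')

# 'yyyyMMdd' reads 18961001 as year 1896, month 10, day 01.
as_month = datetime(1896, 10, 1, 0, 0, 0)
# 'yyyymmdd' reads it as year 1896, minute 10, day 01 (assumed: month defaults to January).
as_minute = datetime(1896, 1, 1, 0, 10, 0)

# Distance between the two wall-clock readings, in seconds.
wall_clock_gap = int((as_month - as_minute).total_seconds())
# Distance between the two reported epoch values, in seconds.
epoch_gap = cpu_yyyyMMdd - cpu_yyyymmdd

print(wall_clock_gap, epoch_gap)  # 23673000 23673000
```

Both gaps are 23673000 seconds, so the two failures appear to be the same one-second mismatch observed at two different local times in 1896.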

res-life commented 3 days ago

Reproduce

select unix_timestamp('18961001', 'yyyyMMdd')
with config:
  'spark.sql.legacy.timeParserPolicy': 'LEGACY',
  'spark.rapids.sql.incompatibleDateFormats.enabled': True
with timezone:
  America/Punta_Arenas

CPU: -2311528634 GPU: -2311528635

The two results differ by one second. Note: other time zones such as Asia/Shanghai and Iran do not have this issue.
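
The one-second difference can also be expressed as the UTC offset each side effectively applies to local midnight on 1896-10-01. The following is a small sketch in plain Python that derives those implied offsets from the CPU and GPU values above. It does not consult any time zone database, and the helper name implied_offset is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Local midnight 1896-10-01, interpreted as if it were a UTC instant.
local_as_utc = int(datetime(1896, 10, 1, tzinfo=timezone.utc).timestamp())

def implied_offset(epoch_seconds: int) -> timedelta:
    """How far local time is behind UTC if local midnight maps to epoch_seconds."""
    return timedelta(seconds=epoch_seconds - local_as_utc)

cpu = implied_offset(-2311528634)  # CPU result from the reproduction
gpu = implied_offset(-2311528635)  # GPU result from the reproduction

print(local_as_utc)  # -2311545600
print(cpu)           # 4:42:46
print(gpu)           # 4:42:45
```

In other words, the CPU value corresponds to local time being 4:42:46 behind UTC, and the GPU value to 4:42:45 behind UTC. This only restates the mismatch; it does not by itself show which side is right.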

Analysis

Test in a Spark 330 (3.3.0) shell:

scala> import java.time._
import java.time._

scala> import org.apache.spark.sql.catalyst.util.DateTimeUtils
import org.apache.spark.sql.catalyst.util.DateTimeUtils

scala> val epochSeconds = LocalDateTime.of(1896,10,1,0,0,0).toInstant(ZoneOffset.UTC).getEpochSecond()
epochSeconds: Long = -2311545600

scala> val micros = epochSeconds * 1000000
micros: Long = -2311545600000000

scala> val expected = DateTimeUtils.convertTz(micros, ZoneId.of("America/Punta_Arenas"),  ZoneId.of("UTC"))/1000000L
expected: Long = -2311528635    // this is the same as the GPU output

Test non-LEGACY mode

Save the following value into a Parquet file:

1896-10-01

Then run:

select unix_timestamp(col, 'yyyy-MM-dd') from tab

The results are correct, and CPU and GPU agree:

CPU: -2311528635
GPU: -2311528635

Conclusion

This is a corner case in LEGACY mode; non-LEGACY mode does not have this problem. Other time zones such as Asia/Shanghai and Iran do not have this issue.

TODO

Debug Spark to see what happens in LEGACY mode.

res-life commented 3 days ago

Spark behaves differently in LEGACY and non-LEGACY modes:

Spark330:

scala> spark.conf.set("spark.sql.session.timeZone", "America/Punta_Arenas")

scala> spark.conf.set("spark.sql.legacy.timeParserPolicy", "CORRECTED")

scala> spark.sql("select unix_timestamp('18961001', 'yyyyMMdd')").show()
+----------------------------------+
|unix_timestamp(18961001, yyyyMMdd)|
+----------------------------------+
|                       -2311528635|
+----------------------------------+

scala> spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")

scala> spark.sql("select unix_timestamp('18961001', 'yyyyMMdd')").show()
+----------------------------------+
|unix_timestamp(18961001, yyyyMMdd)|
+----------------------------------+
|                       -2311535143|
+----------------------------------+

res-life commented 3 days ago

We already documented that LEGACY mode has several limitations:

LEGACY timeParserPolicy support has the following limitations when running on the GPU:

Only 4 digit years are supported
The proleptic Gregorian calendar is used instead of the hybrid Julian+Gregorian calendar that Spark uses in legacy mode
When the format is yyyyMMdd, the GPU only supports 8 digit strings. Spark also accepts 7 digit strings such as 2024101, which the GPU does not support.