NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

[BUG] parquet_testing_test.py failed on "AssertionError: GPU and CPU boolean values are different" #11715

Closed NvTimLiu closed 1 week ago

NvTimLiu commented 1 week ago

Describe the bug

So far this failure has occurred only on Databricks-12.2. Let's keep an eye on it to see whether it is reproducible.

 =================================== FAILURES ===================================
 _ test_parquet_testing_valid_files[confs0-/home/ubuntu/spark-rapids/thirdparty/parquet-testing/data/alltypes_tiny_pages.parquet] _
 [gw0] linux -- Python 3.8.10 /usr/bin/python

 path = '/home/ubuntu/spark-rapids/thirdparty/parquet-testing/data/alltypes_tiny_pages.parquet'
 confs = {'spark.rapids.sql.format.parquet.reader.footer.type': 'NATIVE', 'spark.sql.legacy.parquet.datetimeRebaseModeInRead': 'CORRECTED', uet.int96RebaseModeInRead': 'CORRECTED'}

     @pytest.mark.parametrize("path", gen_testing_params_for_valid_files())
     @pytest.mark.parametrize("confs", [_native_reader_confs, _java_reader_confs])
     @allow_non_gpu(*non_utc_allow)
     def test_parquet_testing_valid_files(path, confs):
 >       assert_gpu_and_cpu_are_equal_collect(lambda spark: spark.read.parquet(path), conf=confs)

 ../../src/main/python/parquet_testing_test.py:162: 
 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
 ../../src/main/python/asserts.py:599: in assert_gpu_and_cpu_are_equal_collect
     _assert_gpu_and_cpu_are_equal(func, 'COLLECT', conf=conf, is_cpu_first=is_cpu_first, nc_before_compare=result_canonicalize_func_before_compare)
 ../../src/main/python/asserts.py:521: in _assert_gpu_and_cpu_are_equal
     assert_equal(from_cpu, from_gpu)
 ../../src/main/python/asserts.py:111: in assert_equal
     _assert_equal(cpu, gpu, float_check=get_float_check(), path=[])
 ../../src/main/python/asserts.py:43: in _assert_equal
     _assert_equal(cpu[index], gpu[index], float_check, path + [index])
 ../../src/main/python/asserts.py:36: in _assert_equal
     _assert_equal(cpu[field], gpu[field], float_check, path + [field])
 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

 cpu = False, gpu = True
 float_check = <function get_float_check.<locals>.<lambda> at 0x7f73eac6a5e0>
 path = [2047, 'bool_col']

     def _assert_equal(cpu, gpu, float_check, path):
         t = type(cpu)
         if (t is Row):
             assert len(cpu) == len(gpu), "CPU and GPU row have different lengths at {} CPU: {} GPU: {}".format(path, len(cpu), len(gpu))
             if hasattr(cpu, "__fields__") and hasattr(gpu, "__fields__"):
                 assert cpu.__fields__ == gpu.__fields__, "CPU and GPU row have different fields at {} CPU: {} GPU: {}".format(path, cpu.__fields__, 
                 for field in cpu.__fields__:
                     _assert_equal(cpu[field], gpu[field], float_check, path + [field])
             else:
                 for index in range(len(cpu)):
                     _assert_equal(cpu[index], gpu[index], float_check, path + [index])
         elif (t is list):
             assert len(cpu) == len(gpu), "CPU and GPU list have different lengths at {} CPU: {} GPU: {}".format(path, len(cpu), len(gpu))
             for index in range(len(cpu)):
                 _assert_equal(cpu[index], gpu[index], float_check, path + [index])
         elif (t is tuple):
             assert len(cpu) == len(gpu), "CPU and GPU list have different lengths at {} CPU: {} GPU: {}".format(path, len(cpu), len(gpu))
             for index in range(len(cpu)):
                 _assert_equal(cpu[index], gpu[index], float_check, path + [index])
         elif (t is pytypes.GeneratorType):
             index = 0
             # generator has no zip :( so we have to do this the hard way
             done = False
             while not done:
                 sub_cpu = None
                 sub_gpu = None
                 try:
                     sub_cpu = next(cpu)
                 except StopIteration:
                     done = True

                 try:
                     sub_gpu = next(gpu)
                 except StopIteration:
                     done = True

                 if done:
                     assert sub_cpu == sub_gpu and sub_cpu == None, "CPU and GPU generators have different lengths at {}".format(path)
                 else:
                     _assert_equal(sub_cpu, sub_gpu, float_check, path + [index])

                 index = index + 1
         elif (t is dict):
             # The order of key/values is not guaranteed in python dicts, nor are they guaranteed by Spark
             # so sort the items to do our best with ignoring the order of dicts
             cpu_items = list(cpu.items()).sort(key=_RowCmp)
             gpu_items = list(gpu.items()).sort(key=_RowCmp)
             _assert_equal(cpu_items, gpu_items, float_check, path + ["map"])
         elif (t is int):
             assert cpu == gpu, "GPU and CPU int values are different at {}".format(path)
         elif (t is float):
             if (math.isnan(cpu)):
                 assert math.isnan(gpu), "GPU and CPU float values are different at {}".format(path)
             else:
                 assert float_check(cpu, gpu), "GPU and CPU float values are different {}".format(path)
         elif isinstance(cpu, str):
             assert cpu == gpu, "GPU and CPU string values are different at {}".format(path)
         elif isinstance(cpu, datetime):
             assert cpu == gpu, "GPU and CPU timestamp values are different at {}".format(path)
         elif isinstance(cpu, date):
             assert cpu == gpu, "GPU and CPU date values are different at {}".format(path)
         elif isinstance(cpu, bool):
 >           assert cpu == gpu, "GPU and CPU boolean values are different at {}".format(path)
 E           AssertionError: GPU and CPU boolean values are different at [2047, 'bool_col']

 ../../src/main/python/asserts.py:91: AssertionError
  ---------------------------- Captured stderr setup -----------------------------
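
The path in the assertion message, `[2047, 'bool_col']`, reads as "row index 2047, field `bool_col`": the recursive comparison appends the index or field name at each level before descending. Below is a minimal sketch of that idea, not the plugin's `_assert_equal` itself. The function name `first_difference` is hypothetical, and plain dicts stand in for Spark `Row` objects as an assumption to keep the example self-contained.

```python
import math


def first_difference(cpu, gpu, path=()):
    """Return the path to the first differing value, or None if equal.

    Hypothetical helper: dicts stand in for Spark Rows (assumption),
    lists/tuples stand in for collected result sets or array columns.
    """
    if isinstance(cpu, dict) and isinstance(gpu, dict):
        if cpu.keys() != gpu.keys():
            return list(path)
        for field in cpu:
            found = first_difference(cpu[field], gpu[field], path + (field,))
            if found is not None:
                return found
        return None
    if isinstance(cpu, (list, tuple)) and isinstance(gpu, (list, tuple)):
        if len(cpu) != len(gpu):
            return list(path)
        for index, (sub_cpu, sub_gpu) in enumerate(zip(cpu, gpu)):
            found = first_difference(sub_cpu, sub_gpu, path + (index,))
            if found is not None:
                return found
        return None
    # Treat NaN as equal to NaN, as the float branch in the log above does.
    if isinstance(cpu, float) and isinstance(gpu, float):
        if math.isnan(cpu) and math.isnan(gpu):
            return None
    return None if cpu == gpu else list(path)


# Three made-up rows; the "GPU" copy differs only in row 2, field bool_col.
cpu_rows = [
    {"id": 0, "bool_col": True},
    {"id": 1, "bool_col": False},
    {"id": 2, "bool_col": True},
]
gpu_rows = [dict(row) for row in cpu_rows]
gpu_rows[2]["bool_col"] = False

print(first_difference(cpu_rows, gpu_rows))  # [2, 'bool_col']
```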
jlowe commented 1 week ago

@NvTimLiu please include the DATAGEN_SEED setting for any test failure, as it may be crucial for reproducing it.

In this case, it was DATAGEN_SEED=1731408247

parthosa commented 1 week ago

Similar failure on Databricks Azure 13.3

integration_tests/src/test/resources/parquet-testing/data/alltypes_tiny_pages.parquet][DATAGEN_SEED=1731427707, TZ=UTC, INJECT_OOM]
    - AssertionError: GPU and CPU boolean values are different at [2855, 'bool_col']

pmattione-nvidia commented 1 week ago

I hit it locally, but only intermittently. It must be a timing bug; looking into it.
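
One generic way to study an intermittent failure like this is to rerun the same check many times and record which attempts fail. The sketch below is only an illustration under stated assumptions: `count_failures` and `make_flaky_check` are hypothetical names, and the flaky check is a seeded random stand-in, not the real GPU/CPU comparison or the actual cause of this bug.

```python
import random


def count_failures(check, attempts):
    """Call check() `attempts` times and return the indices that raised AssertionError."""
    failed = []
    for attempt in range(attempts):
        try:
            check()
        except AssertionError:
            failed.append(attempt)
    return failed


def make_flaky_check(seed, failure_rate):
    """Build a stand-in check that fails with roughly `failure_rate` probability.

    Hypothetical: a seeded RNG replaces the real comparison so the sketch
    is self-contained and repeatable.
    """
    rng = random.Random(seed)

    def check():
        # rng.random() is in [0.0, 1.0), so rate 0.0 never fails and 1.0 always fails.
        assert rng.random() >= failure_rate, "GPU and CPU boolean values are different"

    return check


failed_attempts = count_failures(make_flaky_check(seed=0, failure_rate=0.1), attempts=200)
print("failed {} of 200 attempts".format(len(failed_attempts)))
```

In a real investigation the stand-in would be replaced by the actual test invocation, keeping inputs such as DATAGEN_SEED fixed so that any remaining variation points to timing rather than data.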

pmattione-nvidia commented 1 week ago

Fixed by this cuDF PR.