NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

[BUG] delta_lake_delete_test.py failed assertion [DATAGEN_SEED=1701225104, IGNORE_ORDER... #9884

Closed. pxLi closed this issue 9 months ago.

pxLi commented 10 months ago

Describe the bug
First seen in rapids_integration-dev-github, build ID 863 (JDK 11 runtime + Spark 3.3.0).

[2023-11-29T02:56:13.493Z] FAILED ../../src/main/python/delta_lake_delete_test.py::test_delta_delete_rows[None-True][DATAGEN_SEED=1701225104, IGNORE_ORDER, ALLOW_NON_GPU(DeserializeToObjectExec,ShuffleExchangeExec,FileSourceScanExec,FilterExec,MapPartitionsExec,MapElementsExec,ObjectHashAggregateExec,ProjectExec,SerializeFromObjectExec,SortExec)] - AssertionError: Delta log 00000000000000000002.json is different at key 'add':
[2023-11-29T02:56:13.493Z] CPU: {'path': 'partsnappy.parquet', 'partitionValues': {}, 'dataChange': True, 'stats': '{"numRecords":537,"minValues":{"a":0,"b":"d","c":"\\u0000¶\'\x86åJ\\u0000J\x94C\x8cï=\\"7¡àÛ+\x9aànÊ\x92±`Iì\\b\x99"},"maxValues":{"a":4,"b":"g","c":"ÿÆ\\u0012\x93Yп\\u000EwÐ.Ñ\x8a®tîá|\\u0017=8§¦\x98³þ÷\x9bܤ"},"nullCount":{"a":0,"b":0,"c":29,"_change_type":537}}'}
[2023-11-29T02:56:13.493Z] GPU: {'path': 'partsnappy.parquet', 'partitionValues': {}, 'dataChange': True, 'stats': '{"numRecords":541,"minValues":{"a":0,"b":"d","c":"\\u0000¶\'\x86åJ\\u0000J\x94C\x8cï=\\"7¡àÛ+\x9aànÊ\x92±`Iì\\b\x99"},"maxValues":{"a":4,"b":"g","c":"þ\x7f³´ï\\u0014Ê1\\u000Bï\x81\x83Ü^\x97´$\\u0017\x9a¡\\n\\"\\u000EÈ/\\t\x904\x97¨"},"nullCount":{"a":0,"b":0,"c":33,"_change_type":541}}'}
[2023-11-29T02:56:13.493Z] FAILED ../../src/main/python/delta_lake_delete_test.py::test_delta_delete_dataframe_api[None-False][DATAGEN_SEED=1701225104, INJECT_OOM, IGNORE_ORDER, ALLOW_NON_GPU(DeserializeToObjectExec,ShuffleExchangeExec,FileSourceScanExec,FilterExec,MapPartitionsExec,MapElementsExec,ObjectHashAggregateExec,ProjectExec,SerializeFromObjectExec,SortExec)] - AssertionError: Delta log 00000000000000000001.json is different at key 'add':
[2023-11-29T02:56:13.493Z] CPU: {'path': 'partsnappy.parquet', 'partitionValues': {}, 'dataChange': True, 'stats': '{"numRecords":443,"minValues":{"a":0,"b":"a","c":"\\u0001\\tÛá®oÔhÌ\\u0010yÎCÎô\\u0012\x94³EB\x82>à\\u00062%.o\\u001D¥"},"maxValues":{"a":4,"b":"c","c":"ÿ^±Úq\\u000Bx¼\\u0005ø趽ec\\ntà\x98ø[m̼Óå\\tXE¼"},"nullCount":{"a":0,"b":0,"c":15}}'}
[2023-11-29T02:56:13.493Z] GPU: {'path': 'partsnappy.parquet', 'partitionValues': {}, 'dataChange': True, 'stats': '{"numRecords":447,"minValues":{"a":0,"b":"a","c":"\\u0001saD\\u000FzkÍÄ|¤ó_\x90ÑO¹¸åßÔ\x82|µ~\x91òbVå"},"maxValues":{"a":4,"b":"c","c":"ÿ^±Úq\\u000Bx¼\\u0005ø趽ec\\ntà\x98ø[m̼Óå\\tXE¼"},"nullCount":{"a":0,"b":0,"c":18}}'}
[2023-11-29T02:56:13.493Z] FAILED ../../src/main/python/delta_lake_update_test.py::test_delta_update_rows[None-False][DATAGEN_SEED=1701225104, INJECT_OOM, IGNORE_ORDER, ALLOW_NON_GPU(DeserializeToObjectExec,ShuffleExchangeExec,FileSourceScanExec,FilterExec,MapPartitionsExec,MapElementsExec,ObjectHashAggregateExec,ProjectExec,SerializeFromObjectExec,SortExec)] - AssertionError: Delta log 00000000000000000001.json is different at key 'add':
[2023-11-29T02:56:13.493Z] CPU: {'path': 'partsnappy.parquet', 'partitionValues': {}, 'dataChange': True, 'stats': '{"numRecords":1020,"minValues":{"a":0,"b":"a","c":"\\u0000Ò@\\u0010Ò>-ªUË\\u0017\\u0001ð\\u0007Æ9blðl@æaB\\"|ÉùÐõ"},"maxValues":{"a":4,"b":"g","c":"ÿ) å.>^\\fc\\u0012Gl¥èÞþÎÌ>\\f²Óý¤\x9cÕ\\u0007¸\x99O"},"nullCount":{"a":0,"b":0,"c":21}}'}
[2023-11-29T02:56:13.493Z] GPU: {'path': 'partsnappy.parquet', 'partitionValues': {}, 'dataChange': True, 'stats': '{"numRecords":1020,"minValues":{"a":0,"b":"a","c":"\\u0000Ò@\\u0010Ò>-ªUË\\u0017\\u0001ð\\u0007Æ9blðl@æaB\\"|ÉùÐõ"},"maxValues":{"a":4,"b":"g","c":"þë\\u0014øgb\\u001B\x88¢¢À\\u0006Ù\x9aõf\\u0013m^è³iöÈ?\\b¯v\\u0000Ù"},"nullCount":{"a":0,"b":0,"c":18}}'}
[2023-11-29T02:56:13.494Z] = 3 failed, 276 passed, 47 skipped, 4 xfailed, 40 warnings in 1468.57s (0:24:28) =

mismatched CPU and GPU output:

[2023-11-29T02:56:13.489Z] E           AssertionError: Delta log 00000000000000000002.json is different at key 'add':
[2023-11-29T02:56:13.489Z] E           CPU: {'path': 'partsnappy.parquet', 'partitionValues': {}, 'dataChange': True, 'stats': '{"numRecords":537,"minValues":{"a":0,"b":"d","c":"\\u0000¶\'\x86åJ\\u0000J\x94C\x8cï=\\"7¡àÛ+\x9aànÊ\x92±`Iì\\b\x99"},"maxValues":{"a":4,"b":"g","c":"ÿÆ\\u0012\x93Yп\\u000EwÐ.Ñ\x8a®tîá|\\u0017=8§¦\x98³þ÷\x9bܤ"},"nullCount":{"a":0,"b":0,"c":29,"_change_type":537}}'}
[2023-11-29T02:56:13.489Z] E           GPU: {'path': 'partsnappy.parquet', 'partitionValues': {}, 'dataChange': True, 'stats': '{"numRecords":541,"minValues":{"a":0,"b":"d","c":"\\u0000¶\'\x86åJ\\u0000J\x94C\x8cï=\\"7¡àÛ+\x9aànÊ\x92±`Iì\\b\x99"},"maxValues":{"a":4,"b":"g","c":"þ\x7f³´ï\\u0014Ê1\\u000Bï\x81\x83Ü^\x97´$\\u0017\x9a¡\\n\\"\\u000EÈ/\\t\x904\x97¨"},"nullCount":{"a":0,"b":0,"c":33,"_change_type":541}}'}

Other failing cases (truncated pytest context):

[2023-11-29T02:56:13.488Z] filename = '00000000000000000002.json'
[2023-11-29T02:56:13.488Z] c_json = {'add': {'dataChange': True, 'partitionValues': {}, 'path': 'partsnappy.parquet', 'stats': '{"numRecords":537,"minValu..."ÿÆ\\u0012\x93Yп\\u000EwÐ.Ñ\x8a®tîá|\\u0017=8§¦\x98³þ÷\x9bܤ"},"nullCount":{"a":0,"b":0,"c":29,"_change_type":537}}'}}
[2023-11-29T02:56:13.488Z] g_json = {'add': {'dataChange': True, 'partitionValues': {}, 'path': 'partsnappy.parquet', 'stats': '{"numRecords":541,"minValu...000Bï\x81\x83Ü^\x97´$\\u0017\x9a¡\\n\\"\\u000EÈ/\\t\x904\x97¨"},"nullCount":{"a":0,"b":0,"c":33,"_change_type":541}}'}}

[2023-11-29T02:56:13.490Z] filename = '00000000000000000001.json'
[2023-11-29T02:56:13.490Z] c_json = {'add': {'dataChange': True, 'partitionValues': {}, 'path': 'partsnappy.parquet', 'stats': '{"numRecords":443,"minValu...maxValues":{"a":4,"b":"c","c":"ÿ^±Úq\\u000Bx¼\\u0005ø趽ec\\ntà\x98ø[m̼Óå\\tXE¼"},"nullCount":{"a":0,"b":0,"c":15}}'}}
[2023-11-29T02:56:13.490Z] g_json = {'add': {'dataChange': True, 'partitionValues': {}, 'path': 'partsnappy.parquet', 'stats': '{"numRecords":447,"minValu...maxValues":{"a":4,"b":"c","c":"ÿ^±Úq\\u000Bx¼\\u0005ø趽ec\\ntà\x98ø[m̼Óå\\tXE¼"},"nullCount":{"a":0,"b":0,"c":18}}'}}

[2023-11-29T02:56:13.492Z] filename = '00000000000000000001.json'
[2023-11-29T02:56:13.492Z] c_json = {'add': {'dataChange': True, 'partitionValues': {}, 'path': 'partsnappy.parquet', 'stats': '{"numRecords":1020,"minVal...Values":{"a":4,"b":"g","c":"ÿ) å.>^\\fc\\u0012Gl¥èÞþÎÌ>\\f²Óý¤\x9cÕ\\u0007¸\x99O"},"nullCount":{"a":0,"b":0,"c":21}}'}}
[2023-11-29T02:56:13.492Z] g_json = {'add': {'dataChange': True, 'partitionValues': {}, 'path': 'partsnappy.parquet', 'stats': '{"numRecords":1020,"minVal...b":"g","c":"þë\\u0014øgb\\u001B\x88¢¢À\\u0006Ù\x9aõf\\u0013m^è³iöÈ?\\b¯v\\u0000Ù"},"nullCount":{"a":0,"b":0,"c":18}}'}}


jlowe commented 9 months ago

I cannot reproduce this, and AFAIK it has not been seen since it occurred. My guess is that the data was somehow spread across tasks differently between the two runs, since the table data was verified to be correct (i.e., all rows were accounted for, ignoring ordering) before the subsequent metadata equality check failed.
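To make that concrete, here is a minimal, hypothetical sketch of the two-stage check described above (this is not the actual integration-test code; the helper names are invented): the row-level comparison ignores ordering, while the Delta log comparison is exact, so a different row-to-file split can pass the first stage and fail only the second.

```python
import json

# Hypothetical two-stage comparison, roughly mirroring the behavior described above.
def rows_match_ignoring_order(cpu_rows, gpu_rows):
    # Table contents are compared with ordering ignored, so a different row
    # order across tasks/files does not fail this stage.
    return sorted(map(repr, cpu_rows)) == sorted(map(repr, gpu_rows))

def delta_logs_match(cpu_log_lines, gpu_log_lines):
    # The 'add' actions embed per-file stats (numRecords, min/max, nullCount),
    # so assigning rows to files differently changes this JSON even though the
    # table as a whole is identical.
    return [json.loads(line) for line in cpu_log_lines] == \
           [json.loads(line) for line in gpu_log_lines]
```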

jlowe commented 9 months ago

I found a way to reliably reproduce this:

SPARK_SUBMIT_FLAGS="--master local[2]" TEST_PARALLEL=0 SPARK_HOME=/home/jlowe/spark-3.2.1-bin-hadoop3.2/ DATAGEN_SEED=1702052986 PYSP_TEST_spark_jars_packages=io.delta:delta-core_2.12:2.0.1 PYSP_TEST_spark_sql_extensions=io.delta.sql.DeltaSparkSessionExtension PYSP_TEST_spark_sql_catalog_spark__catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog integration_tests/run_pyspark_from_build.sh -k "test_delta_delete_dataframe_api and None-True" --delta_lake --debug_tmp_path

As suspected, the table contents as a whole are correct, but some of the rows have been swizzled between the two output files when comparing the CPU and GPU runs. AFAICT nothing is actually wrong semantically with the output produced by the GPU relative to the CPU, but it would be good to understand how we reliably end up with the rows crossed here. I suspect that for some odd reason the GPU run is getting a different assignment of input files to tasks than the CPU run does.

jlowe commented 9 months ago

Debugged why this is failing for certain datagen seeds. The problem can occur when a particular datagen seed causes two or more Parquet files within a table to be generated with the same file size. ext4 and most other Linux filesystems will return a directory listing in an order influenced by the order the files are written to the directory, and it's not deterministic which files will be written before other files as tasks execute in parallel. The input files are sorted in descending order by file size, but when two or more files are the same size, the sorted ordering can be different between two directories that contain the same files but were created in a different order.
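A tiny Python sketch of the tie problem just described (the file names and sizes are made up; this is not plugin code): a descending sort by size is stable, so equal-sized files keep whatever order the directory listing returned them in, and that listing order depends on creation order.

```python
# Two directory listings of the same files, returned in different orders
# because the files were created in a different order.
listing_run_a = [("part-0001.parquet", 4096), ("part-0002.parquet", 4096)]
listing_run_b = [("part-0002.parquet", 4096), ("part-0001.parquet", 4096)]

def plan_order(listing):
    # Same comparator in both runs, but the sort is stable, so files with
    # equal sizes stay in listing order.
    return [name for name, size in sorted(listing, key=lambda f: f[1], reverse=True)]

print(plan_order(listing_run_a))  # ['part-0001.parquet', 'part-0002.parquet']
print(plan_order(listing_run_b))  # ['part-0002.parquet', 'part-0001.parquet']
```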

I do not see a way to really fix this other than to use a datagen seed that is known to produce files that can be deterministically sorted, as was done in #10009, or to change the test so it does not compare metadata at all. The latter would allow a lot of subtle bugs to slip through, so I think using a fixed datagen seed is the better route.
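For contrast, a companion sketch (again with invented file names; not plugin or test code) of why a seed that yields distinct file sizes sidesteps the issue: with no ties, the descending size sort produces the same order regardless of how the directory was listed.

```python
# Same files with distinct sizes, listed in different orders across two runs.
listing_run_a = [("part-0001.parquet", 5120), ("part-0002.parquet", 4096)]
listing_run_b = [("part-0002.parquet", 4096), ("part-0001.parquet", 5120)]

def plan_order(listing):
    # Distinct sizes make the descending sort unambiguous.
    return [name for name, size in sorted(listing, key=lambda f: f[1], reverse=True)]

assert plan_order(listing_run_a) == plan_order(listing_run_b)
```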

Closing this bug, since setting a fixed seed as done in #10009 is the long-term solution.