pxLi closed this issue 9 months ago
I cannot reproduce this, and AFAIK it has not been seen since it occurred. My guess is that somehow the data was spread across tasks differently between the two runs, since the data itself was verified to be correct (i.e., all rows were accounted for, ignoring ordering) before the subsequent metadata equality check failed.
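As a rough illustration of the kind of check described above (this is a hypothetical sketch, not the actual test code), two result sets can be compared while ignoring row order by counting row occurrences:

```python
from collections import Counter

# Hypothetical sketch of an order-insensitive row comparison: two result
# sets are "equal" if every row appears the same number of times,
# regardless of which task or output file produced it, or in what order.
def same_rows_ignoring_order(rows_a, rows_b):
    return Counter(rows_a) == Counter(rows_b)

cpu_rows = [(1, "a"), (2, "b"), (3, "c")]
gpu_rows = [(3, "c"), (1, "a"), (2, "b")]  # same rows, different order

print(same_rows_ignoring_order(cpu_rows, gpu_rows))  # True
```

Such a check passes even when rows land in different files or tasks, which is why only the later metadata comparison caught the difference.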
I found a way to reliably reproduce this:
```shell
SPARK_SUBMIT_FLAGS="--master local[2]" \
TEST_PARALLEL=0 \
SPARK_HOME=/home/jlowe/spark-3.2.1-bin-hadoop3.2/ \
DATAGEN_SEED=1702052986 \
PYSP_TEST_spark_jars_packages=io.delta:delta-core_2.12:2.0.1 \
PYSP_TEST_spark_sql_extensions=io.delta.sql.DeltaSparkSessionExtension \
PYSP_TEST_spark_sql_catalog_spark__catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
integration_tests/run_pyspark_from_build.sh -k "test_delta_delete_dataframe_api and None-True" --delta_lake --debug_tmp_path
```
As suspected, the table contents as a whole are correct, but somehow some of the rows have been swizzled between the two output files across the CPU and GPU runs. AFAICT nothing is actually wrong semantically with the output produced by the GPU relative to the CPU, but it would be good to understand how we're reliably getting the rows crossed here. I suspect that for some odd reason the GPU run is assigning a different set of input files to the tasks than the CPU run does.
Debugged why this is failing for certain datagen seeds. The problem can occur when a particular datagen seed causes two or more Parquet files within a table to be generated with the same file size. ext4 and most other Linux filesystems return a directory listing in an order influenced by the order in which the files were written to the directory, and it is not deterministic which files get written first when tasks execute in parallel. The input files are sorted in descending order by file size, but when two or more files have the same size, the sorted ordering can differ between two directories that contain the same files but were created in a different order.
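The tie-breaking behavior described above can be demonstrated with a small sketch (hypothetical file names and sizes, not the actual plugin code). Because a stable sort preserves the incoming order among equal keys, sorting by size alone inherits whatever order the directory listing happened to return:

```python
# Two listings of the SAME files, returned in different orders because the
# files were written to their directories in a different order.
cpu_listing = [("part-00000.parquet", 512), ("part-00001.parquet", 512)]
gpu_listing = [("part-00001.parquet", 512), ("part-00000.parquet", 512)]

def sort_by_size_desc(files):
    # Python's sort is stable, so files with equal sizes keep the
    # (non-deterministic) order they had in the directory listing.
    return sorted(files, key=lambda f: f[1], reverse=True)

print(sort_by_size_desc(cpu_listing)[0][0])  # part-00000.parquet
print(sort_by_size_desc(gpu_listing)[0][0])  # part-00001.parquet
```

Both results are "correctly" sorted by descending size, yet the two runs see the equal-sized files in different positions, which is enough to route different rows to different output files.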
I do not see a way to really fix this other than using a datagen seed known to produce files that can be deterministically sorted, as was done in #10009, or changing the test to not compare metadata. The latter would allow a lot of subtle bugs to slip through, so I think using a fixed datagen seed is the better route.
Closing this bug as setting a fixed seed as done by #10009 is the long-term solution.
**Describe the bug**
First seen in rapids_integration-dev-github, build ID 863 (jdk11 runtime + spark 330).

Mismatched CPU and GPU output: