Closed by GaryShen2008 5 months ago
I can reproduce this locally via:

```sh
SPARK_SUBMIT_FLAGS="--master local[8]" \
DATAGEN_SEED=1707683137 \
TEST_PARALLEL=0 \
SPARK_HOME=~/spark-3.3.3-bin-hadoop3/ \
PYSP_TEST_spark_jars_packages=io.delta:delta-core_2.12:2.3.0 \
PYSP_TEST_spark_sql_extensions=io.delta.sql.DeltaSparkSessionExtension \
PYSP_TEST_spark_sql_catalog_spark__catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
integration_tests/run_pyspark_from_build.sh -k "test_delta_update_partitions and False" --delta_lake
```
I looked into the issue, and it is known non-deterministic behavior in the sort used for partitioned writes. In this case the CPU run sampled and split the data such that one null row went to task A, while in the GPU run that null row went to task B. The output is semantically equivalent in the resulting table, but the file-level metadata does not line up because the individual files do not all have matching row counts.
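To make the distinction concrete, here is a minimal PySpark sketch (the paths are hypothetical stand-ins, not anything from the test) of content-level vs file-level comparison:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical paths standing in for the CPU- and GPU-written copies of the
# same updated table.
cpu = spark.read.format("delta").load("/tmp/delta_cpu")
gpu = spark.read.format("delta").load("/tmp/delta_gpu")

# Content-level comparison: passes, the two tables contain the same rows.
assert cpu.exceptAll(gpu).count() == 0
assert gpu.exceptAll(cpu).count() == 0

# File-level metadata (e.g. rows per file) is what tripped the test, and it
# need not match: the CPU and GPU writes sample split points independently,
# so a boundary row (here, a null) can land in different output files.
```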
Will update the test to pin the data seed.
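For reference, a sketch of what the pinning could look like with the integration suite's `datagen_overrides` marker (the seed value, reason URL, and test body below are placeholders, not the actual change):

```python
from marks import datagen_overrides  # marker from the spark-rapids integration test framework

# Placeholder sketch: a pinned seed keeps the generated input identical from
# run to run, so the sampled split points (and per-file row counts) stop
# varying for this test.
@datagen_overrides(seed=0, reason='https://github.com/NVIDIA/spark-rapids/issues/...')
def test_delta_update_partitions():
    ...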
Describe the bug
`delta_lake_update_test.py::test_delta_update_partitions[['a', 'b']-False][DATAGEN_SEED=1707683137]` fails.
Steps/Code to reproduce bug
Run the integration tests on Spark 3.3.1 with DATAGEN_SEED=1707683137.
Expected behavior
The test case should succeed.