NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

[BUG] `fastparquet` test fails with `DATAGEN_SEED=1700171382` on Databricks (Spark 3.4.1) #9767

Open mythrocks opened 11 months ago

mythrocks commented 11 months ago

From a Databricks premerge build:

[2023-11-17T00:19:56.484Z] FAILED ../../src/main/python/fastparquet_compatibility_test.py::test_reading_file_written_by_spark_cpu[Float(not_null)][DATAGEN_SEED=1700171382] - AssertionError: GPU and CPU are not both null at [76, 'a']

[2023-11-17T00:19:56.484Z] FAILED ../../src/main/python/fastparquet_compatibility_test.py::test_reading_file_written_by_spark_cpu[Double(not_null)][DATAGEN_SEED=1700171382] - AssertionError: GPU and CPU are not both null at [12, 'a']

[2023-11-17T00:19:56.484Z] Starting with datagen test seed: 1700171382. Set env variable SPARK_RAPIDS_TEST_DATAGEN_SEED to override.
[2023-11-17T00:19:56.484Z] Starting with OOM injection seed: 1700171382. Set env variable SPARK_RAPIDS_TEST_INJECT_OOM_SEED to override.
[2023-11-17T00:19:56.484Z] 2023-11-16 21:49:42 INFO     Executing global initialization tasks before test launches
[2023-11-17T00:19:56.484Z] 2023-11-16 21:49:42 INFO     Creating directory /home/ubuntu/spark-rapids/integration_tests/target/run_dir-20231116214942-eeD8/hive with permissions 0o777
[2023-11-17T00:19:56.484Z] 2023-11-16 21:49:42 INFO     Skipping findspark init because on xdist master
[2023-11-17T00:19:56.484Z] FAILED ../../src/main/python/fastparquet_compatibility_test.py::test_reading_file_written_by_spark_cpu[Struct(not_null)(('first', Integer(not_null)),('second', Float(not_null)))][DATAGEN_SEED=1700171382, INJECT_OOM] - AssertionError: GPU and CPU are not both null at [37, 'a.second']
[2023-11-17T00:19:56.484Z] FAILED ../../src/main/python/fastparquet_compatibility_test.py::test_reading_file_written_with_gpu[Float(not_null)][DATAGEN_SEED=1700171382] - AssertionError: GPU and CPU are not both null at [38, 'a']
[2023-11-17T00:19:56.484Z] FAILED ../../src/main/python/fastparquet_compatibility_test.py::test_reading_file_written_with_gpu[Double(not_null)][DATAGEN_SEED=1700171382, INJECT_OOM] - AssertionError: GPU and CPU are not both null at [57, 'a']
[2023-11-17T00:19:56.484Z] = 5 failed, 19923 passed, 1045 skipped, 624 xfailed, 302 xpassed, 414 warnings in 9013.84s (2:30:13) =

This seems similar to the other DATAGEN_SEED-related failures in the pytests.

jlowe commented 10 months ago

Duplicated by #9778 which has some details on why this fails.

mythrocks commented 10 months ago

Thank you, @jlowe. Looks like another xfail case. There are similar xfails in the tests already, pertaining to DataFrame conversions through Pandas.

sameerz commented 10 months ago

The xfail is being handled in #9677. In issue #9778, Jason pointed out:

On Databricks 13.3, when a Pandas DataFrame is converted to a Spark DataFrame, values that Pandas regards as nulls (represented as NaNs) are honored as nulls in the resulting Spark DataFrame. Pandas thinks there are nulls in the data, and those nulls propagate into the Spark DataFrame.

fastparquet loads the NaNs properly, but when the data is converted to pandas, pandas treats the NaN values as nulls. This, in turn, causes spark.createDataFrame to produce corresponding nulls. When this is compared to the GPU's direct load of the data, which contains NaNs (not nulls), the test fails. The problem is not in the way the GPU loads the data; it is that the NaNs get converted into nulls because the data is sent through pandas before becoming a Spark DataFrame.

Basically he is asking to fix the test case so NaNs do not become nulls when passing through Pandas.
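The conflation described above can be reproduced with pandas alone, no Spark or fastparquet required. In a `float64` Series, `None` is coerced to NaN on construction, so by the time `spark.createDataFrame` sees the data there is no way to tell a genuine NaN from a null. A minimal sketch:

```python
import numpy as np
import pandas as pd

# In a float64 Series, None is coerced to NaN on construction,
# so a "real" NaN and a null become indistinguishable.
s = pd.Series([1.5, np.nan, None], dtype="float64")

print(s.isna().tolist())   # positions 1 and 2 both read as missing
print(np.isnan(s.iloc[2])) # the former None is now literally NaN
```

This is why any consumer downstream of pandas (including `spark.createDataFrame`) sees nulls where the Parquet file held NaNs.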

mythrocks commented 10 months ago

> The xfail is being handled in #9677.

Do we mean #9776? Then, no, it's not. #9776 is a different xfail case, for timestamps. This current issue was for floating-point failures.

> so NaNs do not become nulls when passing through Pandas...

I cannot do that in the short term. The shortest path to constructing a Spark dataframe from what's read by fastparquet is to route it through Pandas. And Pandas does not distinguish between NaN and null values.

This will go into the already long list of fastparquet/Pandas/Spark incompatibility xfail conditions. We can revisit routing around Pandas once the higher-priority tasks are sorted.
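To make the failure mode concrete, here is a sketch of the kind of check behind the "GPU and CPU are not both null" assertion. `both_null_or_equal` is a hypothetical helper written for illustration, not the actual integration-test code: the CPU-side value has been collapsed to null by the trip through Pandas, while the GPU's direct load preserves the NaN, so the comparison fails.

```python
import math

def both_null_or_equal(cpu_val, gpu_val):
    # Hypothetical sketch of the test's per-cell comparison:
    # values must be both null, or equal (with NaN matching NaN).
    cpu_null, gpu_null = cpu_val is None, gpu_val is None
    if cpu_null or gpu_null:
        return cpu_null and gpu_null
    if math.isnan(cpu_val) and math.isnan(gpu_val):
        return True
    return cpu_val == gpu_val

# CPU path went through pandas: the NaN was converted to null.
# GPU path read the Parquet file directly: the value is still NaN.
cpu_val = None
gpu_val = float("nan")
print(both_null_or_equal(cpu_val, gpu_val))  # False -> assertion fires
```

Routing the CPU side around Pandas (or preserving the NaN/null distinction through the conversion) would make the two sides agree again.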

mythrocks commented 10 months ago

> Do we mean #9776? Then, no, it's not. #9776 is a different xfail case, for timestamps. This current issue was for floating-point failures.

Ah, I just spoke with @jlowe, and looked more closely at #9677. I understand now: @jlowe has that xfailed out already.