Open mythrocks opened 11 months ago
Duplicated by #9778 which has some details on why this fails.
Thank you, @jlowe. Looks like another xfail case. There are similar xfails in the tests already, pertaining to dataframe conversions through Pandas.
The xfail is being handled in #9677. From issue #9778, Jason pointed out:
On Databricks 13.3, nulls in the Pandas DataFrame (represented as NaNs) are being honored as nulls in the resulting Spark DataFrame when converting a Pandas DataFrame to a Spark DataFrame. Pandas thinks there are nulls in the data, and those nulls are propagating to the Spark DataFrame.
fastparquet loads the NaNs properly, but then when converting the data to pandas, pandas thinks the NaN values are null. This, in turn, causes spark.createDataFrame to produce corresponding nulls. When comparing this to the GPU direct load of the data that contains NaNs (not nulls), the test fails. The problem is not in the way the GPU loads the data, it's the way the NaNs get converted into nulls due to sending the data through pandas before converting to a Spark DataFrame.
Basically he is asking to fix the test case so NaNs do not become nulls when passing through Pandas.
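To make the failure mode concrete, here is a minimal sketch in plain Python. It does not use pandas, Spark, or fastparquet. `pandas_like_roundtrip` is a hypothetical stand-in for a conversion layer that treats NaN as missing, and `columns_match` is an assumed comparison helper, not the one used by the integration tests.

```python
import math


def pandas_like_roundtrip(values):
    """Hypothetical stand-in for a conversion that treats NaN as a missing value."""
    return [None if isinstance(v, float) and math.isnan(v) else v for v in values]


def same_value(a, b):
    """Compare two cells: NaN matches NaN, null matches null, NaN does not match null."""
    if a is None or b is None:
        return a is None and b is None
    if isinstance(a, float) and isinstance(b, float) and math.isnan(a) and math.isnan(b):
        return True
    return a == b


def columns_match(left, right):
    """Assumed helper: element-wise comparison of two columns."""
    return len(left) == len(right) and all(same_value(a, b) for a, b in zip(left, right))


# A direct read keeps NaN and null distinct.
direct_read = [1.5, float("nan"), None, 2.0]
# Routing the same data through the NaN-as-missing layer collapses NaN into null.
via_roundtrip = pandas_like_roundtrip(direct_read)

print(via_roundtrip)                               # [1.5, None, None, 2.0]
print(columns_match(direct_read, direct_read))     # True
print(columns_match(direct_read, via_roundtrip))   # False
```

The mismatch comes entirely from the conversion step, which is consistent with the point that the GPU load itself is not at fault.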
> The xfail is being handled in #9677.
Do we mean #9776? Then, no, it's not. #9776 is a different xfail case, for timestamps. This current issue was for floating-point failures.
> so NaNs do not become nulls when passing through Pandas...
I cannot do that in the short term. The shortest path to constructing a Spark dataframe from what's read by fastparquet is to route it through Pandas. And Pandas does not distinguish between NaN and null values.
This will go into the already long list of fastparquet/Pandas/Spark incompatible xfail conditions. We can revisit routing around Pandas when the higher priority tasks are sorted.
> Do we mean #9776? Then, no, it's not. #9776 is a different xfail case, for timestamps. This current issue was for floating-point failures.
Ah, I just spoke with @jlowe, and looked more closely at #9677. I understand now: @jlowe has that xfailed out already.
From a Databricks premerge build:
This seems similar to the other DATAGEN_SEED-related failures in the pytests.