Open samdyzon opened 3 years ago
Thanks for the report, @samdyzon !
Let me take a look at this one, but the fix for this bug may land in the pyspark.pandas
package in Apache Spark after the Apache Spark 3.2 release.
This is because we're now porting Koalas into PySpark; please refer to the SPIP: Support pandas API layer on PySpark for more detail.
The Koalas implementation of DataFrameGroupBy.describe() does not return the same results as the pandas implementation: specifically, it does not transparently exclude NaN values the way pandas does.
databricks.koalas.groupby.DataFrameGroupBy.describe - Koalas 1.7.0 documentation
Test Driver
Given a Parquet file with columns "Domain" and "Measurement": "Domain" holds categorical values, and "Measurement" is a series of continuous, double-precision floating-point values. There are NaN values throughout the "Measurement" column, and there is one "Domain" value for which all "Measurement" values are NaN.
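The reporter's actual data file isn't attached; a minimal pandas construction of data matching this description (the domain names and values are illustrative, not the reporter's) might look like:

```python
import numpy as np
import pandas as pd

# Illustrative data matching the description: NaNs scattered through
# "Measurement", and one "Domain" value ("Invalid") that is all-NaN.
df = pd.DataFrame({
    "Domain": ["A", "A", "A", "B", "B", "Invalid", "Invalid"],
    "Measurement": [1.0, 2.0, np.nan, 3.0, 4.0, np.nan, np.nan],
})
# df.to_parquet("measurements.parquet")  # writing requires pyarrow or fastparquet
```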
Code executed
Actual Output
Expected Output (as seen in Pandas)
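For reference, the pandas behavior being described can be reproduced with illustrative data (the original snippet isn't included): NaNs are excluded per group, and the all-NaN group still appears in the output with a count of zero.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Domain": ["A", "A", "A", "B", "B", "Invalid", "Invalid"],
    "Measurement": [1.0, 2.0, np.nan, 3.0, 4.0, np.nan, np.nan],
})

desc = df.groupby("Domain")["Measurement"].describe()
# NaNs are dropped per group: "A" has count 2.0 (not 3), and the
# all-NaN "Invalid" group is still present with count 0.0 and NaN stats.
print(desc[["count", "mean", "min", "max"]])
```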
Solution (not ideal)
To get summary statistics from Koalas that match the pandas result, one must change the execution code to:
Which returns:
Notice that the "Invalid" domain is no longer included in the results, which differs significantly from the pandas output.
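The workaround snippet itself isn't reproduced above; in pandas terms it amounts to dropping NaN rows before grouping (a sketch with illustrative data, not the reporter's exact code):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Domain": ["A", "A", "A", "B", "B", "Invalid", "Invalid"],
    "Measurement": [1.0, 2.0, np.nan, 3.0, 4.0, np.nan, np.nan],
})

# Dropping NaN rows before grouping sidesteps the bug, but any group
# whose values are all NaN ("Invalid") vanishes from the result
# entirely, instead of appearing with count 0.0 as pandas reports it.
desc = df.dropna(subset=["Measurement"]).groupby("Domain")["Measurement"].describe()
```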
Update: this solution is not ideal, because the user may want to run the describe method over multiple grouped columns. If there is another column, "Density", the following does not behave the same way:
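With illustrative data, the problem with extending the dropna workaround to several columns is that a row holding a valid "Density" but a NaN "Measurement" is discarded entirely, so the "Density" statistics are distorted as well:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Domain": ["A", "A", "B"],
    "Measurement": [1.0, np.nan, 3.0],
    "Density": [0.5, 0.6, np.nan],
})

# Dropping rows with a NaN in *either* column throws away row 1's valid
# Density (0.6) and all of group "B", so the stats no longer match what
# pandas computes by excluding NaNs per column.
strict = (df.dropna(subset=["Measurement", "Density"])
            .groupby("Domain")[["Measurement", "Density"]]
            .describe())
```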
Why I created an issue
The documentation explains that the method will:
However, the method fails to handle NaN values, returning summary statistics that are all NaN (except for min values, which seem to work just fine?).