databricks / koalas

Koalas: pandas API on Apache Spark
Apache License 2.0
3.33k stars 356 forks source link

koalas.DataFrameGroupBy.describe() does not handle NaN values the same as pandas.DataFrameGroupBy.describe() #2154

Open samdyzon opened 3 years ago

samdyzon commented 3 years ago

The koalas implementation of DataFrameGroupBy.describe() does not return the same results as the Pandas implementation, specifically in that it does not exclude NaN values transparently in the same way that Pandas does.

databricks.koalas.groupby.DataFrameGroupBy.describe - Koalas 1.7.0 documentation

Test Driver

Given a parquet file with columns "Domain" and "Measurement". "Domain" is a categorical value, and "Measurement" is a series of continuous, double-precision floating point values. There are NaN values throughout the "Measurement" column, and there is one value of "Domain" where all "Measurements" are NaN.

Code executed

import databricks.koalas as ks

df = ks.read_parquet("data.parquet")
summary_data = df.groupby("Domain").describe()

Actual Output

            Measurement                                 
            count mean std     min 25% 50% 75% max
Domain                                                    
D1          124005.0  NaN NaN  0.5303 NaN NaN NaN NaN
D2          832179.0  NaN NaN  1.3449 NaN NaN NaN NaN
D3          264230.0  NaN NaN  0.9352 NaN NaN NaN NaN
D4          693275.0  NaN NaN  0.5718 NaN NaN NaN NaN
Invalid     4007714.0  NaN NaN     NaN NaN NaN NaN NaN
D5          468149.0  NaN NaN  0.0100 NaN NaN NaN NaN

Expected Output (as seen in Pandas)

import pandas as pd

df = pd.read_parquet("data.parquet")
summary_data = df.groupby("Domain").describe()
            Measurement                                                                     
            count    mean       std       min     25%        50%      75%        max
Domain                                                                                      
D4          11096.0  14.944859  5.194000  0.5718  11.652225  14.8970  18.546200  34.6043
D2          4768.0   11.521119  3.037552  1.3449   9.455400  11.5147  13.542450  28.0213
D3          2783.0   6.001364   2.713065  0.9352   4.605550   5.4178   6.691400  28.7833
Invalid     0.0      NaN        NaN       NaN      NaN        NaN       NaN      NaN
D5          15694.0  9.628944   2.339065  0.0100   8.001475   9.4183  11.010425  24.1078
D1          960.0    13.414239  6.563915  0.5303   7.556875  13.9978  18.599200  28.1761

Solution (not ideal)

In order to get a set of summary statistics from koalas that are the same as the result from pandas, one must change the execution code to:

summary = df[df["Measurement"].notnull()].groupby("Domain").describe()

Which returns:

            Measurement                                                                 
            count       mean       std     min      25%      50%      75%      max
Domain                                                                                  
D1          960.0  13.414239  6.563915  0.5303   7.5145  13.9289  18.5992  28.1761
D2          4768.0  11.521119  3.037552  1.3449   9.4530  11.5125  13.5417  28.0213
D3          2783.0   6.001364  2.713065  0.9352   4.6053   5.4178   6.6928  28.7833
D4          11096.0  14.944859  5.194000  0.5718  11.6511  14.8966  18.5459  34.6043
D5         15694.0   9.628944  2.339065  0.0100   8.0008   9.4183  11.0102  24.1078

Notice how the "Invalid" domain is no longer included in the results, which is a very different output compared to the Pandas output.

Update: This solution is not ideal, due to the fact that the user may want to run the describe method over multiple grouped fields. If there is another column, called "Density", the following does not work the same:

fields = ["Domain", "Density", "Measurement"]
summary = df[fields][df[fields].notnull()].groupby("Domain").describe()

print(summary)

            Density                                                         Measurement                                
            count      mean       std  min    25%    50%   75%   max        count mean std     min 25% 50% 75% max
Domain                                                                                                                  
D1          124005.0  2.587412  0.161539  1.9  2.515  2.720  2.72  2.72     124005.0  NaN NaN  0.5303 NaN NaN NaN NaN
D2          832179.0  2.546347  0.149968  1.9  2.515  2.515  2.72  2.72     832179.0  NaN NaN  1.3449 NaN NaN NaN NaN
D3          264230.0  2.590682  0.162684  1.9  2.515  2.720  2.72  2.72     264230.0  NaN NaN  0.9352 NaN NaN NaN NaN
D4          693275.0  2.555182  0.165583  1.9  2.515  2.515  2.72  2.72     693275.0  NaN NaN  0.5718 NaN NaN NaN NaN
Invalid     4007714.0 2.555750 0.141306   1.9  2.515  2.515  2.72  2.72     4007714.0  NaN NaN     NaN NaN NaN NaN NaN
D5          468149.0  2.586682  0.185971  1.9  2.515  2.720  2.72  2.72     468149.0  NaN NaN  0.0100 NaN NaN NaN NaN

Why I created an issue

The documentation explains that the method will:

Generate descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values

However, the method fails to handle the case of NaN values by returning summary statistics that are all NaN's (except for min values which seem to work just fine?).

itholic commented 3 years ago

Thanks for the report, @samdyzon !

Let me take a look at this one, but maybe the changes to this bug will be applied through the pyspark.pandas package in the Apache Spark, after the Apache Spark 3.2 release.

This is because now we're porting the Koalas into PySpark, and please refer to the SPIP: Support pandas API layer on PySpark for more detail.