fireducks-dev / fireducks

Median inconsistent with pandas #31

Open hanslovsky opened 4 days ago

hanslovsky commented 4 days ago

When running DataFrame.median I get results that are inconsistent with pandas. Here is an example in ipython:

In [48]: import pandas as pd

In [49]: import numpy as np

In [50]: import fireducks.pandas as fd

In [51]: def make_data(data_frame: type[pd.DataFrame | fd.DataFrame]) -> pd.DataFrame | fd.DataFrame:
    ...:     rng = np.random.default_rng(42)
    ...:     arr = rng.normal(size=(500_000, 6))
    ...:     df = data_frame(arr)
    ...:     return df.assign(md=np.arange(arr.shape[0]) % 7)
    ...: 

In [52]: data_pd = make_data(pd.DataFrame)

In [53]: data_fd = make_data(fd.DataFrame)

In [54]: data_fd.to_pandas().equals(data_pd)
Out[54]: True

In [55]: data_fd.drop(columns=["md"]).median()
Out[55]: 
0    0.001122
1    0.000344
2   -0.003104
3    0.001529
4    0.000973
5    0.004693
dtype: float64

In [56]: data_pd.drop(columns=["md"]).median()
Out[56]: 
0    0.000668
1   -0.000008
2   -0.002785
3    0.000264
4   -0.000166
5    0.003933
dtype: float64

In [58]: !pip freeze | grep -E 'fireducks|pandas'
fireducks==1.1.0
pandas==2.2.2

Versions are fireducks==1.1.0 and pandas==2.2.2. If you see anything that I might be doing wrong, please let me know. Otherwise, this may be a bug.

hanslovsky commented 4 days ago

FWIW, numpy agrees with pandas:

In [59]: np.median(data_pd.drop(columns=["md"]).to_numpy(), axis=0)
Out[59]: 
array([ 6.68248565e-04, -8.03345326e-06, -2.78480562e-03,  2.64127024e-04,
       -1.65822133e-04,  3.93307614e-03])

In [60]: np.median(data_fd.drop(columns=["md"]).to_numpy(), axis=0)
Out[60]: 
array([ 6.68248565e-04, -8.03345326e-06, -2.78480562e-03,  2.64127024e-04,
       -1.65822133e-04,  3.93307614e-03])

qsourav commented 4 days ago

Hi @hanslovsky,

Thank you very much for reporting the issue.

median() in FireDucks currently relies on Arrow's approximate_median() for better performance, which is why the results differ from pandas.

As a workaround, you can call df.quantile(0.5) instead, which produces the same result as pandas/numpy.

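# make_data() takes the DataFrame class as its argument
df = make_data(fd.DataFrame).drop(columns=["md"])
print(df.median())       # approximate median
print(df.quantile(0.5))  # exact median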

Output in case of FireDucks:

0    0.001122
1    0.000344
2   -0.003104
3    0.001529
4    0.000973
5    0.004693
dtype: float64

0    0.000668
1   -0.000008
2   -0.002785
3    0.000264
4   -0.000166
5    0.003933
Name: 0.5, dtype: float64

We will definitely consider fixing this issue. Thank you once again for reporting it.

Thanks and Regards, Sourav Saha