Open hanslovsky opened 4 days ago
FWIW, numpy agrees with pandas:
In [59]: np.median(data_pd.drop(columns=["md"]).to_numpy(), axis=0)
Out[59]:
array([ 6.68248565e-04, -8.03345326e-06, -2.78480562e-03, 2.64127024e-04,
-1.65822133e-04, 3.93307614e-03])
In [60]: np.median(data_fd.drop(columns=["md"]).to_numpy(), axis=0)
Out[60]:
array([ 6.68248565e-04, -8.03345326e-06, -2.78480562e-03, 2.64127024e-04,
-1.65822133e-04, 3.93307614e-03])
Hi @hanslovsky,
Thank you very much for reporting the issue.
median() in FireDucks currently relies on Arrow's approximate_median() for better performance, which is why the results differ from pandas.
As a workaround, you can call df.quantile(0.5) instead to get results consistent with pandas/numpy:
df = make_data().drop(columns=["md"])
print(df.median())
print(df.quantile())
Output with FireDucks (median() first, then quantile()):
0 0.001122
1 0.000344
2 -0.003104
3 0.001529
4 0.000973
5 0.004693
dtype: float64
0 0.000668
1 -0.000008
2 -0.002785
3 0.000264
4 -0.000166
5 0.003933
Name: 0.5, dtype: float64
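For reference, the equivalence between the exact median and the 0.5 quantile can be checked in plain pandas. This is only a minimal sketch: the seeded random frame below is a hypothetical stand-in for make_data(), which is not shown in this thread, so the numbers will not match the output above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for make_data(); the real data is not shown in the thread.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(scale=0.01, size=(1001, 6)))

# Exact 0.5 quantile per column (pandas default: linear interpolation).
exact = df.quantile(0.5)

# Column-wise median computed by NumPy, as in the comparison earlier in the thread.
reference = np.median(df.to_numpy(), axis=0)

# Both checks compare against the NumPy reference.
print(np.allclose(exact.to_numpy(), reference))
print(np.allclose(df.median().to_numpy(), reference))
```

With FireDucks, the assumption is that the first check holds as well, since quantile() is the exact path per the explanation above, while median() may not match until the approximate implementation is replaced.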
We will definitely consider fixing this. Thank you once again for reporting it.
Thanks and Regards, Sourav Saha
When running DataFrame.median I get results that are inconsistent with pandas. Here is an example in ipython:
Versions are fireducks==1.1.0 and pandas==2.2.2. If you see anything that I might be doing wrong, please let me know. Otherwise, this may be a bug.