duckdblabs / db-benchmark

reproducible benchmark of database-like ops
https://duckdblabs.github.io/db-benchmark/
Mozilla Public License 2.0

Improve pandas groupby corr benchmark #37

Closed mroeschke closed 9 months ago

mroeschke commented 9 months ago

Hello!

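I was reviewing the pandas groupby benchmarks and noticed that the groupby correlation benchmark could be made more efficient by dropping the pd.Series call inside apply and renaming the result column afterwards instead. Calling Series.corr on the two columns also avoids building the full correlation matrix for every group: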

In [1]: import warnings; import pandas as pd; import numpy as np

In [2]: warnings.filterwarnings("ignore", category=FutureWarning)

In [3]: warnings.filterwarnings("ignore", category=RuntimeWarning)

In [4]: pd.__version__
Out[4]: '2.1.1'

In [5]: n = 10

In [6]: k = 10

In [7]: np.random.seed(123)

In [8]: df = pd.DataFrame({"key": np.random.randint(0, k, n), "x": np.random.rand(n), "y": np.random.rand(n)})

In [9]: df
Out[9]: 
   key         x         y
0    2  0.480932  0.531551
1    2  0.392118  0.531828
2    6  0.343178  0.634401
3    1  0.729050  0.849432
4    3  0.438572  0.724455
5    9  0.059678  0.611024
6    6  0.398044  0.722443
7    1  0.737995  0.322959
8    0  0.182492  0.361789
9    1  0.175452  0.228263

# existing benchmark
In [10]: df.groupby(["key"], as_index=False, sort=False, observed=True, dropna=False).apply(lambda x: pd.Series({'r2': x.corr(numeric_only=True)['x']['y']**2}))
Out[10]: 
   key        r2
0    2  1.000000
1    6  1.000000
2    1  0.367864
3    3       NaN
4    9       NaN
5    0       NaN

# proposed benchmark
In [11]: df.groupby(["key"], as_index=False, sort=False, observed=True, dropna=False).apply(lambda x: (x['x'].corr(x['y'])**2)).rename(columns={None: "r2"})
Out[11]: 
   key        r2
0    2  1.000000
1    6  1.000000
2    1  0.367864
3    3       NaN
4    9       NaN
5    0       NaN

In [12]: n = 10_000

In [13]: k = 100

In [14]: df = pd.DataFrame({"key": np.random.randint(0, k, n), "x": np.random.rand(n), "y": np.random.rand(n)})

# existing benchmark
In [15]: %timeit df.groupby(["key"], as_index=False, sort=False, observed=True, dropna=False).apply(lambda x: pd.Series({'r2': x.corr(numeric_only=True)['x']['y']**2}))
10.3 ms ± 27.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# proposed benchmark
In [16]: %timeit df.groupby(["key"], as_index=False, sort=False, observed=True, dropna=False).apply(lambda x: (x['x'].corr(x['y'])**2)).rename(columns={None: "r2"})
7.39 ms ± 16.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
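
For anyone who wants to re-run the comparison outside IPython, here is a minimal standalone sketch. It is not the benchmark code from the repository: the helper names (make_frame, r2_existing, r2_proposed) are made up for illustration, and it selects the x and y columns before apply so that it does not depend on how a given pandas version treats the grouping column inside apply. It only checks that the two approaches agree and prints rough timings.

```python
import timeit

import numpy as np
import pandas as pd


def make_frame(n, k, seed=123):
    """Build a frame shaped like the one in the session above (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame(
        {"key": rng.integers(0, k, n), "x": rng.random(n), "y": rng.random(n)}
    )


def r2_existing(df):
    # Approach of the existing benchmark: build the full correlation
    # matrix per group, then pick out the single x/y entry.
    result = df.groupby("key", sort=False)[["x", "y"]].apply(
        lambda g: g.corr()["x"]["y"] ** 2
    )
    return result.rename("r2").reset_index()


def r2_proposed(df):
    # Approach of the proposed benchmark: correlate the two columns directly.
    result = df.groupby("key", sort=False)[["x", "y"]].apply(
        lambda g: g["x"].corr(g["y"]) ** 2
    )
    return result.rename("r2").reset_index()


df = make_frame(n=10_000, k=100)
existing = r2_existing(df)
proposed = r2_proposed(df)

# Both approaches should produce the same squared correlation per group.
same = np.allclose(existing["r2"], proposed["r2"], equal_nan=True)
print("results match:", same)

# Rough timings only; absolute numbers depend on the machine and pandas version.
for name, fn in [("existing", r2_existing), ("proposed", r2_proposed)]:
    seconds = timeit.timeit(lambda: fn(df), number=5) / 5
    print(f"{name}: {seconds * 1e3:.2f} ms per call")
```

Selecting the columns up front is a design choice for portability across pandas versions, so the timings from this sketch are not directly comparable to the %timeit figures above.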

jangorecki commented 9 months ago

Thanks, also benchplot dict needs to be updated

mroeschke commented 9 months ago

Thanks, updated it as well

mroeschke commented 9 months ago

@jangorecki any other changes needed for this PR?

jangorecki commented 9 months ago

I have not run the code, but it looks OK

Tmonster commented 9 months ago

Hi Matthew, just going to wait for the tests to pass and I'll merge 👍

mroeschke commented 9 months ago

Thanks for merging!