abstractqqq / polars_ds_extension

Polars extension for general data science use cases
MIT License

Add null-aware cross correlation #176

Closed CangyuanLi closed 3 weeks ago

CangyuanLi commented 3 weeks ago

Implement a faster, null-aware equivalent of pandas df.corr(). numpy.corrcoef is fast, but it propagates NaNs: a single NaN in a column turns every correlation involving that column into NaN. Polars df.corr() behaves the same way (it just uses np.corrcoef under the hood).

import numpy as np
import pandas as pd
import polars as pl

N_ROWS = 10
N_COLS = 3

A = np.random.rand(N_ROWS, N_COLS)  # random matrix
A.ravel()[
    np.random.choice(A.size, N_ROWS * N_COLS // 10, replace=False)
] = np.nan  # randomly set entries to NaN

df = pd.DataFrame(A, columns=[f"f{i}" for i in range(N_COLS)])
pf = pl.from_pandas(df)
print(df.corr())
print(pf.corr())
print(pf.select(pl.corr("f0", "f1")))
Output of print(df.corr()):
          f0        f1        f2
f0  1.000000 -0.178174 -0.424737
f1 -0.178174  1.000000  0.274256
f2 -0.424737  0.274256  1.000000
Output of print(pf.corr()):
┌─────┬─────┬─────┐
│ f0  ┆ f1  ┆ f2  │
│ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 │
╞═════╪═════╪═════╡
│ NaN ┆ NaN ┆ NaN │
│ NaN ┆ NaN ┆ NaN │
│ NaN ┆ NaN ┆ NaN │
└─────┴─────┴─────┘

What pandas does is loop over every pair of columns, drop the rows where either column of the pair is null, and build the correlation matrix pair by pair. Note that you cannot drop nulls up front: the rows that are null for (a, b) may differ from those for (a, c), and so on, so dropping every row with any null would throw away usable data.
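To make the pairwise-deletion semantics concrete, here is a minimal pure-Python reference sketch. It is not how pandas is implemented and it is not meant to be fast; the function name pairwise_corr and the dict-of-lists input are hypothetical choices for illustration, and None or NaN is assumed to mark a missing value.

```python
import math
from itertools import combinations


def pairwise_corr(columns):
    """Pearson correlation matrix using pairwise-complete observations.

    `columns` maps a column name to a list of floats; None or NaN marks
    a missing value (an assumption made for this sketch).
    """

    def missing(x):
        return x is None or (isinstance(x, float) and math.isnan(x))

    def pearson(xs, ys):
        # Keep only the rows where BOTH columns of this pair are present.
        pairs = [(x, y) for x, y in zip(xs, ys) if not missing(x) and not missing(y)]
        n = len(pairs)
        if n < 2:
            return math.nan
        mx = sum(x for x, _ in pairs) / n
        my = sum(y for _, y in pairs) / n
        sxy = sum((x - mx) * (y - my) for x, y in pairs)
        sxx = sum((x - mx) ** 2 for x, _ in pairs)
        syy = sum((y - my) ** 2 for _, y in pairs)
        if sxx == 0 or syy == 0:
            return math.nan  # correlation is undefined for a constant column
        return sxy / math.sqrt(sxx * syy)

    names = list(columns)
    out = {a: {b: math.nan for b in names} for a in names}
    for a in names:
        out[a][a] = pearson(columns[a], columns[a])
    # The matrix is symmetric, so each unordered pair is computed once.
    for a, b in combinations(names, 2):
        out[a][b] = out[b][a] = pearson(columns[a], columns[b])
    return out


# Each pair shares two complete rows, but only the first row is complete
# across all three columns, so dropping nulls up front would leave one row.
data = {
    "a": [1.0, 2.0, None, 4.0],
    "b": [2.0, 4.0, 6.0, None],
    "c": [4.0, None, 2.0, 1.0],
}
corr = pairwise_corr(data)
```

In this toy input every pair of columns has a different set of usable rows, which is exactly the case where a single global drop_nulls would lose information.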

The plan is to wrap pl.corr(), which handles nulls the way pandas does. Later, I would like to take a closer look at the FFT / KDTree implementation @abstractqqq mentioned and see whether it gives a speedup.