Basically, implement a faster version of pandas df.corr(). numpy.corrcoef is fast, but propagates NaNs. Polars df.corr has the same problem (it just uses np.corrcoef under the hood).
import numpy as np
import pandas as pd
import polars as pl
N_ROWS = 10
N_COLS = 3
A = np.random.rand(N_ROWS, N_COLS) # random matrix
A.ravel()[
np.random.choice(A.size, N_ROWS * N_COLS // 10, replace=False)
] = np.nan # randomly set entries to NaN
df = pd.DataFrame(A, columns=[f"f{i}" for i in range(N_COLS)])
pf = pl.from_pandas(df)
print(df.corr())
print(pf.corr())
print(pf.select(pl.corr("f0", "f1")))
print(df.corr())
f0 f1 f2
f0 1.000000 -0.178174 -0.424737
f1 -0.178174 1.000000 0.274256
f2 -0.424737 0.274256 1.000000
print(pf.corr())
┌─────┬─────┬─────┐
│ f0 ┆ f1 ┆ f2 │
│ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 │
╞═════╪═════╪═════╡
│ NaN ┆ NaN ┆ NaN │
│ NaN ┆ NaN ┆ NaN │
│ NaN ┆ NaN ┆ NaN │
└─────┴─────┴─────┘
What pandas is actually doing is looping over all column pairs, dropping the nulls between each pair of columns, and building the correlation matrix pair by pair. Note that you cannot drop nulls beforehand, since the null rows shared between a and b may differ from those shared between a and c, and so on.
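For reference, the pairwise masking pandas effectively performs can be sketched in pure NumPy (pairwise_corr is a hypothetical helper for illustration, not the proposed implementation):

```python
import numpy as np

def pairwise_corr(A: np.ndarray) -> np.ndarray:
    """Pearson correlation per column pair, dropping NaN rows per pair."""
    n_cols = A.shape[1]
    out = np.eye(n_cols)
    for i in range(n_cols):
        for j in range(i + 1, n_cols):
            # keep only rows where BOTH columns are non-NaN for this pair
            mask = ~(np.isnan(A[:, i]) | np.isnan(A[:, j]))
            if mask.sum() >= 2:
                out[i, j] = out[j, i] = np.corrcoef(A[mask, i], A[mask, j])[0, 1]
            else:
                out[i, j] = out[j, i] = np.nan
    return out
```

The per-pair mask is exactly why the pairwise loop cannot be replaced by a single dropna up front.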
The plan is to wrap pl.corr(), which handles nulls the way pandas does. In the future, I would like to take a closer look at the FFT / KDTree implementation @abstractqqq mentioned and see if we can get a further speedup.