Add null-aware cross correlation

Implement a faster, null-aware equivalent of pandas df.corr(). numpy.corrcoef is fast but propagates NaNs, and Polars DataFrame.corr behaves the same way because it uses np.corrcoef under the hood.

import numpy as np
import pandas as pd
import polars as pl

N_ROWS = 10
N_COLS = 3

A = np.random.rand(N_ROWS, N_COLS)  # random matrix
A.ravel()[
    np.random.choice(A.size, N_ROWS * N_COLS // 10, replace=False)
] = np.nan  # randomly set entries to NaN

df = pd.DataFrame(A, columns=[f"f{i}" for i in range(N_COLS)])
pf = pl.from_pandas(df)
print(df.corr())
print(pf.corr())
print(pf.select(pl.corr("f0", "f1")))

print(df.corr())
          f0        f1        f2
f0  1.000000 -0.178174 -0.424737
f1 -0.178174  1.000000  0.274256
f2 -0.424737  0.274256  1.000000

print(pf.corr())
┌─────┬─────┬─────┐
│ f0  ┆ f1  ┆ f2  │
│ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 │
╞═════╪═════╪═════╡
│ NaN ┆ NaN ┆ NaN │
│ NaN ┆ NaN ┆ NaN │
│ NaN ┆ NaN ┆ NaN │
└─────┴─────┴─────┘

pandas loops over every pair of columns, drops the rows where either column of the pair is null, and fills in the correlation matrix one pair at a time. You cannot drop nulls once up front: the rows that are null for the pair (a, b) may differ from those for (a, c), so a global drop would discard rows that are still valid for some pairs.
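To make the pairwise deletion concrete, here is a minimal NumPy sketch of the loop described above. It is not the pandas implementation and not the code in this PR. The function name pairwise_complete_corr is hypothetical, and it assumes missing values are encoded as NaN in a 2-D float array.

```python
import numpy as np


def pairwise_complete_corr(X: np.ndarray) -> np.ndarray:
    """Pearson correlation matrix using only rows valid for each column pair.

    Hypothetical helper for illustration; assumes X is a 2-D float array
    with missing values stored as NaN.
    """
    n_cols = X.shape[1]
    valid = ~np.isnan(X)  # True where a value is present
    out = np.full((n_cols, n_cols), np.nan)
    for i in range(n_cols):
        for j in range(i, n_cols):
            # Keep only rows where BOTH columns of this pair are present.
            mask = valid[:, i] & valid[:, j]
            if mask.sum() < 2:
                continue  # not enough observations; leave NaN
            r = np.corrcoef(X[mask, i], X[mask, j])[0, 1]
            out[i, j] = out[j, i] = r
    return out


X = np.array(
    [
        [1.0, 2.0, np.nan],
        [2.0, 1.0, 1.0],
        [3.0, 4.0, 2.0],
        [4.0, 3.0, 5.0],
        [np.nan, 5.0, 3.0],
    ]
)
R = pairwise_complete_corr(X)
print(R)
```

Each pair gets its own mask, so the (0, 1) entry uses rows 0 to 3 while the (0, 2) entry uses rows 1 to 3. Calling np.corrcoef(X, rowvar=False) on the same array returns NaN for every entry, because every column contains at least one NaN.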

The plan is to wrap pl.corr(), which, unlike DataFrame.corr, handles nulls the way pandas does. Later, I would like to take a closer look at the FFT / KDTree implementation @abstractqqq mentioned to see whether it gives a speedup.

abstractqqq / polars_ds_extension

Add null-aware cross correlation #176