benchopt / benchmark_lasso_path

Benchopt benchmark for Lasso path

feat: add simulated sparse data #13

Open jolars opened 2 years ago

jolars commented 2 years ago

Consider adding functions to simulate sparse data (binary X), with correlation structure, which should be useful when benchmarking in the p >> n regime.

mathurinm commented 2 years ago

You can use the X_density parameter in https://benchopt.github.io/generated/benchopt.datasets.simulated.make_correlated_data.html#benchopt.datasets.simulated.make_correlated_data
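For reference, a minimal usage sketch (the n, p, rho and X_density values are arbitrary; the call mirrors the one used later in this thread):

from benchopt.datasets.simulated import make_correlated_data

# ~1% nonzero entries; X is returned as a scipy sparse matrix when X_density < 1.
X, y, w_true = make_correlated_data(
    1000, 5000, rho=0.6, X_density=0.01, random_state=0
)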

jolars commented 2 years ago

You can use the X_density parameter in https://benchopt.github.io/generated/benchopt.datasets.simulated.make_correlated_data.html#benchopt.datasets.simulated.make_correlated_data

That's great, thanks! But the current implementation doesn't really work when it comes to combining correlation and sparsity, right?

mathurinm commented 2 years ago

Doesn't it? We create X the standard way, then decimate it: https://github.com/benchopt/benchopt/blob/main/benchopt/datasets/simulated.py#L93

Since the decimation is iid and independent of X, it seems to me that the correlation matrix is just multiplied by X_density, so the correlation structure is preserved.
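In other words (a schematic sketch of the decimation idea, not the actual benchopt code):

import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
n, p, X_density = 1000, 5, 0.1
X_dense = rng.standard_normal((n, p))         # stands in for a correlated dense X
mask = rng.random((n, p)) < X_density         # iid Bernoulli(X_density), independent of X
X_sparse = sparse.csr_matrix(X_dense * mask)  # decimated, sparse design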

jolars commented 2 years ago

No, I don't think so, since you're uniformly decimating it. If you have two columns, for instance, and only keep a single nonzero value in each column, then it's very likely that these two values are going to end up at two very different indices, right? See here:

import numpy as np
from benchopt.datasets.simulated import make_correlated_data

n = 10_000
p = 3
rho = 0.9
random_state = 1
X_density = 0.01

A, _, _ = make_correlated_data(n, p, random_state=random_state, rho=rho)

print(np.corrcoef(A.T))
#> [[1.         0.90090375 0.81264968]
#>  [0.90090375 1.         0.9021679 ]
#>  [0.81264968 0.9021679  1.        ]]

B, _, _ = make_correlated_data(
    n, p, random_state=random_state, rho=rho, X_density=X_density
)

print(np.corrcoef(B.T.toarray()))
#> [[1.00000000e+00 9.28859390e-05 3.68998428e-03]
#>  [9.28859390e-05 1.00000000e+00 7.56397951e-03]
#>  [3.68998428e-03 7.56397951e-03 1.00000000e+00]]

mathurinm commented 2 years ago

Let $\delta_i$ be the decimation Bernoulli variable, with expectation $\rho$ (here $\rho$ is the X_density, not the correlation parameter of the snippet above). Note that $\delta_i = \delta_i^2$. I have:

$$E\left[\sum_i \delta_i x_i \, \delta_i' x_i'\right] = \rho^2 \, E\left[\sum_i x_i x_i'\right]$$

(this holds only when $\delta_i$ is independent of $\delta_i'$, i.e. when we look at the correlation between two different features),

while in the denominator: $$\sqrt{E\left[\sum_i \delta_i^2 x_i^2\right]} = \sqrt{\rho \, E\left[\sum_i x_i^2\right]}$$

So the numerator picks up a factor $\rho^2$, while the denominator picks up $\sqrt{\rho}$ twice. In total, the correlation is thus multiplied by $\rho$ outside the diagonal?

mathurinm commented 2 years ago

Same snippet but with 1e6 samples (dense correlation matrix first, then the decimated one):

[[1.         0.89992156 0.80996923]
 [0.89992156 1.         0.8999785 ]
 [0.80996923 0.8999785  1.        ]]
[[1.         0.00910993 0.00873657]
 [0.00910993 1.         0.00923818]
 [0.00873657 0.00923818 1.        ]]

so the off-diagonal correlations are multiplied by 0.01 ($\rho$, i.e. the X_density).

Regarding: "and only keep a single nonzero value in each of the columns, then it's very likely that these two values are going to be at two very different indices, right" yes, but once in a while (1 out of n_samples), the indices match and you get a non zero expectation.

jolars commented 2 years ago

Right, but you don't get the nominal 0.9; isn't that what you would want when X is sparse as well?

mathurinm commented 2 years ago

If you want independent supports from one column to the other (a legitimate assumption IMO) I suppose that it's not possible to have correlation higher than the column density, but if you find a way I'm interested !

jolars commented 2 years ago

If you want independent supports from one column to the other (a legitimate assumption IMO) I suppose that it's not possible to have correlation higher than the column density, but if you find a way I'm interested !

Well... I guess that depends on what you consider the zeros to be. If you think of them as values just like the nonzeros, then I don't see why the supports should be independent. If you consider them to be data missing completely at random, then sure, it would not make sense for the supports to be correlated.

If we consider binary data instead, does that change things for you? There is of course a lot of very sparse binary data (e.g. microarray data) with highly correlated columns, and you cannot simulate that kind of data unless you allow the sparsity pattern itself to be correlated.
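For what it's worth, one way to get a correlated sparsity pattern is to threshold correlated latent Gaussians (a Gaussian-copula-style construction). The helper below is only a sketch with a made-up name, assuming a Toeplitz latent correlation; the correlation of the resulting binary columns is not rho itself, but it can sit far above the column density:

import numpy as np
from scipy import sparse
from scipy.stats import norm

def make_correlated_binary_data(n_samples, n_features, rho=0.9, density=0.01,
                                random_state=None):
    rng = np.random.default_rng(random_state)
    # Latent Gaussians with Toeplitz correlation rho**|i - j|.
    idx = np.arange(n_features)
    corr = rho ** np.abs(idx[:, None] - idx[None, :])
    Z = rng.standard_normal((n_samples, n_features)) @ np.linalg.cholesky(corr).T
    # Keep roughly density * n_samples entries per column; correlated latents
    # give overlapping supports, hence correlated binary columns.
    threshold = norm.ppf(1 - density)
    return sparse.csc_matrix((Z > threshold).astype(float))

X = make_correlated_binary_data(100_000, 3)
print(np.corrcoef(X.toarray().T))  # off-diagonal entries well above the 0.01 density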