jolars opened this issue 2 years ago
You can use the X_density parameter in https://benchopt.github.io/generated/benchopt.datasets.simulated.make_correlated_data.html#benchopt.datasets.simulated.make_correlated_data
That's great, thanks! But the current implementation doesn't really work when it comes to correlation + sparsity, right?
Doesn't it? We create X the standard way, then decimate it: https://github.com/benchopt/benchopt/blob/main/benchopt/datasets/simulated.py#L93
Since the decimation is i.i.d. and independent of X, it seems to me that the off-diagonal correlations are just multiplied by X_density, so the correlation structure is preserved up to scaling.
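To make the decimation step concrete, here is a minimal sketch of what i.i.d. decimation amounts to (an illustration only, not benchopt's exact implementation):

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 3))  # any dense design, possibly correlated
X_density = 0.01

# Keep each entry independently with probability X_density, zero it otherwise.
# The mask is i.i.d. and independent of X.
mask = rng.random(X.shape) < X_density
X_decimated = X * mask

With such a mask, an entry survives in both of two columns at the same row only with probability X_density**2, which is what drives the correlation argument below.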
No, I don't think so, since you're uniformly decimating it. If you have two columns, for instance, and only keep a single nonzero value in each of the columns, then it's very likely that these two values are going to be at two very different indices, right? See here:
import numpy as np
from benchopt.datasets.simulated import make_correlated_data
n = 10_000
p = 3
rho = 0.9
random_state = 1
X_density = 0.01
A, _, _ = make_correlated_data(n, p, random_state=random_state, rho=rho)
print(np.corrcoef(A.T))
#> [[1. 0.90090375 0.81264968]
#> [0.90090375 1. 0.9021679 ]
#> [0.81264968 0.9021679 1. ]]
B, _, _ = make_correlated_data(
    n, p, random_state=random_state, rho=rho, X_density=X_density
)
print(np.corrcoef(B.T.toarray()))
#> [[1.00000000e+00 9.28859390e-05 3.68998428e-03]
#> [9.28859390e-05 1.00000000e+00 7.56397951e-03]
#> [3.68998428e-03 7.56397951e-03 1.00000000e+00]]
Let $\delta_i$ be the decimation Bernoulli variable, with expectation $\rho$ (here $\rho$ denotes the keep probability X_density, not the correlation parameter of the snippet above). Note that $\delta_i = \delta_i^2$. At the numerator of the correlation I have:
$$E\Big[\sum_i \delta_i x_i \, \delta_i' x_i'\Big] = \rho^2 \, E\Big[\sum_i x_i x_i'\Big]$$
(this works only when $\delta_i$ is independent from $\delta_i'$, i.e., when we are looking at the correlation between two different features), while at the denominator:
$$\sqrt{E\Big[\sum_i \delta_i^2 x_i^2\Big]} = \sqrt{\rho \, E\Big[\sum_i x_i^2\Big]}.$$
So at the numerator you get a $\rho^2$ out, while at the denominator you get $\sqrt{\rho}$ twice. Thus in total the new correlation is multiplied by $\rho$ outside the diagonal?
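Putting numerator and denominator together (a quick consolidation of the argument above, assuming zero-mean features):
$$\mathrm{cor}(\delta x, \delta' x') = \frac{\rho^2 \, E\big[\sum_i x_i x_i'\big]}{\sqrt{\rho \, E\big[\sum_i x_i^2\big]} \, \sqrt{\rho \, E\big[\sum_i x_i'^2\big]}} = \rho \, \mathrm{cor}(x, x'),$$
so with X_density = 0.01 and rho = 0.9 in the snippet above, the expected off-diagonal entries are around 0.009, which the 1e6-sample run below confirms.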
Same snippet but with 1e6 samples:
[[1. 0.89992156 0.80996923]
[0.89992156 1. 0.8999785 ]
[0.80996923 0.8999785 1. ]]
[[1. 0.00910993 0.00873657]
[0.00910993 1. 0.00923818]
[0.00873657 0.00923818 1. ]]
So the off-diagonal entries are multiplied by 0.01, i.e. by $\rho$ = X_density.
Regarding: "and only keep a single nonzero value in each of the columns, then it's very likely that these two values are going to be at two very different indices, right" yes, but once in a while (1 out of n_samples), the indices match and you get a non zero expectation.
Right, but you don't get the nominal 0.9; isn't that what you want when X is sparse as well?
If you want independent supports from one column to the other (a legitimate assumption IMO), I suppose that it's not possible to have a correlation higher than the column density, but if you find a way I'm interested!
Well... I guess that depends on what you consider the zeros to be. If you think they are values just like the non-zeros, then I don't see why the supports should be independent. If you consider them to be missing data completely at random, then sure, it would not make sense for the supports to be correlated.
If we consider binary data instead, does that change things for you? There is of course a lot of very sparse data with binary values (e.g. microarray data) and highly correlated columns, and you cannot simulate that type of data unless you allow the sparsity pattern itself to be correlated.
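For instance, here is a minimal sketch of what I have in mind (my own illustration; make_correlated_binary is a hypothetical helper, not an existing benchopt function) that generates sparse binary data with correlated columns by thresholding correlated Gaussian latents:

import numpy as np
from scipy import sparse
from scipy.stats import norm

def make_correlated_binary(n, p, rho=0.9, density=0.05, random_state=0):
    # Hypothetical helper (not part of benchopt): sparse binary X whose
    # sparsity pattern itself is correlated across columns.
    rng = np.random.default_rng(random_state)
    # Latent Gaussian with the same rho**|i - j| structure as the snippets above
    cov = rho ** np.abs(np.arange(p)[:, None] - np.arange(p)[None, :])
    Z = rng.multivariate_normal(np.zeros(p), cov, size=n)
    # Threshold each standard-normal column so that about `density` of its
    # entries are nonzero; correlated latents give correlated supports.
    threshold = norm.ppf(1 - density)
    return sparse.csc_matrix((Z > threshold).astype(float))

X = make_correlated_binary(100_000, 3)
print(X.mean(axis=0))              # column densities, close to 0.05
print(np.corrcoef(X.toarray().T))  # positive off-diagonal correlations,
                                   # not capped by the column density

The resulting binary correlations are lower than the latent rho and depend on the density through the thresholding, but they are no longer limited by the column density, because the supports themselves are dependent.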
Consider adding functions to simulate sparse data (binary X) with a correlation structure, which would be useful when benchmarking in the p >> n regime.