abess-team / abess

Fast Best-Subset Selection Library
https://abess.readthedocs.io/
Other
474 stars 41 forks source link

Any way to work with preselected sets of variables in abess & with MBIC criterion? #519

Closed tomwenseleers closed 1 year ago

tomwenseleers commented 1 year ago

I was wondering if there is any way to tell abess the original problem size in case one is working with preselected sets of variables, preselected via some other method, so that all information criteria would be calculated in reference to the original problem size?

E.g. imagine we had a dataset with 500 observations & 1 million variables & we used MCP for pre-screening :

# DATASET ####
simulate_orthogonal = function(n=500, p=1000000, n_nonzero=50, beta_sd=1, error_sd=1) {
  # Generate design matrix X
  X <- matrix(rnorm(n*p), n, p)  
  # Generate beta vector
  beta <- c(rnorm(n_nonzero, 0, beta_sd), rep(0, p-n_nonzero))
  # Generate errors
  epsilon <- rnorm(n, 0, error_sd)
  # Generate y vector
  y <- X %*% beta + epsilon
  return(list(X=X, y=y, beta_true=beta))
}
set.seed(1)
n = 500; p=1000000
data_gaus = simulate_orthogonal(n=n, p=p, n_nonzero=50, beta_sd=1, error_sd=1)
beta_true = data_gaus$beta_true

# PRESCREENING VIA MCP PENALIZED REGRESSION IN ORDINIS ####
library(devtools)
devtools::install_github("jaredhuling/ordinis")
library(ordinis)
gamma_mcp = 1.1 # default for MCP is 1.4, I am using value close to 1 to match best with our L0 objective
system.time(ordinis_fit <- ordinis(X=data_gaus$X, y=data_gaus$y, intercept=F,
               penalty = "mcp",
               gamma = gamma_mcp, 
               nlambda = 2L, # for pre-screening I just need 1 solution, so nlambda=1 in fact also works OK
               dfmax = n, # max nr of allowed variables
               alpha = 1, # no elastic net penalty
               maxit = 5 # NOTE: can be set to just 1 iteration when used for pre-screening
               )) # 7s with maxit 5 (with nlambda 2), 64s with defaults (with nlambda 100) 
# plot(ordinis_fit)
ordinis_fit$lambda[which.max(logLik(ordinis_fit))] # best lambda, no point with nlambda=1
min(AIC(ordinis_fit)) #
min(BIC(ordinis_fit)) #
which.min(BIC(ordinis_fit)) # NOTE: this is the smallest lambda value when nlambda=100
# beta_est0= ordinis_fit$beta[,which.min(BIC(ordinis_fit))][-1]
# beta_est0= ordinis_fit$beta[,which.min(AIC(ordinis_fit))][-1]
beta_est0= ordinis_fit$beta[,which.max(logLik(ordinis_fit))][-1] # gives same result as AIC & BIC here
table(beta_est0!=0, beta_true!=0, 
      dnn=c("estimated beta nonzero", "true beta nonzero"))
#                            true beta nonzero
# estimated beta nonzero         FALSE   TRUE
#                         FALSE 999915      9
#                         TRUE      35     41
# 41 true pos but also 35 false pos still

# ABESS FIT ON MCP PRE-SELECTED VARIABLES ####
library(abess)
startcoefs = beta_est0
selcoefs = which(startcoefs!=0)
p_selcoefs = length(selcoefs)
system.time(fit_abessafterMCP_selcoefs <- abess(x = data_gaus$X[,selcoefs,drop=F], 
                                                  y = data_gaus$y, 
                                                  family = "gaussian",
                                                  tune.path = "sequence", 
                                                  support.size = c(1:(p_selcoefs-1)),
                                                  lambda = 0, 
                                                  warm.start = TRUE, 
                                                  tune.type = "ebic")) # 0.06s
plot(fit_abessafterMCP_selcoefs, type="tune") 
beta_abessafterMCP_selcoefs = as.matrix(extract(fit_abessafterMCP_selcoefs)$beta) 
beta_abessafterMCP = Matrix(rep(0, p), sparse=T)
beta_abessafterMCP[selcoefs] = beta_abessafterMCP_selcoefs
beta_abessafterMCP = as.vector(beta_abessafterMCP)
sum(beta_abessafterMCP!=0) # 43
table(beta_abessafterMCP!=0, beta_true!=0, 
      dnn=c("estimated beta nonzero", "true beta nonzero"))
#                         true beta nonzero
# estimated beta nonzero  FALSE   TRUE
#                  FALSE 999945     12
#                   TRUE      5     38

plot(x=beta_abessafterMCP[beta_true!=0&beta_abessafterMCP!=0], 
     y=beta_true[beta_true!=0&beta_abessafterMCP!=0], pch=16, col="blue",
     xlab="Estimated coefficients", ylab="True coefficients",
     main=paste0("abess best subset on MCP preselected variables")) # true positives
abline(a=0, b=1, col="lightgrey")
points(x=beta_abessafterMCP[beta_true!=0&beta_abessafterMCP==0], 
       y=beta_true[beta_true!=0&beta_abessafterMCP==0], 
       pch=16, col="purple") # false negatives
points(x=beta_abessafterMCP[beta_true==0&beta_abessafterMCP!=0], 
       y=beta_true[beta_true==0&beta_abessafterMCP!=0], 
       pch=16, col="red") # false positives
points(x=beta_abessafterMCP[beta_true==0&beta_abessafterMCP==0][[1]], 
       y=beta_true[beta_true==0&beta_abessafterMCP==0][[1]], 
       pch=16, col="green") # true negatives

5_abess_on_MCP_presel_vars_p_1_million_n_500_50 nonzero

In this example, only ebicworks here (still returning 5 false positives though), but it could be that this is due to the information criteria being calculated with respect to the subsetted problem size (with p=76) & not with respect to the original problem size (with p=1000000). Ideally I would also like to use some higher penalisation to get rid of the 5 false positives. I thought I could perhaps simulate abessusing MBIC by specifying "aic" as tune.type but using ic.scale = (1/2)*log(n*(p^2)/16) so that 2 (the penalty factor implied by AIC multiplied by ic.scale would return the penalty implied by MBIC, but that didn't seem to work. Any thoughts how I could do something like this? I.e. penalize with respect to the original problem size and using the specific IC I am interested in (in this case MBIC)? Maybe mbiccould also be allowed by default? And p could be allowed to be passed as an option to specify the original problem size?

Running abess on the original matrix X with MCP selected variables as initial active set is also possible, but that's much slower then of course. In that sense, it might also be nice to support MCP-based pre-screening of variables for very high dimensional problems. An option allowing not to expand the initial active set if you pass it might also be a possibility, that might perhaps also be a good solution :

system.time(fit_abess <- abess(x = X, 
                                                y = y, 
                                                family = "gaussian",
                                                tune.path = "sequence", 
                                                support.size = c(1:(p_selcoefs-1)),
                                                lambda = 0,
                                                # init.active.set = as.integer(which(startcoefs!=0)), # strangely enough adding this increased runtime to 55s, even though the initial solution should be better, solution the same in both cases though
                                                warm.start = TRUE, 
                                                tune.type = "ebic")) # 48s
plot(fit_abess, type="tune")  # optimal support size 
beta_abess = as.vector(extract(fit_abess)$beta)/scaling 
sum(beta_abess!=0) # 33
table(beta_abess!=0, beta_true!=0, 
      dnn=c("estimated beta nonzero", "true beta nonzero"))
# true beta nonzero
# estimated beta nonzero  FALSE   TRUE
#                  FALSE 999950     17
#                  TRUE       0     33

plot(x=beta_abess[beta_true!=0&beta_abess!=0], 
     y=beta_true[beta_true!=0&beta_abess!=0], pch=16, col="blue",
     xlab="Estimated coefficients", ylab="True coefficients",
     main=paste0("abess best subset fit")) # true positives
abline(a=0, b=1, col="lightgrey")
points(x=beta_abess[beta_true!=0&beta_abess==0], 
       y=beta_true[beta_true!=0&beta_abess==0], 
       pch=16, col="purple") # false negatives
points(x=beta_abess[beta_true==0&beta_abess!=0], 
       y=beta_true[beta_true==0&beta_abess!=0], 
       pch=16, col="red") # false positives
points(x=beta_abess[beta_true==0&beta_abess==0][[1]], 
       y=beta_true[beta_true==0&beta_abess==0][[1]], 
       pch=16, col="green") # true negatives

7_abess_fit_p_1_million_n_500_50 nonzero

Mamba413 commented 1 year ago

@tomwenseleers thanks for this question. We are working on it now. I not pretty sure what is MBIC you mentioned?Can you provide any reference?

tomwenseleers commented 1 year ago

Many thanks for that! The specific version of mBIC I was using was cited in Frommlet & Nuel (2016) & I was using their choice of c=4, which is actually a hyperparameter: https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0148620&type=printable But it's a bit confusing since there is several versions of mBIC available: https://onlinelibrary.wiley.com/doi/epdf/10.1002/qre.936 https://link.springer.com/chapter/10.1007/978-3-642-29210-1_39

And there is still a whole zoo of other IC that have been suggested, e.g. (hope I am getting these formulae right)

hq = min2LL + c*log(log(n))*edf : the Hannan and Quinnn information criterion
ric = min2LL + 2 * log(p) # risk inflation criterion
mric = min2LL + 2 * sum(log(p/(1:edf))) # modified risk inflation criterion
cic = min2LL + 4 * sum(log(p/(1:edf))) # covariance inflation criterion
bicg =  min2LL + log(n)*edf + 2*g*lchoose(p,round(edf)) # g =1 suggested as default
# (https://www.proquest.com/openview/918b8b1efc7e0a0aa4d565ed54fa37dd/1?cbl=18750&diss=y&pq-origsite=gscholar)
bicq =  min2LL + log(n)*edf - 2*edf*log(q/(1-q))
# (see Xu, C. and McLeod, A.I. (2009). Bayesian Information Criterion with Bernouilli Prior. and https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=cb9a4547704f6a401116e04263e0445feab10cba)

For aic, bic, gic, ebic & mbic I was using

  aic = min2LL + 2 * edf # Akaike information criterion
  # aicc = ifelse(n > edf, min2LL + 2*edf*n/(n-edf), NA) # small sample AIC
  bic = min2LL + log(n) * edf # Bayesian information criterion
  gic = min2LL + log(p) * log(log(n)) * edf # generalized information criterion GIC = SIC in https://www.pnas.org/doi/10.1073/pnas.2014241117
  ebic = min2LL + (log(n) + 2 * (1 - log(n) / (2 * log(p))) * log(p)) * edf # extended BIC, Chen, J. and Chen, Z. (2008). Extended Bayesian information criterion for model selection with large model space. Biometrika, 94, 759-771., https://arxiv.org/abs/1107.2502 (note original still has an additional tuning parameter)
  mbic = min2LL + log(n * (p ^ 2) / 16) * edf # see Frommlet & Nuel 2016

For gic several versions have also been suggested though, so gic as a name is in fact a little ambiguous.

No idea though which ones are now in general recommended to achieve either optimal predictive performance or optimal variable selection consistency for either the n>p or p>n setting... Myself I use for n > p AIC & BIC when I am interested in optimal predictive performance (as optimising AIC asymptotically is equivalent to minimising leave one out cross validation error, Stone M. (1977) An asymptotic equivalence of choice of model by cross-validation and Akaike’s criterion. Journal of the Royal Statistical Society Series B. 39, 44–7) and optimal variable selection consistency (Shao J. (1997) An asymptotic theory for linear model selection. Statistica Sinica 7, 221-242.), respectively, and for p > n I was using mBIC for variable selection (with c=4, following Frommlet & Nuel) - but not sure what's best to get optimal predictive performance when p >> n. What IC do you find generally perform best for either purpose? Some of the IC also have a hyperparameter related to the actual nr of variables that you think have nonzero coefficients, which also makes sense.

Maybe an easy way to support any of these IC could be to allow for some argument ic.factor, which the user could set in function of the true p & n of the original problem him/herself & any other hyperparrameters. So that if you would pass 2 or log(n) or log(n * (p ^ 2) / 16) that it would correspond to AIC, BIC or mBIC etc? Maybe that could be used instead of argument ic.scale, which I find a little ambiguous in terms of what exactly it does. So that would address both the support of other alternative IC as well as allowing passing the correct penalization in case the variables have already been subsetting via another method.

Note that the leave one out cross validation error can also be calculated analytically from the residuals and the diagonal of the hat matrix, without having to actually carry out any cross validation (see also https://www.efavdb.com/leave-one-out-cross-validation). This would always be an alternative to the AIC, as that is only an asymptotic approximation of the LOOCV error. I suppose that would also work for generalized linear models if one works on the adjusted z scale of the GLM and if one uses the working observation weights.

bbayukari commented 1 year ago

Thanks for your code, I have run the experiment with different ICs.

First of all, for this question:

"I thought I could perhaps simulate abess using MBIC by specifying "aic" as tune.type but using ic.scale = (1/2)log(n(p^2)/16) so that 2 (the penalty factor implied by AIC multiplied by ic.scale would return the penalty implied by MBIC, but that didn't seem to work. "

This idea is almost correct, but ic.scale won't be used when tune.type is "aic" because we think ic.scale of "aic" must be a constant so that nobody needs to modify it. We can specify "bic" as tune.type but using ic.scale = log(n*(p^2)/16) / log(n) to implement MBIC.

The results are the num of true positive variables and the size of support-set considered by the algorithm under 5 ICs and subsetted/original dim settings.

 TP/ best.size| p=76 | p=1000000 -- | -- | -- aic | 41/76 | 41/76 bic | 41/76 | 41/76 gic | 41/76 | 34/34 ebic | 38/43 | 33/33 mbic | 41/76 | 33/33
tomwenseleers commented 1 year ago

Ha many thanks - great! So what are the numbers after the slash? The nr of true positives is the nr before the slash, but what is the second? Is 76 not always the set considered by the algorithm, given that that's the nr of MCP preselected variables? Or are some variables kicked out from the very start by the algo, with that nr being dependent on the penalization?

So would you say GIC works best based on this? But what were the nr of false positives, as I imagine that would be far too high with AIC and maybe also with GIC?

Maybe slightly counterintuitive that ic.scale would work on all IC except AIC - would it not be more logical to apply them to all? E.g. so that if ic.scale would be set at 1/2 and tune.type="aic" the penalisation would be half as strong as implied by AIC?

bbayukari commented 1 year ago

Let's take '41/76' as an example to explain the result. Its corresponding confusion matrix is

estimated/true FALSE TRUE
FALSE 999915 9
TRUE 35 41

'76' refers that the algorithm believes that the size of support-set is best at 76. The fact that best.size equals the total num of variables implies that the penalty of IC is too little to select correct variables.

The results above implies that using true dim (p=1e6) can increase the penalty so that improve the precision except for AIC and BIC.

Maybe slightly counterintuitive that ic.scale would work on all IC except AIC - would it not be more logical to apply them to all? E.g. so that if ic.scale would be set at 1/2 and tune.type="aic" the penalisation would be half as strong as implied by AIC?

Thanks for your suggestion, we will align the behavior of AIC with other ICs soon.

tomwenseleers commented 1 year ago

Ha OK makes sense! So AIC & BIC always provide insufficient penalisation for the high dimensional case, even when working with the original problem size, while GIC here seems best - giving 34 true pos & 0 false pos when working with the original problem size! That's cool! Amazes me that you can pick out 34 true positives from a set of 1 million possible variables with zero false positives, even when the effect size of many of the selected variables is relatively modest in terms of Cohen's d. If you use "gic" on the full dataset without MCP preselection I noticed abess selects 33 true positives (0 false positives). So it seems both gic & ebic work here... That runs in 56s on my laptop but only if you specify support.size = c(1:76) - c(1:(n-1)) would be much much slower... And if you specify the max support size of the MCP fit one might of course as well use that as the initial active set or subset to those variables...

bbayukari commented 1 year ago

I'm sorry that the experimental results were not clear enough and may cause a misunderstanding.

 TP/ best.size| p=76 | p=1000000 -- | -- | -- aic | 41/76 | 41/76 bic | 41/76 | 41/76 gic | 41/76 | 34/34 ebic | 38/43 | 33/33 mbic | 41/76 | 33/33

All the experiments used MCP for pre-screening, so abess only selected some of the 76 variables using different ICs. The p in the first row of the table refers to the total number of variables p in the IC, that is,

the information criteria being calculated with respect to the subsetted problem size (with p=76) & not with respect to the original problem size (with p=1000000).

So, AIC and BIC aren't able to provide sufficient penalties in this case. EBIC is the best one without correcting the value of p. After correcting, EBIC, GIC, MBIC give better and similar results.

It implies IC needs to be corrected after pre-screening. May I ask if there is anything else need to discuss?

tomwenseleers commented 1 year ago

No thanks that's clear! I'll close this then!