Closed tomwenseleers closed 1 year ago
@tomwenseleers thanks for this question. We are working on it now. I'm not quite sure what the MBIC you mentioned is; can you provide a reference?
Many thanks for that! The specific version of mBIC I was using is the one cited in Frommlet & Nuel (2016), with their choice of c=4 (which is actually a hyperparameter): https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0148620&type=printable But it's a bit confusing, since there are several versions of mBIC available: https://onlinelibrary.wiley.com/doi/epdf/10.1002/qre.936 https://link.springer.com/chapter/10.1007/978-3-642-29210-1_39
And there is still a whole zoo of other ICs that have been suggested, e.g. (I hope I am getting these formulae right):
```r
hq   = min2LL + c*log(log(n))*edf        # Hannan-Quinn information criterion
ric  = min2LL + 2 * log(p) * edf         # risk inflation criterion
mric = min2LL + 2 * sum(log(p/(1:edf)))  # modified risk inflation criterion
cic  = min2LL + 4 * sum(log(p/(1:edf)))  # covariance inflation criterion
bicg = min2LL + log(n)*edf + 2*g*lchoose(p, round(edf))  # g = 1 suggested as default
# (https://www.proquest.com/openview/918b8b1efc7e0a0aa4d565ed54fa37dd/1?cbl=18750&diss=y&pq-origsite=gscholar)
bicq = min2LL + log(n)*edf - 2*edf*log(q/(1-q))
# (see Xu, C. and McLeod, A.I. (2009). Bayesian Information Criterion with Bernoulli Prior, and https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=cb9a4547704f6a401116e04263e0445feab10cba)
```
For aic, bic, gic, ebic & mbic I was using
```r
aic  = min2LL + 2 * edf                     # Akaike information criterion
# aicc = ifelse(n > edf, min2LL + 2*edf*n/(n-edf), NA)  # small-sample AIC
bic  = min2LL + log(n) * edf                # Bayesian information criterion
gic  = min2LL + log(p) * log(log(n)) * edf  # generalized information criterion; GIC = SIC in https://www.pnas.org/doi/10.1073/pnas.2014241117
ebic = min2LL + (log(n) + 2 * (1 - log(n) / (2 * log(p))) * log(p)) * edf  # extended BIC; Chen, J. and Chen, Z. (2008). Extended Bayesian information criterion for model selection with large model space. Biometrika, 95, 759-771; https://arxiv.org/abs/1107.2502 (note the original still has an additional tuning parameter)
mbic = min2LL + log(n * (p ^ 2) / 16) * edf  # see Frommlet & Nuel 2016
```
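For concreteness, these five criteria can be written out as plain functions. This is my own transcription of the formulas above into Python, not abess's internal code; `min2LL`, `edf`, `n` and `p` are as defined in the thread (minus-twice log-likelihood, effective degrees of freedom, observations, candidate variables):

```python
import math

def aic(min2LL, edf):
    return min2LL + 2 * edf

def bic(min2LL, edf, n):
    return min2LL + math.log(n) * edf

def gic(min2LL, edf, n, p):
    return min2LL + math.log(p) * math.log(math.log(n)) * edf

def ebic(min2LL, edf, n, p):
    logn, logp = math.log(n), math.log(p)
    # with the gamma implied above, this simplifies to min2LL + 2*log(p)*edf
    return min2LL + (logn + 2 * (1 - logn / (2 * logp)) * logp) * edf

def mbic(min2LL, edf, n, p):
    return min2LL + math.log(n * p**2 / 16) * edf
```

Note that only `gic`, `ebic` and `mbic` involve `p`, which is what the discussion of pre-screening below hinges on.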
Several versions of gic have also been suggested, though, so gic as a name is in fact a little ambiguous.
No idea, though, which ones are now generally recommended to achieve either optimal predictive performance or optimal variable-selection consistency in the n > p or p > n settings... For n > p I myself use AIC when I am interested in optimal predictive performance (since optimising AIC is asymptotically equivalent to minimising the leave-one-out cross-validation error; Stone, M. (1977). An asymptotic equivalence of choice of model by cross-validation and Akaike's criterion. Journal of the Royal Statistical Society, Series B, 39, 44-47) and BIC when I am interested in optimal variable-selection consistency (Shao, J. (1997). An asymptotic theory for linear model selection. Statistica Sinica, 7, 221-242). For p > n I was using mBIC for variable selection (with c=4, following Frommlet & Nuel), but I'm not sure what's best for optimal predictive performance when p >> n. Which ICs do you find generally perform best for either purpose? Some of the ICs also have a hyperparameter related to the actual number of variables that you think have nonzero coefficients, which also makes sense.
Maybe an easy way to support any of these ICs could be to allow an argument `ic.factor`, which the user could set as a function of the true p & n of the original problem and any other hyperparameters, so that passing 2, log(n) or log(n * (p ^ 2) / 16) would correspond to AIC, BIC or mBIC, etc. Maybe that could be used instead of the argument `ic.scale`, which I find a little ambiguous in terms of what exactly it does. That would address both the support of alternative ICs and allow passing the correct penalization in case the variables have already been subsetted via another method.
Note that the leave-one-out cross-validation error can also be calculated analytically from the residuals and the diagonal of the hat matrix, without actually having to carry out any cross-validation (see also https://www.efavdb.com/leave-one-out-cross-validation). This would always be an alternative to AIC, as AIC is only an asymptotic approximation of the LOOCV error. I suppose this would also work for generalized linear models if one works on the adjusted z scale of the GLM and uses the working observation weights.
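To illustrate that identity, here is a minimal self-contained sketch (made-up data, simple linear regression, not abess code): the exact OLS leave-one-out residual is e_i / (1 - h_ii), with h_ii = 1/n + (x_i - xbar)^2 / Sxx the hat-matrix diagonal, so no refitting is needed.

```python
import random

def fit(xs, ys):
    """Ordinary least squares for y ~ a + b*x; returns (intercept, slope)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    return ybar - b * xbar, b

random.seed(1)
xs = [random.uniform(0, 10) for _ in range(40)]
ys = [1.5 + 0.8 * x + random.gauss(0, 1) for x in xs]

n = len(xs)
a, b = fit(xs, ys)
xbar = sum(xs) / n
sxx = sum((x - xbar) ** 2 for x in xs)

# Analytic shortcut: LOO residual = e_i / (1 - h_ii), no refitting.
loocv_analytic = sum(
    ((y - (a + b * x)) / (1 - (1 / n + (x - xbar) ** 2 / sxx))) ** 2
    for x, y in zip(xs, ys)
) / n

# Brute force: refit n times, predicting each held-out point.
loocv_explicit = 0.0
for i in range(n):
    ai, bi = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
    loocv_explicit += (ys[i] - (ai + bi * xs[i])) ** 2
loocv_explicit /= n

assert abs(loocv_analytic - loocv_explicit) < 1e-8  # identical up to rounding
```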
Thanks for your code, I have run the experiment with different ICs.
First of all, for this question:
> I thought I could perhaps simulate abess using MBIC by specifying "aic" as `tune.type` but using `ic.scale = (1/2)*log(n*(p^2)/16)`, so that 2 (the penalty factor implied by AIC) multiplied by `ic.scale` would return the penalty implied by MBIC, but that didn't seem to work.
This idea is almost correct, but `ic.scale` won't be used when `tune.type` is "aic", because we think the `ic.scale` of "aic" should be a constant that nobody needs to modify. You can instead specify "bic" as `tune.type` and use `ic.scale = log(n*(p^2)/16) / log(n)` to implement MBIC.
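A quick sanity check of the algebra behind this workaround, with hypothetical values of n, p and edf matching the thread's setting (this only verifies the arithmetic, not abess itself):

```python
import math

n, p, edf = 500, 10**6, 10          # hypothetical example values

bic_penalty = math.log(n) * edf                 # what "bic" charges
scale = math.log(n * p**2 / 16) / math.log(n)   # the suggested ic.scale
mbic_penalty = math.log(n * p**2 / 16) * edf    # the mBIC target

# Rescaling the BIC penalty by ic.scale reproduces the mBIC penalty exactly.
assert abs(bic_penalty * scale - mbic_penalty) < 1e-9
```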
The results show the number of true-positive variables and the support-set size considered best by the algorithm, under 5 ICs and the subsetted/original dimension settings.

| TP / best.size | p=76 | p=1000000 |
| -- | -- | -- |
| aic | 41/76 | 41/76 |
| bic | 41/76 | 41/76 |
| gic | 41/76 | 34/34 |
| ebic | 38/43 | 33/33 |
| mbic | 41/76 | 33/33 |

Ha many thanks - great! So what are the numbers after the slash? The number of true positives is the number before the slash, but what is the second? Is 76 not always the set considered by the algorithm, given that that's the number of MCP-preselected variables? Or are some variables kicked out from the very start by the algorithm, with that number being dependent on the penalization?
So would you say GIC works best based on this? But what were the numbers of false positives? I imagine those would be far too high with AIC, and maybe also with GIC.
Maybe slightly counterintuitive that `ic.scale` would work on all ICs except AIC? Would it not be more logical to apply it to all, e.g. so that if `ic.scale` were set at 1/2 with `tune.type="aic"`, the penalisation would be half as strong as implied by AIC?
Let's take '41/76' as an example to explain the result. Its corresponding confusion matrix is:
estimated/true | FALSE | TRUE |
---|---|---|
FALSE | 999915 | 9 |
TRUE | 35 | 41 |
'76' means that the algorithm believes the best support-set size is 76. The fact that `best.size` equals the total number of variables implies that the penalty of the IC is too small to select the correct variables.
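A quick arithmetic check that these counts are consistent with '41/76' and p = 1e6 (the 50 truly active variables are not stated in the thread; they are implied by TP + FN = 41 + 9):

```python
# Counts from the confusion matrix above.
tn, fn = 999915, 9   # estimated FALSE row
fp, tp = 35, 41      # estimated TRUE row

assert tp == 41                     # true positives: the number before the slash
assert tp + fp == 76                # best.size: all variables the model selects
assert tn + fn + fp + tp == 10**6   # total number of candidate variables
assert tp + fn == 50                # truly active variables implied by the table
```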
The results above imply that using the true dimension (p=1e6) increases the penalty and thereby improves the precision, except for AIC and BIC, whose penalties do not depend on p.
> Maybe slightly counterintuitive that ic.scale would work on all IC except AIC - would it not be more logical to apply them to all? E.g. so that if ic.scale would be set at 1/2 and tune.type="aic" the penalisation would be half as strong as implied by AIC?
Thanks for your suggestion, we will align the behavior of AIC with other ICs soon.
Ha OK, makes sense! So AIC & BIC always provide insufficient penalisation in the high-dimensional case, even when working with the original problem size, while GIC here seems best, giving 34 true positives & 0 false positives when working with the original problem size! That's cool! It amazes me that you can pick out 34 true positives from a set of 1 million possible variables with zero false positives, even when the effect size of many of the selected variables is relatively modest in terms of Cohen's d. If you use "gic" on the full dataset without MCP preselection, I noticed abess selects 33 true positives (and 0 false positives). So it seems both gic & ebic work here... That runs in 56 s on my laptop, but only if you specify `support.size = c(1:76)`; `c(1:(n-1))` would be much, much slower... And if you specify the max support size of the MCP fit, one might of course just as well use that as the initial active set, or subset to those variables...
I'm sorry that the experimental results were not clear enough and may cause a misunderstanding.
| TP / best.size | p=76 | p=1000000 |
| -- | -- | -- |
| aic | 41/76 | 41/76 |
| bic | 41/76 | 41/76 |
| gic | 41/76 | 34/34 |
| ebic | 38/43 | 33/33 |
| mbic | 41/76 | 33/33 |

All the experiments used MCP for pre-screening, so abess only selected from the 76 variables under the different ICs. The `p` in the first row of the table refers to the total number of variables `p` used in the IC; that is, for p=76 the information criteria were calculated with respect to the subsetted problem size, not with respect to the original problem size (p=1000000).
So, AIC and BIC aren't able to provide sufficient penalties in this case. EBIC is the best one without correcting the value of p; after correcting, EBIC, GIC and MBIC give better and similar results.
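One way to see why correcting p matters is to compare the per-coefficient penalty that each IC charges at n=500 for the subsetted (p=76) versus the original (p=1e6) problem size. This is my own back-of-the-envelope computation from the formulas quoted earlier in the thread, not abess output:

```python
import math

n = 500  # observations, as in the thread's example

def penalties(p):
    """Per-coefficient penalty of each IC at sample size n, dimension p."""
    logn, logp = math.log(n), math.log(p)
    return {
        "aic": 2.0,
        "bic": logn,
        "gic": logp * math.log(logn),
        "ebic": logn + 2 * (1 - logn / (2 * logp)) * logp,
        "mbic": math.log(n * p**2 / 16),
    }

small, big = penalties(76), penalties(10**6)

# AIC and BIC are unchanged by the correction; the p-dependent ICs
# charge much more per coefficient at the original dimension.
assert small["aic"] == big["aic"] and small["bic"] == big["bic"]
assert big["gic"] > small["gic"] and big["mbic"] > small["mbic"]
```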
This implies that the IC needs to be corrected after pre-screening. May I ask if there is anything else that needs to be discussed?
No thanks that's clear! I'll close this then!
I was wondering if there is any way to tell abess the original problem size in case one is working with preselected sets of variables, preselected via some other method, so that all information criteria would be calculated in reference to the original problem size?
E.g. imagine we had a dataset with 500 observations & 1 million variables & we used MCP for pre-screening :
In this example, only `ebic` works (still returning 5 false positives though), but it could be that this is because the information criteria are calculated with respect to the subsetted problem size (p=76) and not with respect to the original problem size (p=1000000). Ideally I would also like to use some higher penalisation to get rid of the 5 false positives. I thought I could perhaps simulate abess using MBIC by specifying "aic" as `tune.type` but using `ic.scale = (1/2)*log(n*(p^2)/16)`, so that 2 (the penalty factor implied by AIC) multiplied by `ic.scale` would return the penalty implied by MBIC, but that didn't seem to work. Any thoughts on how I could do something like this? I.e. penalize with respect to the original problem size and using the specific IC I am interested in (in this case MBIC)? Maybe `mbic` could also be allowed by default? And `p` could be allowed to be passed as an option to specify the original problem size?

Running abess on the original matrix X with the MCP-selected variables as the initial active set is also possible, but that's much slower of course. In that sense, it might also be nice to support MCP-based pre-screening of variables for very high-dimensional problems. An option allowing the initial active set not to be expanded when you pass one might also be a possibility; that might perhaps also be a good solution: