Open tomwenseleers opened 1 year ago
Thank you for your constructive comment!
I have quickly read some statistical inference papers on high-dimensional variable selection. One approach is based on the residual bootstrap (DOI: 10.1198/jasa.2011.tm10159), which can provide confidence intervals for sparse regression coefficient estimates.
Another approach aims to find a collection of variable sets that covers the true effective variable set with high confidence (https://arxiv.org/pdf/2209.09299.pdf). The bootstrap method you mentioned is also discussed in Section 4.1 of that same paper. If you are interested, we would be happy to share our code with you once we have successfully implemented it.
Many thanks! If you would like to go for inference based on bootstrapping, one possibility could perhaps also be to register your package with the marginaleffects package, which already has some bootstrap-based inference implemented (https://vincentarelbundock.github.io/marginaleffects/articles/bootstrap.html#bootstrap) as well as an extension mechanism (https://vincentarelbundock.github.io/marginaleffects/articles/extensions.html), and its author is always quite happy to implement extra features... That could be nice if you would like users to be able to calculate post-hoc contrasts and the like. If all you would like to show is the significance of each individual coefficient versus zero, then what you suggest above would of course be enough.
I have read the documentation for the marginaleffects package (https://vincentarelbundock.github.io). It is a great package with detailed tutorials, and I largely agree that we could implement bootstrapping via this package. However, I am not entirely certain whether calculating post-hoc contrasts would be of great practical help (I apologize if this is due to my lack of familiarity with this field). Our team will discuss this point internally, which may take some time, but we will get back to you as soon as possible.
Linking this post here, which may be relevant: https://stats.stackexchange.com/questions/617990/inference-for-high-dimensional-models-based-on-running-a-glm-on-the-union-of-s?noredirect=1#comment1148780_617990, together with some quick tests of that idea of doing inference by fitting a regular GLM on the union of selected variables over many bootstrapped datasets: https://github.com/tomwenseleers/L0glm/blob/master/benchmarks/benchmarks%20multichannel%20deconvolution%20gaussian%20L0glm%20abess%20glmnet.R. In the end, what works best is to select the set of variables that appear in at least 30% of the best subset fits on bootstrapped data, similar to what is done in the bolasso to achieve variable selection consistency (https://arxiv.org/abs/0804.1302) (though there the threshold is typically taken much higher, e.g. at 0.9, leading to slower convergence to the true set). Such an approach might be convenient to avoid having to deal with interfacing the package to marginaleffects, emmeans and the like. And maybe you are right that people fitting models with tens of thousands of variables will typically not be the ones making effect plots of particular factors in their model, so basic inference on the model coefficients is probably sufficient there. For those doing best subset selection on models with hundreds of variables and best subsets with fewer than 10 contributing variables or so, things might be different; likewise for cases where the solution is extremely sparse, i.e. datasets with tens of thousands of variables but still only a couple of contributing variables.
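For concreteness, a rough sketch of what I mean could look something like the code below. The function name bootstrap_selection_glm, the 50 replicates and the 30% cutoff are just illustrative choices of mine (not abess functionality), and the coefficient extraction just mirrors how abess_fit$beta is accessed elsewhere in this thread.
# Sketch of the bootstrap-selection idea: refit abess on bootstrapped data, keep
# variables selected in at least `prop` of the fits, then run a plain GLM on that subset.
library(abess)
bootstrap_selection_glm <- function(x, y, B = 50, prop = 0.3, tune = "cv") {
  n <- length(y)
  p <- ncol(x)
  sel_counts <- numeric(p)
  for(b in 1:B) {
    idx <- sample(n, n, replace = TRUE)                              # bootstrap resample
    fit <- abess(x[idx, , drop = FALSE], y[idx], tune.type = tune)
    beta_b <- as.numeric(as.matrix(fit$beta[, paste(fit$best.size)]) != 0)
    sel_counts <- sel_counts + beta_b                                # per-variable selection counts
  }
  keep <- which(sel_counts / B >= prop)                              # selection frequency >= prop
  df <- data.frame(y = y, x[, keep, drop = FALSE])
  # ordinary GLM on the stable subset; inference is conditional on this subset being correct
  list(selected = keep, frequency = sel_counts / B, fit = glm(y ~ ., data = df))
}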
Thank you for your suggestion regarding statistical inference in abess. We found the methods you mentioned to be relatively effective. Additionally, we have developed a method that combines abess with the technique proposed in (https://arxiv.org/pdf/2209.09299.pdf), and we have obtained promising results. We have attached a screenshot of the output for your reference.
Furthermore, we would like to share our code with you. We hope this will be useful for users who require advanced statistical inference features. Thank you once again for your valuable input.
## library abess
library(abess)
## functions
# The following is a function to determine whether a column vector is in a matrix.
# If the column vector is in the matrix, this function returns its position.
vec_in_matrix <- function(mat, vec) {
  k_max <- ncol(mat)
  flag <- 0
  k <- 1
  while(!flag & k <= k_max) {
    if(!sum(mat[, k] != vec)) flag <- k
    k <- k + 1
  }
  return(flag)
}
# Below is a function that takes input samples and returns a set of selected variables
# that with high probability contains the actual support set.
# The selected variables are stored in the output matrix as column vectors.
Search_Candidate_Models <- function(x, y, cyc=100, tune="cv") {
  n <- length(y)
  p <- ncol(x)
  Sd_abess <- matrix(2, p, 1)
  for(i in 1:cyc) {
    x_rs <- cbind(x, rnorm(n))
    abess_fit <- abess(x_rs, y, always.include = p + 1, tune.type = tune)
    beta_j <- as.numeric(as.matrix(abess_fit$beta[-(p + 1), paste(abess_fit$best.size)]) != 0)
    if(!vec_in_matrix(Sd_abess, beta_j)) Sd_abess <- cbind(Sd_abess, beta_j)
  }
  Sd_abess <- Sd_abess[, -1]
  return(as.matrix(Sd_abess))
}
# Below is a function that computes the p-value for each support set in the candidate models.
p_value_of_tua <- function(x, y, tua, cyc_p_value=100, tune="cv") {
  n <- length(y)
  p <- ncol(x)
  if(!sum(tua)) return(0)
  x_tua <- x[, tua != 0]
  H <- x_tua %*% solve(t(x_tua) %*% x_tua) %*% t(x_tua)
  a_obs <- H %*% y
  b_obs <- sqrt(sum(((diag(n) - H) %*% y) ^ 2))
  size_tua <- sum(tua)
  Tua_abess <- matrix(2, p, 1)
  times_Tua_abess <- c()
  abess_tua <- as.numeric(as.vector(abess(x, y, support.size = size_tua, tune.type = tune)$beta) != 0)
  for(j in 1:cyc_p_value) {
    u_j <- (diag(n) - H) %*% rnorm(n)
    y_j <- a_obs + b_obs / sqrt(sum((u_j) ^ 2)) * u_j
    abess_tua_j <- as.numeric(as.vector(abess(x, y_j, support.size = size_tua, tune.type = tune)$beta) != 0)
    where_tua <- vec_in_matrix(Tua_abess, abess_tua_j)
    if(where_tua) {
      times_Tua_abess[where_tua - 1] <- times_Tua_abess[where_tua - 1] + 1
    } else {
      Tua_abess <- cbind(Tua_abess, abess_tua_j)
      times_Tua_abess <- c(times_Tua_abess, 1)
    }
  }
  where_tua <- vec_in_matrix(Tua_abess, abess_tua)
  if(where_tua) {
    p_value_abess <- sum(times_Tua_abess[times_Tua_abess <= times_Tua_abess[where_tua - 1]]) / cyc_p_value
  } else {
    p_value_abess <- 0
  }
  return(p_value_abess)
}
# Below is a function that obtains the confidence set for support sets with confidence level of 1-alpha.
# The support sets are stored in the output matrix as column vectors.
Confidence_set <- function(x, y, Sd, alpha=0.05, cyc_p_value=100, tune="cv") {
  p <- nrow(Sd)
  gam <- matrix(2, p, 1)
  for(i in 1:ncol(Sd)) {
    if(p_value_of_tua(x, y, Sd[, i], cyc_p_value = cyc_p_value, tune = tune) >= alpha) {
      gam <- cbind(gam, Sd[, i])
    }
  }
  gam <- gam[, -1]
  return(as.matrix(gam))
}
# The following is the function to calculate the confidence interval of the i-th variable.
confidence_interval <- function(x, y, Sd, i, alpha=0.05) {
  ci <- NULL
  set <- which(Sd[i, ] != 0)
  rname <- paste0("`", i, "`")
  for(j in seq_along(set)) {
    df <- as.data.frame(cbind(y, x[, Sd[, set[j]] != 0]))
    colnames(df) <- c("y", paste(which(Sd[, set[j]] != 0)))
    lm_model <- lm(y ~ ., data = df)
    fit <- confint(lm_model, level = 1 - alpha)
    if(is.null(ci)) {
      ci <- fit[rname, ]
    } else {
      if(fit[rname, 1] < ci[1]) ci[1] <- fit[rname, 1]
      if(fit[rname, 2] > ci[2]) ci[2] <- fit[rname, 2]
    }
  }
  if(is.null(ci)) ci <- c(0, 0)
  return(ci)
}
## Next is an example.
n <- 20
p <- 100
support.size <- 3
alpha <- 0.05
dataset <- generate.data(n, p, support.size)
Sd <- Search_Candidate_Models(dataset[["x"]], dataset[["y"]])
cat("The size of the candidate set:", ncol(Sd),
"\nWhether the actual support set is in the candidate set:", vec_in_matrix(Sd, dataset[["beta"]] != 0) != 0)
gam <- Confidence_set(dataset[["x"]], dataset[["y"]], Sd, alpha)
cat("\nThe size of the confidence set:", ncol(gam),
"\nWhether the actual support set is in the confidence set:", vec_in_matrix(gam, dataset[["beta"]] != 0) != 0)
active_set <- which(dataset[["beta"]] != 0)
for(i in 1:support.size) {
ci <- confidence_interval(dataset[["x"]], dataset[["y"]], Sd, active_set[i], alpha = alpha)
cat("\nThe value of the ", active_set[i], "-th variable:", dataset[["beta"]][active_set[i]],
"\tThe confidence interval of this variable:[", ci[1], ",", ci[2], "]")
}
When will confidence intervals be available? I showed results to a heart disease clinician and she asked me about them.
Thank you for your inquiry; it is an excellent question. In the algorithm above, we first construct a candidate set of support sets that, with high probability, contains the true support set (corresponding to the function Search_Candidate_Models in the code above). Then, for each support set in the candidate set, confidence intervals for the parameters are calculated and the union of these intervals is taken (corresponding to the functions Confidence_set and confidence_interval in the code above). The resulting confidence intervals hold at a guaranteed confidence level.
Do you think it might also be an option to calculate the variance-covariance matrix from the Fisher information matrix of the selected variables, repeat this procedure for fits on bootstrapped data (e.g. 20×), pad each vcov matrix with zeroes for any variables not selected in that particular fit, and then take the average of these sparse vcov matrices? SEs and CIs for the average coefficients over all bootstrapped datasets could then be calculated from this average vcov matrix.
Hi @tomwenseleers, thank you for your input. Could you please provide corresponding statistical references for the method you mentioned above? Perhaps we should first thoroughly review the literature before providing you with a more definitive response.
@Mamba413 No, I haven't seen this exact method being used - perhaps mainly because until recently we didn't have access to good L0-penalized GLM fitting engines... But it seems like a very simple method that logically could work; it would just need to be verified numerically whether the procedure ends up having the correct coverage. The method would entail that for every abess fit on a bootstrapped dataset one calculates the Fisher information matrix for the selected variables as FIM = (1/dispersion) * Xw'Xw, where Xw = sqrt(weights) * X with the weights being the working weights; the variance-covariance matrix is then vcov = solve(FIM) and the SEs of the coefficients on the link scale are sqrt(diag(vcov)). If in a given fit some variables are not selected, I would think the covariance with those variables should be estimated as zero. If there were no uncertainty about the true support, one could just do this for the selected variables and get away without averaging the vcov matrices calculated from fits on bootstrapped data; the SEs, CIs and p values would then be conditional on the inferred support being correct, and the approach would amount to just running a GLM on the abess-selected subset. If there is uncertainty about the true support, then averaging the vcov matrices seems like a logical way to take that uncertainty into account (it would assume that with a certain probability the covariance with a given parameter is estimated as zero, while in other cases it is estimated as positive). The question of course is how many bootstrap replicates one would have to do - perhaps that could be guided by the point at which the size of the union of selected variables starts to reach an asymptote? Plain bootstrapping would of course also be possible - just doing your abess fit 1000 times - but that would be slow, and with an approach like the one I mention here I could imagine getting away with maybe just 20 bootstrap replicates... A quick numerical test could show whether this method gives approximately the correct coverage...
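To make this a bit more concrete, a rough sketch for the Gaussian case could look something like the code below, where solve(FIM) for the selected variables is simply taken as vcov() of an ordinary lm() refit on that subset. The name avg_vcov_inference, the 20 replicates and the normal-approximation intervals are just illustrative assumptions on my part, and whether this gives correct coverage is exactly what would need to be tested.
# Sketch of the vcov-averaging idea (Gaussian case): per bootstrap fit, refit lm() on the
# abess-selected variables, zero-pad its coefficient vector and vcov matrix to all p
# predictors, and average over replicates. Assumes each refit is full rank.
library(abess)
avg_vcov_inference <- function(x, y, B = 20, tune = "cv") {
  n <- length(y)
  p <- ncol(x)
  coef_sum <- numeric(p)
  vcov_sum <- matrix(0, p, p)
  for(b in 1:B) {
    idx <- sample(n, n, replace = TRUE)
    fit <- abess(x[idx, , drop = FALSE], y[idx], tune.type = tune)
    beta_b <- as.vector(as.matrix(fit$beta[, paste(fit$best.size)]))
    sel <- which(beta_b != 0)
    if(!length(sel)) next                                      # nothing selected in this fit
    refit <- lm(y[idx] ~ x[idx, sel, drop = FALSE])
    coef_b <- numeric(p)
    coef_b[sel] <- coef(refit)[-1]                             # drop the intercept
    V <- matrix(0, p, p)
    V[sel, sel] <- vcov(refit)[-1, -1, drop = FALSE]           # zero-padded vcov = solve(FIM)
    coef_sum <- coef_sum + coef_b
    vcov_sum <- vcov_sum + V
  }
  coef_avg <- coef_sum / B
  se <- sqrt(diag(vcov_sum / B))
  data.frame(estimate = coef_avg, se = se,
             lower = coef_avg - qnorm(0.975) * se, upper = coef_avg + qnorm(0.975) * se)
}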
@tomwenseleers Thank you for providing us with a comprehensive explanation of your proposed method. After conducting a thorough search based on your description, we were unable to find any relevant literature discussing this approach. Our team has extensively deliberated on this internally, and ultimately, we have decided not to allocate additional resources for the implementation and maintenance of this feature. This decision is primarily due to the fact that we often find ourselves needing to respond to user inquiries regarding the feasibility of such methods, yet our understanding in this regard remains somewhat unclear.
Considering that the implementation of this method does not pose significant challenges, we wholeheartedly welcome you and your team to build upon our work, conduct experiments or theoretical analyses, and publish your research findings in relevant journals. When the time comes, we would be delighted to explore various ways to seamlessly integrate this method into abess.
Lastly, we would like to express our sincere gratitude for your active usage of abess and your proactive inquiries. Your engagement is greatly appreciated.
No worries - I have some bioinformatics master's students working on this exact topic to figure out which approaches would or would not work... I'll let you know once we find out... In general, I think having provisions for inference, including high-dimensional inference, would be a very worthwhile and much-requested feature. But it is true that it would ideally need some theory...
More of a feature request - do you think it might be possible to implement any form of statistical inference in abess? E.g. provide confidence intervals on coefficients via bootstrapping? Or repeatedly bootstrap the dataset and refit an abess model on each of those bootstrapped datasets, calculate the union of selected variables across all fits, repeat this until the size of the union of selected variables no longer grows, and then refit a single regular GLM using base R's glm() function on this union of selected variables (I don't know if something like this has ever been suggested in the literature - I thought I had seen a suggestion along these lines in one of the Tibshirani articles on selective inference, but I can't seem to find it now).
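For what it's worth, a rough sketch of what I have in mind is given below; union_refit_glm, max_B and the "no growth for 5 replicates" stopping rule are just illustrative choices of mine, nothing from abess itself.
# Sketch of the union-of-selected-variables idea: bootstrap, refit abess, grow the union
# of selected variables, stop once the union has not grown for `patience` replicates,
# then fit one regular GLM on that union with base R's glm().
library(abess)
union_refit_glm <- function(x, y, max_B = 200, patience = 5, tune = "cv") {
  n <- length(y)
  union_sel <- integer(0)
  no_growth <- 0
  for(b in 1:max_B) {
    idx <- sample(n, n, replace = TRUE)
    fit <- abess(x[idx, , drop = FALSE], y[idx], tune.type = tune)
    beta_b <- as.vector(as.matrix(fit$beta[, paste(fit$best.size)]))
    new_union <- union(union_sel, which(beta_b != 0))
    no_growth <- if(length(new_union) == length(union_sel)) no_growth + 1 else 0
    union_sel <- new_union
    if(no_growth >= patience) break                       # union no longer grows
  }
  df <- data.frame(y = y, x[, union_sel, drop = FALSE])
  list(union = sort(union_sel), fit = glm(y ~ ., data = df))
}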