dmolitor / bolasso

Model consistent Lasso estimation through the bootstrap.
https://dmolitor.github.io/bolasso/

BoLasso Algorithm #6

Closed Pascal-Schmidt closed 2 years ago

Pascal-Schmidt commented 2 years ago

Thanks for the package! I have a question about the algorithm.

In the paper, the algorithm sets the regularization parameter mu before the for loop, so I think every bootstrap sample uses the same one. In the algorithm you implemented, do you tune the penalty term for every bootstrap sample with k-fold cross-validation?

I also tried the bootLassoOLS function from the HDCI package but got slightly different results. Your algorithm gives the best variable selection when fitting an OLS regression afterwards, though.

I also tried to implement the algorithm myself but get different results as well, so I must be going wrong somewhere. Maybe I am getting the decay wrong. I create 128 bootstrap samples, find the best penalty term for each sample with 10-fold cross-validation, refit the model on the entire bootstrap sample with the best penalty term, and record all the coefficients that are non-zero.

Here is a code snippet of how it looks:

library(tidymodels)
library(tidyverse)
library(broom)

lasso_spec <- linear_reg(penalty = tune::tune(), mixture = 1) %>%
  set_engine("glmnet")

# `recipe_lasso` here is an initial recipe built on the full data earlier in
# the script; `metrics` used below is a yardstick metric set that includes rmse
df_boot <- recipe_lasso %>% recipes::prep() %>% recipes::juice()
brain_folds <- rsample::bootstraps(df_boot, times = 128)
lambda_grid <- dials::grid_latin_hypercube(penalty(), size = 50)

res <- dplyr::tibble()
# For each bootstrap sample: tune lambda with 10-fold CV, refit on the whole
# bootstrap sample at the best lambda, and keep the non-zero coefficients
for (i in 1:nrow(brain_folds)) {

  df <- analysis(brain_folds$splits[[i]])
  recipe_lasso <- recipes::recipe(
    Cattell_TotalScore ~ ., data = df
  ) %>%
    recipes::step_dummy(all_nominal()) %>% 
    recipes::step_normalize(all_numeric(), -all_outcomes()) 

  wf_lasso <- workflows::workflow() %>%
    workflows::add_recipe(recipe_lasso) %>% 
    workflows::add_model(lasso_spec)

  folds <- rsample::vfold_cv(df, v = 10)

  lasso_grid <- tune::tune_grid(
    wf_lasso,
    metrics = metrics,
    resamples = folds,
    grid = lambda_grid
  )

  lowest_rmse <- lasso_grid %>%
    tune::select_best("rmse", maximize = FALSE)

  final_lasso <- finalize_workflow(
    wf_lasso,
    lowest_rmse
  ) %>% 
    parsnip::fit(df)

  res <- res %>% 
    dplyr::bind_rows(
      final_lasso %>%
        workflows::extract_fit_parsnip() %>% 
        broom::tidy() %>% 
        dplyr::filter(estimate != 0) %>% 
        dplyr::mutate(boot = paste0("boot: ", i))
    )
  print(i)

}

res %>% 
  dplyr::group_by(term) %>% 
  dplyr::summarise(
    n = dplyr::n(),
    avg = mean(estimate)
  ) %>% 
  dplyr::ungroup() %>% 
  dplyr::mutate(prop = n/128) %>% 
  dplyr::filter(prop >= 0.98)

So my question is: could you explain in 4-5 bullet points how you implemented the algorithm? That would be great, so I can better understand what was wrong with my code. Thank you.

dmolitor commented 2 years ago

Hey Pascal, no problem. It's interesting to see somebody other than me actually testing it out! 😁

Choosing the regularization parameter

You are correct that it doesn't use the same regularization parameter across all bootstrap replicates but, as you noted, tunes it using k-fold CV on every bootstrap replicate. While this might not exactly line up with the paper's algorithm description (honestly can't quite remember), my understanding is that in practice this leads to better convergence/performance.
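
To make the contrast concrete, here is a rough sketch of the two options using plain glmnet on simulated data. This is not the package's actual code, and the "paper-style" branch is just one way you might fix a single lambda up front; it's only meant to illustrate the difference.

library(glmnet)

# Simulated data purely for illustration
set.seed(1)
n <- 200; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- drop(X[, 1:3] %*% c(2, -1, 0.5) + rnorm(n))
boot_idx <- lapply(1:25, function(i) sample(n, replace = TRUE))

# Paper-style: fix a single lambda up front (here via CV on the full data)
# and reuse it for every bootstrap replicate
lambda_fixed <- cv.glmnet(X, y)$lambda.min
coefs_fixed <- lapply(boot_idx, function(idx) {
  drop(coef(glmnet(X[idx, ], y[idx]), s = lambda_fixed))
})

# bolasso-style (roughly): re-run k-fold CV inside each bootstrap replicate
# and take the coefficients at that replicate's lambda.min
coefs_cv <- lapply(boot_idx, function(idx) {
  drop(coef(cv.glmnet(X[idx, ], y[idx]), s = "lambda.min"))
})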

Why we get different results

Based on the code snippet you provided, I'm pretty sure you're not doing anything wrong per se. If I had to guess why your own implementation gives different results, I would say it's most likely because you're creating and tuning over your own lambda grid rather than letting glmnet build its own sequence. At least I'm pretty sure that's what's happening here (tbh I don't use tidymodels, so I'm doing a bit of guessing):

lasso_grid <- tune::tune_grid(
    wf_lasso,
    metrics = metrics,
    resamples = folds,
    grid = lambda_grid
)

lowest_rmse <- lasso_grid %>%
    tune::select_best("rmse", maximize = FALSE)

The glmnet documentation says the following about manually supplying the lambda sequence:

Optional user-supplied lambda sequence; default is NULL, and glmnet chooses its own sequence. Note that this is done for the full model (master sequence), and separately for each fold. The fits are then aligned using the master sequence (see the alignment argument for additional details). Adapting lambda for each fold leads to better convergence. When lambda is supplied, the same sequence is used everywhere, but in some GLMs can lead to convergence issues.

In short, glmnet suggests using its automatically generated lambda sequence, and I just follow that convention within bolasso. I'm guessing this is why we're seeing somewhat different results!
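
To see the distinction in isolation, here is a minimal sketch with made-up data (again, not bolasso's code): the first call lets cv.glmnet build its own lambda sequence, while the second reuses one hand-built grid everywhere, which is roughly what tuning over a latin-hypercube penalty grid amounts to.

library(glmnet)

set.seed(1)
X <- matrix(rnorm(200 * 10), 200, 10)
y <- drop(X[, 1:3] %*% c(2, -1, 0.5) + rnorm(200))

# glmnet's default: it builds its own lambda sequence (what bolasso relies on)
fit_auto <- cv.glmnet(X, y)

# User-supplied sequence: the same grid is reused for the full model and every fold
user_grid <- 10^seq(0, -4, length.out = 50)
fit_user <- cv.glmnet(X, y, lambda = user_grid)

# The two fits can land on slightly different lambda.min values
c(auto = fit_auto$lambda.min, user = fit_user$lambda.min)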

What exactly bolasso is doing under the hood

The following code shows pretty much exactly what the package is actually doing. The package just adds a bunch more junk so the different Lasso implementations play nice together 😆

data(PimaIndiansDiabetes, package = "mlbench")

### Results from the Bolasso package #######################################

library(bolasso)
#> Loading required package: Matrix

set.seed(123)
model <- bolasso(
  diabetes ~ .,
  data = PimaIndiansDiabetes,
  n.boot = 100, 
  implement = "glmnet",
  family = "binomial"
)
#> Loaded glmnet 4.1-4
selected_vars(model, threshold = 0.98)
#> # A tibble: 5 × 2
#>   variable  mean_coef
#>   <chr>         <dbl>
#> 1 Intercept   -8.15  
#> 2 pregnant     0.119 
#> 3 glucose      0.0348
#> 4 mass         0.0821
#> 5 pedigree     0.849

### What it's doing under the hood #########################################

library(dplyr)
library(glmnet)
library(tidyr)

set.seed(123)
bootstraps <- lapply(
  1:100,
  function(i) {
    idx <- sort(sample(nrow(PimaIndiansDiabetes), replace = TRUE))
    PimaIndiansDiabetes[idx, ]
  }
)
coefs <- lapply(
  bootstraps,
  function(d) {
    X <- model.matrix(diabetes ~ . - 1, data = d)
    y <- d$diabetes
    drop(coef(cv.glmnet(x = X, y = y, family = "binomial"), s = "lambda.min"))
  }
)
# The reason these coefficients aren't 100% identical to the bolasso output
# above is the randomness in how cv.glmnet assigns observations to the k folds.
coefs |>
  bind_rows() |>
  summarise(
    across(
      .fns = list("prop" = ~ (\(i) sum(i == 0)/100)(.x), "mean" = ~ mean(.x))
    )
  ) |>
  pivot_longer(
    cols = everything(),
    names_to = c("var", "meas"),
    names_sep = "_"
  ) |>
  group_by(var) |>
  mutate(drop = value[meas == "prop"] < .02) |>
  ungroup() |>
  filter(drop, meas == "mean") |>
  select(var, value)
#> # A tibble: 5 × 2
#>   var           value
#>   <chr>         <dbl>
#> 1 (Intercept) -8.17  
#> 2 pregnant     0.119 
#> 3 glucose      0.0350
#> 4 mass         0.0823
#> 5 pedigree     0.854
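
As an aside, if you ever want that fold randomness pinned down for a one-off comparison, cv.glmnet's standard foldid argument fixes the fold assignment, so repeated runs on the same data pick the same lambda. A minimal sketch:

library(glmnet)
data(PimaIndiansDiabetes, package = "mlbench")

X <- model.matrix(diabetes ~ . - 1, data = PimaIndiansDiabetes)
y <- PimaIndiansDiabetes$diabetes

# Assign every observation to a fold once, then reuse that assignment
set.seed(123)
fold_id <- sample(rep(1:10, length.out = nrow(X)))

fit1 <- cv.glmnet(X, y, family = "binomial", foldid = fold_id)
fit2 <- cv.glmnet(X, y, family = "binomial", foldid = fold_id)
identical(fit1$lambda.min, fit2$lambda.min)  # TRUE, since the folds are fixed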

Hopefully this all is helpful!

Pascal-Schmidt commented 2 years ago

Thanks for the response :)

I guess the difference is the lambda sequence then. I was using grid_latin_hypercube but will try to implement the algorithm with the glmnet package and its lambda sequence.

Thanks again for the package and putting it on CRAN!