bdwilliamson / vimp

Perform inference on algorithm-agnostic variable importance
http://bdwilliamson.github.io/vimp/

`cv_vim` arguments `cross_fitted_f1` and `cross_fitted_f2` are swapped #16

Closed: ddimmery closed this issue 3 years ago

ddimmery commented 3 years ago

Hi again, I have two further issues in working with VIMP :)

The first seems to be a straightforward typo somewhere in the cv_vim function, in which the arguments cross_fitted_f2 and cross_fitted_f1 are swapped. The second may just be a bug in my thinking about your method.

For both, here is a repro on simulated data:

library(vimp)
library(purrr)
library(dplyr)
library(estimatr)

set.seed(100)
n <- 1000
df <- dplyr::tibble(
    id = 1:n,
    x1 = rnorm(n),
    x2 = rnorm(n),
    x3 = runif(n),
    y = x1 + 0.25 * x3 + rnorm(n),
    split_id = sample(10, n, replace = TRUE)
)

x_df <- select(df, x1, x2, x3)

full_lm <- lm_robust(y ~ x1 + x2 + x3, df)
print(summary(full_lm))
Call:
lm_robust(formula = y ~ x1 + x2 + x3, data = df)

Standard error type:  HC2 

Coefficients:
            Estimate Std. Error t value   Pr(>|t|) CI Lower CI Upper  DF
(Intercept) -0.08031    0.06218 -1.2914  1.969e-01 -0.20233  0.04172 996
x1           1.02250    0.03096 33.0289 3.661e-162  0.96175  1.08325 996
x2          -0.01543    0.03340 -0.4619  6.442e-01 -0.08098  0.05012 996
x3           0.37734    0.10817  3.4882  5.075e-04  0.16506  0.58961 996

Multiple R-squared:  0.5237 ,   Adjusted R-squared:  0.5223 
F-statistic: 363.7 on 3 and 996 DF,  p-value: < 2.2e-16

As expected, x1 and x3 are nonzero and x2 is null. The data is very informative about these relationships.

validRows <- purrr::map(unique(df$split_id), ~which(.x == df$split_id))
cv_ctl <- SuperLearner::SuperLearner.CV.control(V = length(validRows), validRows = validRows)

full_fit <- SuperLearner::CV.SuperLearner(
    Y = df$y,
    X = x_df,
    SL.library = c("SL.glm"),
    cvControl = cv_ctl
)

df$full_preds <- full_fit$SL.predict

df %>%
group_by(split_id) %>%
summarize(rsq = 1 - mean((y - full_preds)^2) / var(y)) -> rsq_full

results <- list()
for (cov in names(x_df)) {
    idx <- which(names(x_df) == cov)
    red_fit <- SuperLearner::CV.SuperLearner(
        Y = full_fit$SL.predict,
        X = x_df[, -idx, drop = FALSE],
        SL.library = c("SL.glm"),
        cvControl = cv_ctl
    )

    df$red_preds <- red_fit$SL.predict

    df %>% 
    group_by(split_id) %>%
    summarize(rsq = 1 - mean((y - red_preds)^2) / var(y)) -> rsq_reduced

    result <- vimp::cv_vim(
        Y = df$y,
        type = "r_squared",
        cross_fitted_f1 = full_fit$SL.predict,
        cross_fitted_f2 = red_fit$SL.predict,
        SL.library = c("SL.glm"),
        cross_fitting_folds = df$split_id,
        run_regression = FALSE,
        scale_est = TRUE,
        sample_splitting = FALSE
    )
    vimp_rough <- inner_join(rsq_full, rsq_reduced, by = "split_id") %>% summarize(rsq=mean(rsq.x - rsq.y))
    results[[cov]] <- mutate(result$mat, term = cov, rough_vimp = unlist(vimp_rough))
}

bind_rows(!!!results)

This gives the result:

# A tibble: 3 × 9
  s        est      se     cil     ciu test   p_value term  rough_vimp
  <chr>  <dbl>   <dbl>   <dbl>   <dbl> <lgl>    <dbl> <chr>      <dbl>
1 1     0.184  0.0275  0.130   0.238   TRUE  1.20e-11 x1      0.507   
2 1     0      0.00115 0       0.00225 FALSE 1.00e+ 0 x2     -0.000690
3 1     0.0202 0.00532 0.00978 0.0306  TRUE  7.22e- 5 x3      0.00317

That is, x1 and x3 are estimated to have zero importance.

I get estimates that are appropriately non-zero for x1 and x3 when I swap the arguments cross_fitted_f1 and cross_fitted_f2 in the call to cv_vim and rerun the same code:

# A tibble: 3 × 9
  s        est      se     cil     ciu test   p_value term  rough_vimp
  <chr>  <dbl>   <dbl>   <dbl>   <dbl> <lgl>    <dbl> <chr>      <dbl>
1 1     0.184  0.0275  0.130   0.238   TRUE  1.20e-11 x1      0.507   
2 1     0      0.00115 0       0.00225 FALSE 1.00e+ 0 x2     -0.000690
3 1     0.0202 0.00532 0.00978 0.0306  TRUE  7.22e- 5 x3      0.00317

This seems like a simple case of the two arguments being swapped somewhere in your code (this can also be verified by setting scale_est = FALSE). I imagine this popped up at some point as you were adding functionality. Perhaps it would be worth a test case to make sure this doesn't happen in the future?
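Something like the following, perhaps. This is only a sketch, and make_sim_fits is a purely hypothetical helper that would simulate data as above and return the outcome, fold IDs, and cross-fitted predictions for the full regression and the regression excluding x1:

library(testthat)

test_that("cv_vim attributes importance to f1 (full) over f2 (reduced)", {
  fits <- make_sim_fits()  # hypothetical helper, not part of vimp
  result <- vimp::cv_vim(
    Y = fits$y,
    type = "r_squared",
    cross_fitted_f1 = fits$full_preds,     # full fit goes first...
    cross_fitted_f2 = fits$reduced_preds,  # ...reduced fit second
    SL.library = c("SL.glm"),
    cross_fitting_folds = fits$folds,
    run_regression = FALSE,
    scale_est = TRUE,
    sample_splitting = FALSE
  )
  # x1 is truly important, so its estimated importance should be positive
  expect_gt(result$mat$est, 0)
})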

I'm additionally a bit confused about why the "rough_vimp" measure doesn't more closely align with the results from cv_vim. My reading of your papers suggests that when I'm not doing sample splitting, the estimator is (essentially) the average of the within-split differences in R^2, as computed above. I'm specifically thinking of Algorithm 2 in your Biometrics paper and in your preprint (https://arxiv.org/abs/2004.03683). This also seems to align with your more specific discussion of Example 1 in Section 9 of the preprint. Am I missing something obvious here?

bdwilliamson commented 3 years ago

Thanks for your thoughtful message!

First, a couple of comments on your reprex: (1) explicitly loading SuperLearner (with library("SuperLearner")) allows the code to be copied and pasted (and run) without error; (2) I was initially confused by this issue, since the two tables of results you pasted are identical (but upon running the code, I see that the results are as you describe).

Your issue is interesting for several reasons, mostly because it is very close to correct (but not quite).

(1) The validRows object you created is slightly mismatched with the actual validation rows specified in df. SuperLearner.CV.control requires each element in the list to correspond to the vth validation fold (i.e., the first element in the list should correspond to df$split_id == 1); your code has the first element in the list corresponding to the v = 9 validation fold. To fix this, swap in the following: validRows <- purrr::map(sort(unique(df$split_id)), ~which(.x == df$split_id)).
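(As a quick illustration of why the sort matters: unique() returns values in their order of first appearance, not in sorted order.)

unique(c(9, 1, 9, 2))        # 9 1 2
sort(unique(c(9, 1, 9, 2)))  # 1 2 9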

(2) cross_fitted_f1 and cross_fitted_f2 should be lists, not vectors; inputting vectors leads the algorithm to consider only the vth element of the vector (rather than the entire vth validation-set predictions). You can use the function extract_sampled_split_predictions to obtain the list from a CV.SuperLearner fit object; running code similar to your example (but with the aforementioned fix) yields the following:

cross_fitted_f1 <- extract_sampled_split_predictions(
  cvsl_obj = full_fit, sample_splitting = FALSE, 
  sample_splitting_folds = rep(1, 10), full = TRUE
)
results_list <- list()
set.seed(1234)
for (cov in names(x_df)) {
  idx <- which(names(x_df) == cov)
  red_fit <- SuperLearner::CV.SuperLearner(
    Y = full_fit$SL.predict,
    X = x_df[, -idx, drop = FALSE],
    SL.library = c("SL.glm"),
    cvControl = cv_ctl
  )

  cross_fitted_f2 <- extract_sampled_split_predictions(
    cvsl_obj = red_fit, sample_splitting = FALSE,
    sample_splitting_folds = rep(2, 10), full = FALSE
  )

  result <- vimp::cv_vim(
    Y = df$y,
    type = "r_squared",
    cross_fitted_f1 = cross_fitted_f1,
    cross_fitted_f2 = cross_fitted_f2,
    SL.library = c("SL.glm"),
    cross_fitting_folds = df$split_id,
    run_regression = FALSE,
    scale_est = TRUE,
    sample_splitting = FALSE
  )
  results_list[[cov]] <- mutate(result$mat, term = cov)
}
bind_rows(!!!results_list)
# A tibble: 3 x 8
  s       est       se   cil     ciu test  p_value term 
  <chr> <dbl>    <dbl> <dbl>   <dbl> <lgl>   <dbl> <chr>
1 1     0.516 0.0225   0.471 0.560   TRUE    0     x1   
2 1     0     0.000939 0     0.00184 FALSE   1.00  x2   
3 1     0     0.00346  0     0.00677 FALSE   0.609 x3 

We correctly pick up that x1 is important and x2 is unimportant; x3 should be important, but its true effect is quite small, so we likely need a larger sample size to pick it up.

For your final question, you're essentially right, but we can actually gain efficiency by estimating the variance of the outcome in the entire sample; estimating the mean squared error in each CV fold; and then computing R-squared based on these two quantities. So in small samples (n = 1000 could count as "small") you might see a difference between the approach in cv_vim and the fold-specific approach that you took.
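Concretely, here is a minimal sketch of that computation using the objects from your reprex (just an illustration of the pooled-variance idea, not vimp's actual internals):

pooled_var <- var(df$y)  # outcome variance, estimated once on the full sample
df %>%
  group_by(split_id) %>%
  # fold-specific MSE divided by the pooled variance estimate
  summarize(rsq = 1 - mean((y - full_preds)^2) / pooled_var) %>%
  summarize(rsq = mean(rsq))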

ddimmery commented 3 years ago

Ah, thank you so much! And sorry for the minor issues in the repro; too much copying and pasting as I made small changes :)

I'm really glad that the problem was, in fact, me doing something stupid.

I think the documentation confused me a bit: in cv_vim you write "the predicted values [...]; a list of length V", and I interpreted that semicolon as meaning there are two acceptable ways to input this data. It could be reworded to make clear that the list is required (e.g. "the predicted values [...] in the training data, provided as a list of length V"). Perhaps cv_vim could also throw an error if this argument is not a list? It doesn't seem like there should ever be a case where it isn't, should there? (Alternatively, since cross_fitting_folds is already an argument, it seems like this list could be constructed automatically inside cv_vim; a hypothetical sketch of what I mean follows.)
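Here, preds_to_list is purely hypothetical (not a vimp function), just to illustrate the construction:

preds_to_list <- function(preds, folds) {
  # split a vector of cross-fitted predictions into the list of
  # per-validation-fold predictions, ordered by fold ID
  lapply(sort(unique(folds)), function(v) preds[folds == v])
}
# e.g., cross_fitted_f1 <- preds_to_list(full_fit$SL.predict, df$split_id)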

bdwilliamson commented 3 years ago

Thanks for letting me know about the documentation, I'll fix that! And adding a check/error if the argument isn't a list is a good idea while I figure out whether we can construct the list inside cv_vim.
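For the record, the kind of guard being discussed might look something like this (a sketch, not the eventual implementation):

if (!is.list(cross_fitted_f1) || !is.list(cross_fitted_f2)) {
  stop("cross_fitted_f1 and cross_fitted_f2 must be lists of length V, ",
       "with one element of validation-fold predictions per cross-fitting fold.")
}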