Thanks for your thoughtful message!
First, a couple of comments on your reprex: (1) explicitly loading SuperLearner (with library("SuperLearner")) allows the code to be copied and pasted (and run) without error; (2) I was initially confused by this issue, since the two tables of results are identical (but upon running the code, I see that the results are as you describe).
Your issue is interesting for several reasons, mostly because it is very close to correct (but not quite).
(1) The validRows object you created is slightly mismatched with the actual validation rows specified in df. SuperLearner.CV.control requires each element of the list to correspond to the v-th validation fold (i.e., the first element of the list should correspond to df$split_id == 1); your code has the first element of the list corresponding to the v = 9 validation fold. To fix this, swap in the following: validRows <- purrr::map(sort(unique(df$split_id)), ~which(.x == df$split_id)).
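For completeness, here is a minimal sketch of how those pieces can be set up ahead of the code below (df and x_df are the outcome/fold-id data frame and covariate data frame from your reprex; the SL.library and V = 10 here are assumptions):

library("SuperLearner")
# element v of validRows holds the row indices of the v-th validation fold,
# lining up with df$split_id
validRows <- purrr::map(sort(unique(df$split_id)), ~ which(.x == df$split_id))
cv_ctl <- SuperLearner::SuperLearner.CV.control(V = 10, validRows = validRows)
# full (all-covariates) cross-validated fit
full_fit <- SuperLearner::CV.SuperLearner(
  Y = df$y, X = x_df, SL.library = c("SL.glm"), cvControl = cv_ctl
)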
(2) cross_fitted_f1 and cross_fitted_f2 should be lists, not vectors; inputting vectors leads the algorithm to consider only the v-th element of the vector (rather than the entire v-th validation-set predictions). You can use the function extract_sampled_split_predictions to obtain such a list from a CV.SuperLearner fit object; running code similar to your example (but with the aforementioned fix) yields the following:
# cross-fitted predictions from the full fit, as a list of length V
cross_fitted_f1 <- extract_sampled_split_predictions(
  cvsl_obj = full_fit, sample_splitting = FALSE,
  sample_splitting_folds = rep(1, 10), full = TRUE
)
library("dplyr")  # for mutate() and bind_rows()

results_list <- list()
set.seed(1234)
for (cov in names(x_df)) {
  idx <- which(names(x_df) == cov)
  # reduced fit: regress the full fit's predictions on the remaining covariates
  red_fit <- SuperLearner::CV.SuperLearner(
    Y = full_fit$SL.predict,
    X = x_df[, -idx, drop = FALSE],
    SL.library = c("SL.glm"),
    cvControl = cv_ctl
  )
  # cross-fitted predictions from the reduced fit, as a list of length V
  cross_fitted_f2 <- extract_sampled_split_predictions(
    cvsl_obj = red_fit, sample_splitting = FALSE,
    sample_splitting_folds = rep(2, 10), full = FALSE
  )
  result <- vimp::cv_vim(
    Y = df$y,
    type = "r_squared",
    cross_fitted_f1 = cross_fitted_f1,
    cross_fitted_f2 = cross_fitted_f2,
    SL.library = c("SL.glm"),
    cross_fitting_folds = df$split_id,
    run_regression = FALSE,
    scale_est = TRUE,
    sample_splitting = FALSE
  )
  results_list[[cov]] <- mutate(result$mat, term = cov)
}
bind_rows(!!!results_list)
# A tibble: 3 x 8
s est se cil ciu test p_value term
<chr> <dbl> <dbl> <dbl> <dbl> <lgl> <dbl> <chr>
1 1 0.516 0.0225 0.471 0.560 TRUE 0 x1
2 1 0 0.000939 0 0.00184 FALSE 1.00 x2
3 1 0 0.00346 0 0.00677 FALSE 0.609 x3
We correctly pick up that x1 is important and x2 is unimportant; x3 should be important, but its true effect is quite small, so we likely need a larger sample size to pick it up.
For your final question, you're essentially right, but we can actually gain efficiency by estimating the variance of the outcome in the entire sample; estimating the mean squared error in each CV fold; and then computing R-squared based on these two quantities. So in small samples (n = 1000 could count as "small") you might see a difference between the approach in cv_vim and the fold-specific approach that you took.
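To make that concrete, here is a rough sketch of the two calculations (not the vimp internals; y, preds, and folds stand in for the outcome, the cross-fitted predictions, and the fold labels):

# fold-specific: MSE and outcome variance both estimated within each fold
r2_by_fold <- sapply(sort(unique(folds)), function(v) {
  idx <- folds == v
  1 - mean((y[idx] - preds[idx])^2) / var(y[idx])
})
mean(r2_by_fold)

# cv_vim-style (per the description above): MSE per fold, but the outcome
# variance estimated once on the entire sample
mse_by_fold <- sapply(sort(unique(folds)), function(v) {
  idx <- folds == v
  mean((y[idx] - preds[idx])^2)
})
1 - mean(mse_by_fold) / var(y)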
Ah, thank you so much! And sorry for the minor issues in the repro. Too much copying and pasting as I made small changes :)
I'm really glad that the problem was, in fact, me doing something stupid.
I think the documentation confused me a bit: in cv_vim you write "the predicted values [...]; a list of length V" -- I interpreted this semi-colon as there being two options for ways to input this data. I think it could be reworded so that it's clear the list is necessary (e.g. "the predicted values [...] in the training data provided as a list of length V"). Perhaps it could also throw an error if this argument is not a list? It doesn't seem like there should ever be a case where it isn't, should there? (Or alternatively, since cross_fitting_folds is an argument as well, it seems like this list object could be constructed automatically inside cv_vim.)
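For concreteness, the conversion I have in mind is something along these lines (a hypothetical sketch, not current cv_vim behavior; cross_fitted_preds is an assumed vector of cross-fitted predictions):

# split a vector of cross-fitted predictions into the list-of-length-V format,
# reusing the fold labels that are already passed as cross_fitting_folds
preds_list <- lapply(sort(unique(df$split_id)), function(v) {
  cross_fitted_preds[df$split_id == v]
})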
Thanks for letting me know about the documentation, I'll fix that! And adding a check/error if the argument isn't a list is a good idea while I figure out if we can construct a list inside cv_vim.
Hi again, I have a further issue in working with VIMP :)
One seems to be a straightforward typo somewhere in the cv_vim function, in which the arguments cross_fitted_f2 and cross_fitted_f1 are swapped. The second may just be a bug in my thinking about your method. For both, here is a repro on simulated data:
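The data-generating setup is roughly the following (a sketch; the coefficients, noise, and sample size here are assumptions rather than the exact reprex values):

set.seed(1234)
n <- 1000
x_df <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
# x1 and x3 have nonzero effects (x3 small), x2 is null
df <- data.frame(
  y = x_df$x1 + 0.1 * x_df$x3 + rnorm(n),
  split_id = sample(rep(1:10, length.out = n))
)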
As expected, x1 and x3 are nonzero and x2 is null. The data is very informative about these relationships. This gives the result:
That is, x1 and x3 are estimated to have zero importance. I get estimates that are appropriately non-zero for x1 and x3 when I swap the arguments cross_fitted_f1 and cross_fitted_f2 in the call to cv_vim and rerun the same code:

This seems like a simple case of having the two arguments swapped somewhere in your code (this can also be verified by setting scale_est = FALSE). I imagine this popped up at some point as you were adding functionality. Perhaps this would be worth a test case, to make sure this doesn't happen in the future?

I'm additionally a bit confused about why the "rough_vimp" measure doesn't more closely align with the results from cv_vim. My reading of your papers suggests that when I'm not doing sample splitting, the estimator is (essentially) the average of the difference in R^2 within splits like this. I'm specifically thinking about Algorithm 2 in your Biometrics paper and in your preprint (https://arxiv.org/abs/2004.03683). This also seems to align with your more specific discussion of Example 1 in Section 9 of the preprint. Am I missing something obvious here?
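For reference, the fold-wise computation I have in mind is roughly this (a sketch; y, folds, preds_full, and preds_reduced are the assumed outcome, fold labels, and cross-fitted predictions from the full and reduced fits):

# average, across folds, the within-fold difference in R^2 between the
# full and reduced fits
rough_vimp <- mean(sapply(sort(unique(folds)), function(v) {
  idx <- folds == v
  r2_full <- 1 - mean((y[idx] - preds_full[idx])^2) / var(y[idx])
  r2_red  <- 1 - mean((y[idx] - preds_reduced[idx])^2) / var(y[idx])
  r2_full - r2_red
}))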