Open lang-benjamin opened 6 months ago
Thanks for asking about permutation p-values. I hadn't thought about those before, but they can be calculated with the varimp() function. Below is an example of permutation-based variable importance followed by permutation p-values. This method may differ from the one in the Altmann et al. paper but is permutation-based nonetheless. Like variable importance, these p-values can be computed for any model and with any appropriate performance metric supplied by the package.
# Load analytic packages
library(MachineShop)
library(ggplot2)
# Set up a parallel backend for faster permutations
library(doParallel)
registerDoParallel()
# Fit any MachineShop model
mdl_fit <- fit(sale_amount ~ ., data = ICHomes, model = GLMModel)
# Permutation variable importance
vi <- varimp(mdl_fit, samples = 1000)
plot(vi)
# Permutation p-values
## Custom varimp() stats function to compute permutation p-values
## Argument x is the difference between permuted and observed model performances
## for a variable
pval <- function(x) {
  c("pvalue" = min(2 * mean(x <= 0), 1))
}
## Call varimp() with the p-value function
permpval <- varimp(
  mdl_fit,
  scale = FALSE,
  samples = 1000,
  stats = pval
)
plot(permpval) + labs(y = "Permutation p-value")
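To make the p-value function easier to follow, here is a minimal base R sketch that applies it to simulated differences instead of a fitted model. The two vectors diff_informative and diff_noise are hypothetical stand-ins for what varimp() would pass as x, and the sketch assumes that a positive difference means permuting the variable made the model perform worse.

```r
# Same p-value function as in the example above, repeated so this runs on its own
pval <- function(x) {
  c("pvalue" = min(2 * mean(x <= 0), 1))
}

set.seed(123)

# Hypothetical permuted-minus-observed performance differences (assumption:
# positive values mean the permutation degraded the model)
diff_informative <- rnorm(1000, mean = 3, sd = 1)  # permuting clearly hurts
diff_noise <- rnorm(1000, mean = 0, sd = 1)        # permuting has no systematic effect

# Proportion of differences at or below zero, doubled and capped at 1
p_informative <- pval(diff_informative)
p_noise <- pval(diff_noise)

print(p_informative)
print(p_noise)
```

The informative variable should get a p-value near zero because almost none of its differences fall at or below zero, while the noise variable should get a value near one because roughly half of them do.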
Thank you for the comment. I really like the flexibility of the package.
In addition to permutation-based feature importance, there are permutation-based p-values for feature importance (Altmann, A., Tolosi, L., Sander, O. & Lengauer, T. (2010). Permutation importance: a corrected feature importance measure, Bioinformatics 26:1340-1347). Essentially only the ranger package implements this, via the importance_pvalues function. Do you think such a function would be helpful? I could imagine that it may aid in judging whether a feature is relevant or not.
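For reference, this is roughly how the ranger function mentioned above is called. It is a minimal sketch on the built-in iris data, assuming the ranger package is installed; the dataset, seed, and the small number of permutations are illustrative choices only.

```r
library(ranger)

# Fit a random forest with permutation importance enabled
# (importance_pvalues() needs a forest fitted with an importance mode)
rf <- ranger(Species ~ ., data = iris, importance = "permutation", seed = 42)

# Altmann et al. approach: permute the response, refit, and compare importances.
# num.permutations is kept small here for speed; more permutations give finer p-values.
pv <- importance_pvalues(
  rf,
  method = "altmann",
  num.permutations = 20,
  formula = Species ~ .,
  data = iris
)

print(pv)
```

The result has one row per feature, with the observed importance alongside its permutation p-value.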