Feature importance p-values

lang-benjamin commented 6 months ago

In addition to permutation-based feature importance, there are permutation-based p-values for feature importance (Altmann, A., Tolosi, L., Sander, O. & Lengauer, T. (2010). Permutation importance: a corrected feature importance measure, Bioinformatics 26:1340-1347). Essentially only the ranger package implements this, via its importance_pvalues function. Do you think such a function would be helpful? I imagine it could help in judging whether a feature is relevant.
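For reference, here is a minimal sketch of the ranger call I have in mind. It assumes the ranger package is installed; the iris data, the seed, and the choice of 100 permutations are only for illustration.

```r
library(ranger)

# Fit a random forest with permutation importance enabled
rf <- ranger(Species ~ ., data = iris, importance = "permutation", seed = 1)

# Altmann-style p-values: the response is permuted and the forest refit
# num.permutations times to build a null distribution of importances
pv <- importance_pvalues(
  rf,
  method = "altmann",
  num.permutations = 100,
  formula = Species ~ .,
  data = iris
)

# One row per predictor, with the observed importance and its p-value
pv
```

More permutations give finer-grained p-values at the cost of refitting the forest more often.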

brian-j-smith commented 6 months ago

Thanks for asking about permutation p-values. I hadn't thought about those before, but they can be calculated with the varimp() function. Below is an example of permutation-based calculations for variable importance followed by p-values. This method may differ from the one in the Altmann et al. paper but is permutation-based nonetheless. Like variable importance, these p-values can be computed for any model and with any appropriate performance metric supplied by the package.

```r
# Load analytic packages
library(MachineShop)
library(ggplot2)

# Set up a parallel backend for faster permutations
library(doParallel)
registerDoParallel()

# Fit any MachineShop model
mdl_fit <- fit(sale_amount ~ ., data = ICHomes, model = GLMModel)

# Permutation variable importance
vi <- varimp(mdl_fit, samples = 1000)
plot(vi)

# Permutation p-values

## Custom varimp() stats function to compute permutation p-values
## Argument x is the difference between permuted and observed model performances
## for a variable
pval <- function(x) {
  c("pvalue" = min(2 * mean(x <= 0), 1))
}

## Call varimp() with the p-value function
permpval <- varimp(
  mdl_fit,
  scale = FALSE,
  samples = 1000,
  stats = pval
)
plot(permpval) + labs(y = "Permutation p-value")
```
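To see what the pval() function does on its own, here is a small base R sketch on simulated numbers. The two vectors are made-up stand-ins for the differences a stats function would receive for one variable, not MachineShop output, and I am assuming a sign convention in which positive differences mean that permuting the variable made performance worse.

```r
set.seed(123)

# Same stats function as above: twice the proportion of differences at or
# below zero, capped at 1
pval <- function(x) {
  c("pvalue" = min(2 * mean(x <= 0), 1))
}

# Simulated differences (assumed data for illustration only)
diff_relevant <- rnorm(1000, mean = 3, sd = 1)    # permuting consistently hurts
diff_irrelevant <- rnorm(1000, mean = 0, sd = 1)  # permuting has no systematic effect

# Few differences fall at or below zero, so the p-value is small
pval(diff_relevant)

# About half fall at or below zero, so the p-value is close to 1
pval(diff_irrelevant)
```

With 1000 permutation samples the smallest nonzero p-value this function can return is 2/1000, so the samples argument sets the resolution.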

lang-benjamin commented 6 months ago

Thank you for the comment. I really like the flexibility of the package.

brian-j-smith / MachineShop

Feature importance p-values #10