TeamHG-Memex / eli5

A library for debugging/inspecting machine learning classifiers and explaining their predictions
http://eli5.readthedocs.io
MIT License

General question about Permutation feature importance #373

Open seralouk opened 4 years ago

seralouk commented 4 years ago

Hi all,

For the permutation feature importance procedure, the default number of iterations, n_iter, is 5.

See: https://eli5.readthedocs.io/en/latest/autodocs/sklearn.html#eli5.sklearn.permutation_importance.PermutationImportance

I am looking for a reference or publication that justifies the choice of a particular n_iter value.

What is the gold standard or most commonly used n_iter value?
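
For reference, this is the parameter I mean (a minimal sketch; `fitted_model`, `X_val` and `y_val` are placeholders):

```python
from eli5.sklearn import PermutationImportance

# n_iter is how many times each feature column is shuffled;
# the reported importance is the mean score drop over those shuffles.
perm = PermutationImportance(fitted_model, n_iter=5, cv="prefit",
                             random_state=0)
perm.fit(X_val, y_val)  # fitted_model, X_val, y_val are placeholders
```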

LEMTideman commented 4 years ago

Hi @seralouk, in my experience, the more iterations of permutation importance you run, the more reliable the results. Permutation importance essentially returns the decrease in model score (accuracy by default for classifiers) caused by randomly shuffling the values of a feature, i.e. a column of your data matrix. The more random shuffles you average over, the more robust the resulting estimate of feature importance. The number of iterations you need depends on your application: if you know how precise you want your estimates of feature importance to be, you could plot the variance of those estimates against the number of iterations and use that to choose n_iter, along the lines of the sketch below.
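
Something like this (a rough sketch with a toy dataset and placeholder n_iter values; only the `PermutationImportance` constructor arguments and `feature_importances_` come from eli5's documented API, the rest is illustration):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from eli5.sklearn import PermutationImportance

# Toy data and model, purely for illustration.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

n_iter_values = [2, 5, 10, 20, 50]  # placeholder grid
spreads = []
for n_iter in n_iter_values:
    estimates = []
    for seed in range(10):  # repeat with different seeds to see run-to-run variation
        perm = PermutationImportance(model, n_iter=n_iter, cv="prefit",
                                     random_state=seed)
        perm.fit(X_val, y_val)
        estimates.append(perm.feature_importances_)  # mean over n_iter shuffles
    # Std of the averaged estimates across repeated runs, averaged over features:
    # a rough measure of how stable the importances are for this n_iter.
    spreads.append(np.std(estimates, axis=0).mean())

plt.plot(n_iter_values, spreads, marker="o")
plt.xlabel("n_iter")
plt.ylabel("std of importance estimates across runs")
plt.show()
```

Once the curve flattens out, adding more iterations buys little extra stability, so that elbow is a reasonable place to set n_iter.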

seralouk commented 4 years ago

Good idea to plot the variance as a function of iterations.

I was hoping there would be a rule of thumb connecting the number of iterations to the number of samples available in a study.