Feature importances: std calculation is not correct for random forest?

TeamHG-Memex / eli5

A library for debugging/inspecting machine learning classifiers and explaining their predictions

http://eli5.readthedocs.io

MIT License

Feature importances: std calculation is not correct for random forest? #121

Open lopuhin opened 7 years ago

lopuhin commented 7 years ago

When calculating the std of random forest feature importances, zero importances from individual trees probably should not be included.

kmike commented 7 years ago

It may be even more subtle. For random forests, individual trees are fit on a subset of features, and we should take into account only the features which are present in these subsets, even if they have zero feature importance.

On the other hand, at prediction time it doesn't matter whether a feature had zero importance or not, or whether it was in a subset or not, so if we're looking at the standard deviation from a "prediction" point of view, it could make sense to keep it as-is.
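
To make the three options in this thread concrete, here is a minimal sketch in plain Python. It is not eli5's actual code. The `per_tree` and `available` matrices are made-up illustration data, and the function names are hypothetical. With scikit-learn, the rows of `per_tree` would typically come from `tree.feature_importances_` for each `tree` in `forest.estimators_`; the `available` mask is an assumption about knowing which features each tree could use.

```python
from statistics import pstdev

# Hypothetical per-tree importances: one row per tree, one column per feature.
per_tree = [
    [0.6, 0.4, 0.0],
    [0.5, 0.0, 0.5],
    [0.7, 0.3, 0.0],
    [0.6, 0.0, 0.4],
]

# Hypothetical mask: True where the feature was in the subset the tree saw.
available = [
    [True, True, False],
    [True, False, True],
    [True, True, True],
    [True, True, True],
]


def std_all(rows):
    """Population std over all trees, zeros included (the "as-is" option)."""
    return [pstdev(col) for col in zip(*rows)]


def std_nonzero(rows):
    """Std over trees where the importance is non-zero (first proposal)."""
    result = []
    for col in zip(*rows):
        used = [v for v in col if v > 0]
        result.append(pstdev(used) if used else 0.0)
    return result


def std_available(rows, mask):
    """Std over trees that had the feature in their subset, keeping zeros."""
    result = []
    for col, seen in zip(zip(*rows), zip(*mask)):
        used = [v for v, ok in zip(col, seen) if ok]
        result.append(pstdev(used) if used else 0.0)
    return result


if __name__ == "__main__":
    print("all trees:     ", std_all(per_tree))
    print("non-zero only: ", std_nonzero(per_tree))
    print("available only:", std_available(per_tree, available))
```

On this toy data the second feature gets a noticeably smaller std when zeros are dropped, and an intermediate value when only trees that could use the feature are counted, which is the difference the comments above are weighing.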