TeamHG-Memex / eli5

A library for debugging/inspecting machine learning classifiers and explaining their predictions
http://eli5.readthedocs.io
MIT License
2.76k stars · 332 forks

Add a function to compute "Mean decrease in accuracy" feature importances #203

Closed · kmike closed this 7 years ago

kmike commented 7 years ago

It seems it shouldn't be model-specific, and it should support metrics other than accuracy. See https://github.com/scikit-learn/scikit-learn/issues/8898 for more details.
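For reference, the model-agnostic idea can be sketched in a few lines: permute one column at a time and measure the mean drop in score. The helper below is a hypothetical illustration, not eli5's eventual API; `score_fn` can be any metric, not just accuracy.

```python
import numpy as np

def permutation_importance(model, X, y, score_fn, n_iter=5, random_state=0):
    """Hypothetical sketch: mean decrease in score when each column is permuted."""
    rng = np.random.RandomState(random_state)
    base_score = score_fn(model, X, y)
    importances = []
    for col in range(X.shape[1]):
        scores = []
        for _ in range(n_iter):
            X_perm = X.copy()
            # permuting one column breaks its link to the target
            X_perm[:, col] = rng.permutation(X_perm[:, col])
            scores.append(score_fn(model, X_perm, y))
        importances.append(base_score - np.mean(scores))
    return np.array(importances)
```

Passing a held-out `(X, y)` gives the generalisation-oriented variant; passing the training set gives the simpler in-sample estimate.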

jnothman commented 7 years ago

It can also be calculated on a held-out set, so that you measure importance to generalisation.


kmike commented 7 years ago

Ah yeah, I was thinking such a function should always receive a held-out dataset. Is there a way around it?

In random forests such held-out datasets are available naturally (the out-of-bag samples), so for them MDA feature importances can be calculated as part of training, without an explicit held-out dataset. But I'm not sure this can be implemented without modifying sklearn, xgboost and lightgbm.

jnothman commented 7 years ago

Well, calculating on the training set will be no worse than the current feature_importances_, except where those are already based on OOB estimates.


kmike commented 7 years ago

Fixed by #227.