TeamHG-Memex / eli5

A library for debugging/inspecting machine learning classifiers and explaining their predictions
http://eli5.readthedocs.io
MIT License

Great project! Anything I should know before integrating into auto_ml? #195

Open ClimbsRocks opened 7 years ago

ClimbsRocks commented 7 years ago

Hi Team!

This looks like an awesome project. I built auto_ml to automate the machine learning process, and one of the frequently-requested features is to be able to explain why we predicted what we did. It's obviously not a trivial undertaking to do properly. So you can imagine my excitement when I saw that you guys had already taken care of it, and for all the different packages we use!

Is there anything I should know before integrating it into auto_ml? I'll make sure to give y'all a callout in the docs so people know where this cool functionality came from. Or if you just want to geek out for a few minutes about running open source ML projects, I'd love to do that too :)

Warmly, preston

kmike commented 7 years ago

Hey @ClimbsRocks,

It would be awesome if you integrated it with your auto_ml package! I think we can mention that in our docs as well. Please let us know if you run into any issues.

ClimbsRocks commented 7 years ago

@kmike A few quick thoughts after my first attempt:

kmike commented 7 years ago

Hey @ClimbsRocks!

1) Yep; you can also use eli5.formatters.format_as_dict (see the first sketch after this list). We should expose it as eli5.format_as_dict, by the way, like the other format methods (#206). @lopuhin is working on direct DataFrame support; any feedback on https://github.com/TeamHG-Memex/eli5/issues/196 is welcome!

2) Currently for ensembles we're using the feature importances provided by the libraries themselves, and scikit-learn only implements a single importance type (equivalent to "gain" in xgboost and LightGBM). In the future we should probably implement feature importance computation ourselves, so that it is consistent across libraries and so that different feature importance methods are available for all of them. But that's a good catch; I think even now we should add an "importance_type" argument for scikit-learn, even if only a single value ("gain") is accepted.

3) Great to hear that! Such feedback makes me even more motivated :)
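
For what it's worth, here is a minimal sketch of what item 1 could look like from auto_ml's side, assuming the API named above (eli5.explain_weights returning an Explanation object, plus eli5.formatters.format_as_dict); the LogisticRegression/iris setup is just a placeholder:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

import eli5
from eli5.formatters import format_as_dict  # may become eli5.format_as_dict, see #206

X, y = load_iris(return_X_y=True)
clf = LogisticRegression().fit(X, y)

expl = eli5.explain_weights(clf)   # Explanation object
print(format_as_dict(expl))        # plain dicts/lists, easy to serialize or post-process
```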

ClimbsRocks commented 7 years ago

Thanks for being such attentive project owners! I know it can be exhilarating and exhausting, so I appreciate your efforts here.

I'd love it if we could get more advanced feature importance information from the scikit-learn models. Happily, they've got a pretty stable interface, so it's unlikely that code would need to be updated very often. And they're such an industry standard that an easy way to get more detailed analytics out of them would likely be a huge win.
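
As an aside, one library-agnostic way to get richer importances out of any scikit-learn estimator is permutation importance: shuffle one feature at a time on held-out data and measure the drop in score. A hand-rolled sketch (the dataset and model are just placeholders; recent eli5 releases also offer a PermutationImportance wrapper in eli5.sklearn):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
baseline = model.score(X_test, y_test)

rng = np.random.RandomState(0)
importances = []
for col in range(X_test.shape[1]):
    X_perm = X_test.copy()
    rng.shuffle(X_perm[:, col])  # break the link between this feature and the target
    importances.append(baseline - model.score(X_perm, y_test))  # score drop = importance

# features whose shuffling hurts the score most
print(sorted(enumerate(importances), key=lambda t: -t[1])[:5])
```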

And of course, I'm always a fan of consistent APIs, so I support adding an importance_type param for scikit-learn.
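
For reference, a minimal sketch of the asymmetry discussed above: xgboost exposes several importance definitions through Booster.get_score(importance_type=...), while a scikit-learn ensemble exposes only its impurity-based feature_importances_ (the "gain"-like measure mentioned earlier); the dataset and parameters here are placeholders:

```python
import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_breast_cancer(return_X_y=True)

# scikit-learn: a single built-in notion of importance (impurity/"gain"-based)
sk_model = GradientBoostingClassifier().fit(X, y)
print(sk_model.feature_importances_[:5])

# xgboost: several importance definitions available on the trained Booster
dtrain = xgb.DMatrix(X, label=y)
bst = xgb.train({"objective": "binary:logistic"}, dtrain, num_boost_round=20)
for imp_type in ("weight", "gain", "cover"):
    print(imp_type, list(bst.get_score(importance_type=imp_type).items())[:3])
```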