TeamHG-Memex / eli5

A library for debugging/inspecting machine learning classifiers and explaining their predictions
http://eli5.readthedocs.io
MIT License

Single Record explain_prediction response time issue #257

Open ganeshmailbox opened 6 years ago

ganeshmailbox commented 6 years ago

We have an xgboost model (0.6a) and are calling eli5 (0.8) explain_prediction on it, and we are seeing a significant response-time issue (on the order of 4 seconds for a single row). Our model has around 200+ variables. We would like your help in finding options to significantly improve the performance/response time for a single row of a pandas DataFrame pdData. Here are some of the options we tried (in vain) and the questions we have:

eli5.explain_prediction(xgbmodel, pdData.values[0], top=(top_n + 1), feature_names=feat_names)

  1. We tried feature_filter to see whether reducing the number of features improves performance, but it only reduced the size of the response; there was no speed improvement.
  2. Do you think vectorization will help improve the response time for a single record?
  3. Is there an example of how to implement vectorization for an xgboost model in the explain_prediction call?
  4. Any options to improve performance in general? Any suggestion is welcome.
  5. Should we try to create a smaller xgboost model to see if this helps? If so, what criteria should we use to determine the model size? Is using explain_weights (variable importance) an option?
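For reference, this is roughly how we measure the per-row latency. Here explain_single is a hypothetical stand-in for the eli5.explain_prediction call above, so the numbers below are simulated, not real:

```python
import time

def explain_single(row):
    # Hypothetical stand-in for:
    #   eli5.explain_prediction(xgbmodel, row, top=top_n + 1,
    #                           feature_names=feat_names)
    time.sleep(0.01)  # simulate the explanation work
    return {}

row = [0.0] * 200  # one record with ~200 features

start = time.perf_counter()
explain_single(row)
elapsed = time.perf_counter() - start
print(f"explain_prediction took {elapsed:.3f}s for one row")
```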
lopuhin commented 6 years ago

@ganeshmailbox I'm afraid the only workaround is to simplify the xgboost model: reduce the number of trees/iterations and the number of leaves. I'm not an XGBoost tuning expert, but I would try lowering those two parameters, perhaps also check whether changing the learning rate helps, and compare validation metrics against eli5.explain_prediction performance; you may still reach satisfactory quality with a smaller model. The complexity of eli5.explain_prediction for xgboost should be approximately linear in the number of trees and the depth of the trees.
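To make the trade-off concrete, here is a rough back-of-the-envelope sketch using the xgboost sklearn-API parameter names (the concrete values are illustrative, not recommendations), under the "linear in number of trees and depth" estimate above:

```python
# Illustrative parameter sets only; tune against your own validation metrics.
baseline = {"n_estimators": 300, "max_depth": 8, "learning_rate": 0.05}
smaller  = {"n_estimators": 100, "max_depth": 4, "learning_rate": 0.1}

def relative_cost(params):
    # Rough proxy for explain_prediction work: trees x depth.
    return params["n_estimators"] * params["max_depth"]

speedup = relative_cost(baseline) / relative_cost(smaller)
print(f"estimated speedup: {speedup:.1f}x")  # → 6.0x
```

The learning rate does not appear in the cost estimate; raising it is only a way to keep accuracy acceptable while using fewer, shallower trees.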

The proper fix would be to profile the code and optimize it. The bottleneck is most likely somewhere in https://github.com/TeamHG-Memex/eli5/blob/master/eli5/xgboost.py#L264-L412, but it's better to profile first. I don't think this code has ever been profiled and optimized, so there may be low-hanging fruit here. I would love to work on it but will probably only have time in April. A radically different option is to use SHAP explanations: they are natively available in newer xgboost versions and so should be faster, but they are not integrated into eli5 (discussed a bit in https://github.com/TeamHG-Memex/eli5/issues/254).
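If you want to profile it yourself, a stdlib cProfile sketch like the one below would show where the time goes. explain_one here is a hypothetical stand-in doing busywork; you would replace its body with the real eli5.explain_prediction call:

```python
import cProfile
import io
import pstats

def explain_one():
    # Hypothetical stand-in for the real call:
    #   eli5.explain_prediction(xgbmodel, pdData.values[0],
    #                           feature_names=feat_names)
    total = 0
    for i in range(100_000):
        total += i * i
    return total

profiler = cProfile.Profile()
profiler.enable()
explain_one()
profiler.disable()

# Print the 10 most expensive functions by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
print(stream.getvalue())
```

The hot entries in that listing would tell us whether the time is spent in eli5's tree walking or inside xgboost itself.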