ModelOriented / DALEX

moDel Agnostic Language for Exploration and eXplanation
https://dalex.drwhy.ai
GNU General Public License v3.0

Short question regarding performance GPU/CPU support #500

Closed nilslacroix closed 2 years ago

nilslacroix commented 2 years ago

So I read about dalex and took a look at the API and tutorials. I have a big dataset with around 1.5 million samples and 50+ features, and I am planning to do an extensive XAI report on it.

May I ask if your explainers support out-of-the-box parallel computing on CPU or GPU? If yes, which explainers?

Also, since there are fast SHAP implementations for models like LightGBM, XGBoost and CatBoost, does your SHAP implementation support GPUs? If not, can I manually set SHAP values and SHAP interaction values in your interface?

Best regards!

hbaniecki commented 2 years ago

Hi, I must start by saying that we have not tested dalex with datasets of that magnitude. In fact, the default N (number of observations used for estimation) and B (number of bootstrap rounds) parameters in methods like predict_parts, model_parts and model_profile are set to rather low values to speed up the analysis; the user can increase them as needed.

May I ask if your explainers support out-of-the-box parallel computing on CPU or GPU? If yes, which explainers?

GPU is not supported. We implemented some CPU parallelization (the processes parameter, which defaults to 1): over variables for predict_profile and model_profile, and over bootstrap rounds for model_parts and predict_parts.
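For illustration, a minimal sketch of these knobs (N, B and processes); the synthetic data, model and parameter values below are placeholders, not a recommendation for a dataset of your size:

```python
import pandas as pd
import dalex as dx
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# toy data standing in for the real dataset
X, y = make_regression(n_samples=2000, n_features=10, random_state=0)
X = pd.DataFrame(X, columns=[f"x{i}" for i in range(10)])

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
explainer = dx.Explainer(model, X, y, label="rf")

# permutation importance: B bootstrap rounds spread over 4 worker processes
vi = explainer.model_parts(B=10, N=1000, processes=4)

# partial-dependence profiles: variables computed in 4 worker processes
pdp = explainer.model_profile(N=500, processes=4)
```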

Also, since there are fast SHAP implementations for models like LightGBM, XGBoost and CatBoost, does your SHAP implementation support GPUs? If not, can I manually set SHAP values and SHAP interaction values in your interface?

We created predict_parts(type="shap_wrapper") and model_parts(type="shap_wrapper") methods to allow creating explanations from the shap package consistently in our API, e.g. it should use the TreeSHAP algorithm for xgboost. For highly optimized analysis (especially GPU support), I would suggest extracting shap values from the mentioned model/boosting packages, or the original shap package, or the very new FastTreeSHAP project https://github.com/linkedin/FastTreeSHAP, which comes with a paper https://arxiv.org/abs/2109.09847 and, I believe, was evaluated on such a large datasets.
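For context, a minimal sketch of the shap_wrapper route (it assumes the shap and xgboost packages are installed; the toy data and model are only for illustration):

```python
import pandas as pd
import dalex as dx
import xgboost as xgb
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=2000, n_features=10, random_state=0)
X = pd.DataFrame(X, columns=[f"x{i}" for i in range(10)])

model = xgb.XGBRegressor(n_estimators=100, random_state=0).fit(X, y)
explainer = dx.Explainer(model, X, y, label="xgb")

# local SHAP explanation for one observation, delegated to the shap package
sh_local = explainer.predict_parts(X.iloc[[0]], type="shap_wrapper")
sh_local.plot()

# global SHAP explanation computed on (a sample of) the data
sh_global = explainer.model_parts(type="shap_wrapper")
sh_global.plot()
```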

Hope this answers the questions.

nilslacroix commented 2 years ago

Thanks for the fast reply, this answers my question. I have already tried a few simple examples, but I have not seen any advanced code so far. Can dalex handle scikit-learn pipelines? Especially with regard to get_feature_names_out(), transformers and scaling?

For example, if I one-hot encode a few categorical columns and scale my numerical features with a RobustScaler(), does dalex offer a property to combine the one-hot encoded columns into one feature in the plots? And regarding scaling, do I have to manually apply a reverse transformation (for example, if I log(x+1) my target, can I define np.expm1(x) in dalex so the features are on the right scale for interpretation)?

hbaniecki commented 2 years ago

dalex handles scikit-learn pipelines, and this is the recommended approach to working with explanations of preprocessed data. As for the target variable, it is possible to change predict_function to

define np.expm1(x) in dalex so the features are on the right scale for interpretation

Example of working with a pipeline: https://dalex.drwhy.ai/python-dalex-titanic.html
Example of transforming the target: https://dalex.drwhy.ai/python-dalex-fifa.html
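To make the two linked examples concrete, here is a rough sketch combining a preprocessing pipeline with a custom predict_function that undoes a log1p target transformation; the column names, data and model are made up for illustration:

```python
import numpy as np
import pandas as pd
import dalex as dx
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler

# toy data with one categorical and two numerical columns
df = pd.DataFrame({
    "city": np.random.choice(["a", "b", "c"], size=500),
    "size": np.random.exponential(100, size=500),
    "age": np.random.randint(1, 50, size=500),
})
y = df["size"] * 3 + df["age"] * 2 + np.random.rand(500)

preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ("num", RobustScaler(), ["size", "age"]),
])
pipe = Pipeline([("prep", preprocess),
                 ("model", RandomForestRegressor(random_state=0))])
pipe.fit(df, np.log1p(y))  # model trained on log1p(y)

def predict_original_scale(model, data):
    # undo the log1p transform so explanations are on the original target scale
    return np.expm1(model.predict(data))

# explanations are expressed in terms of the raw columns (city, size, age),
# because the preprocessing lives inside the pipeline
explainer = dx.Explainer(pipe, df, y, predict_function=predict_original_scale)
```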

Also, you might try to use dalex with models trained on a GPU, but I am unsure how this will behave. Anyhow, a major bottleneck in computing explanations is making predictions; if one improves inference time by working on a GPU, then dalex will compute explanations faster. (This would probably require defining a custom predict_function that puts the data onto the GPU device.)
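A purely illustrative sketch of that idea, assuming a PyTorch model (not something dalex ships with); the custom predict_function moves each batch to the model's device before inference:

```python
import numpy as np
import pandas as pd
import torch
import dalex as dx

X = pd.DataFrame(np.random.rand(1000, 5), columns=[f"x{i}" for i in range(5)])
y = X.sum(axis=1).to_numpy()

device = "cuda" if torch.cuda.is_available() else "cpu"
net = torch.nn.Sequential(torch.nn.Linear(5, 16), torch.nn.ReLU(),
                          torch.nn.Linear(16, 1)).to(device)  # untrained, illustration only

def gpu_predict(model, data):
    # put the data onto the same device as the model before predicting
    X_t = torch.tensor(data.to_numpy(), dtype=torch.float32, device=device)
    with torch.no_grad():
        return model(X_t).cpu().numpy().ravel()

explainer = dx.Explainer(net, X, y, predict_function=gpu_predict)
```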

nilslacroix commented 2 years ago

Well, there is sklearnex, which improves inference speed for many models if you use an Intel CPU. The transformation of y is good, but is there an attribute for the columns themselves, since the explanation part is mostly aimed at the features in X? For example, can you do something like this to adjust ticks/labels in the plots?

df["Feature"] = scaler.inverse_transform(df["Feature"])

where scaler is, for example, a scikit-learn scaler.

hbaniecki commented 2 years ago

I don't understand your point. If you use a pipeline, there is no need to rescale features for the plots, as shown in https://dalex.drwhy.ai/python-dalex-titanic.html

hbaniecki commented 2 years ago

I believe this one is solved. If not, please open a new issue : )