Closed: nilslacroix closed this 2 years ago
Hi, I must start by saying that we did not test dalex with datasets of such magnitude. In fact, the default N (number of observations used for estimation) and B (bootstrap rounds) parameters in methods like `predict_parts`, `model_parts`, and `model_profile` are set to rather low values to speed up the analysis, and the user can increase them as needed.
> May I ask if your explainers support out-of-the-box parallel computing on CPU or GPU? If yes, which explainers?
GPU is not supported. We implemented some CPU parallelization (the `processes` parameter, default `1`) over variables for `predict_profile` and `model_profile`, and over bootstrap rounds for `model_parts` and `predict_parts`.
> Also, since there is a fast SHAP implementation for models like LightGBM, XGBoost and CatBoost, does your SHAP implementation support GPUs? If not, can I manually set SHAP values and SHAP interaction values in your interface?
We created the `predict_parts(type="shap_wrapper")` and `model_parts(type="shap_wrapper")` methods to allow creating explanations from the `shap` package consistently in our API; e.g. it should use the TreeSHAP algorithm for `xgboost`. For highly optimized analysis (especially GPU support), I would suggest extracting SHAP values from the mentioned model/boosting packages, from the original `shap` package, or from the very new FastTreeSHAP project (https://github.com/linkedin/FastTreeSHAP), which comes with a paper (https://arxiv.org/abs/2109.09847) and, I believe, was evaluated on such large datasets.
Hope this answers the questions.
Thanks for the fast reply, this answers my question. I already tried a few simple examples, but I have not seen any advanced code so far. Can dalex handle scikit-learn pipelines? Especially with regard to `get_feature_names_out()`, transformers and scaling?

For example, if I one-hot encode a few categorical columns and scale my numerical features with `RobustScaler()`, does dalex offer a property to combine the one-hot encoded columns into one feature in the plots? And regarding scaling: do I have to manually apply a reverse transformation? For example, if I log(x+1) my target, can I define an np.expm1(x+1) property in dalex so the features are on the right scale for interpretation?
`dalex` handles scikit-learn pipelines and this is the recommended approach to work with explanations of preprocessed data. As for the target variable, it is possible to change `predict_function` to

> define np.expm1(x+1) property in dalex so the features are on the right scale for interpretation

Example of working with a pipeline: https://dalex.drwhy.ai/python-dalex-titanic.html
Example of transforming the target: https://dalex.drwhy.ai/python-dalex-fifa.html
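A minimal sketch of both ideas together, with made-up data and column names: a pipeline that one-hot encodes and scales, fit on a log1p-transformed target, plus a custom `predict_function` that maps predictions back to the original scale. The dalex call is left as a comment so the example stays self-contained:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "city": rng.choice(["a", "b", "c"], size=200),
    "size": rng.normal(50, 10, size=200),
})
y = np.exp(rng.normal(1.0, 0.3, size=200))  # strictly positive target

pre = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ("num", RobustScaler(), ["size"]),
])
pipe = Pipeline([("pre", pre), ("model", Ridge())])
pipe.fit(X, np.log1p(y))  # train on log(1 + y)

def predict_function(model, data):
    # invert the log1p transform so explanations are on the original scale
    return np.expm1(model.predict(data))

# exp = dalex.Explainer(pipe, X, y, predict_function=predict_function)

preds = predict_function(pipe, X)
print(preds[:3])
```

Because the preprocessing lives inside the pipeline, dalex sees the raw `X` columns ("city", "size"), so plots are labeled with the original features rather than the one-hot encoded ones.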
Also, you might try to use `dalex` with models trained on GPU, but I am unsure how this will behave. Anyhow, a large bottleneck of computing explanations is making a prediction; if one improves inference time by working on GPU, then `dalex` will compute explanations faster. (This would probably require defining a custom `predict_function` that puts the data onto a GPU device.)
Well, there is sklearnex, which improves inference speed on many models if you use an Intel CPU. The transformation of y is good, but is there an attribute for the columns themselves, since the explanation part is mostly aimed at the features in X? For example, can you do something like this to adjust ticks/labels in the graphics?

```python
df["Feature"] = scaler.inverse_transform(df["Feature"])
```

where `scaler` is, for example, a scikit-learn scaler.
I don't understand your point. If you use a pipeline, there is no need to scale features for plots, as shown in https://dalex.drwhy.ai/python-dalex-titanic.html
I believe this one is solved. If not, please open a new issue : )
So I read about dalex and took a look into the API and tutorials. I have a big dataset with around 1.5 million samples and 50+ features, and I am planning to do an extensive XAI report on it.

May I ask if your explainers support out-of-the-box parallel computing on CPU or GPU? If yes, which explainers?

Also, since there is a fast SHAP implementation for models like LightGBM, XGBoost and CatBoost, does your SHAP implementation support GPUs? If not, can I manually set SHAP values and SHAP interaction values in your interface?

Best regards!