First of all, great job! I am adopting your solution to my problem. In my case I have a regression model (an LSTM) that predicts a number of sales, which I then have to distribute among the predictor features according to their weight, as given by the SHAP values.
At the point where I am, three questions arise.
The example you show with your fraud detection data is a classification example. Would any modifications be needed to apply it to our problem? For example, the Shapley values you obtain are between 0 and 1, because in your case the prediction is always in this range, and your plots are prepared for this scale. In my case, the target does not have a defined range, and I do not know whether standardizing the target would disturb my current results. Moreover, standardizing the target does not guarantee that the target values for the test data fall in that range.
From the data you load, I see that you use a fixed look_back (the length of each sequence), while for me it is a parameter I tune to include more or less information, so that the LSTM learns more or less from the past. Does this affect anything (the pruning algorithm, etc.)?
My other question is about the interpretability of the SHAP values. From the SHAP values obtained (both along the temporal dimension and at the feature level), how could I attribute the predicted quantity to each of the predictor features? In other words, how are these SHAP values interpreted quantitatively? Most analyses I see only look at which features influence the prediction more or less, but not at how much more or less relative to the rest.
Sorry for the long message, and thank you very much for your contribution.
Antonio
Although we did not specifically test TimeSHAP on regression, TimeSHAP is implemented on top of KernelSHAP, which works for regression. I see no reason why TimeSHAP would not be applicable to regression where the scores are not bounded between 0 and 1. If you test this, we would appreciate the feedback, and if you have an example we would be happy to add it to our examples.
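To illustrate (a minimal sketch using the generic `shap` package, not TimeSHAP's own API): KernelSHAP places no restriction on the output range, so the attributions come out directly on the scale of the prediction. Here `predict_sales` is a hypothetical stand-in for an LSTM regression head:

```python
# Minimal sketch: KernelSHAP on an unbounded regression output.
# `predict_sales` and the data here are illustrative, not TimeSHAP-specific.
import numpy as np
import shap

rng = np.random.default_rng(0)
X_background = rng.normal(size=(50, 3))  # "background" / uninformative data
x = rng.normal(size=(1, 3))              # the instance to explain

def predict_sales(X):
    # Unbounded output, e.g. units of sales; not a probability in [0, 1].
    return 100.0 + 35.0 * X[:, 0] - 12.0 * X[:, 1] + 4.0 * X[:, 2]

explainer = shap.KernelExplainer(predict_sales, X_background)
phi = explainer.shap_values(x)           # per-feature attributions, in sales units
print(explainer.expected_value, phi)     # baseline score and attributions
```

No standardization of the target is needed for the explanations themselves; the attributions simply inherit the units of the model output.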
In our examples we use a fixed sequence length only because all sequences in the dataset have the same length. TimeSHAP explains one sequence at a time, so the sequence length is not a factor for TimeSHAP.
According to the game theory behind Shapley values, the calculated explanations are a fair distribution of the model score across the considered features (and events/cells in TimeSHAP). In KernelSHAP, and consequently in TimeSHAP, the difference between the instance score and the baseline score is distributed fairly across the considered axis, and therefore the explanations can be interpreted quantitatively. An intuitive example: given a feature A with value 20 and a "background" (uninformative) value of 10, if the Shapley value of A is 0.2, the predicted score is 0.2 higher with A=20 than with A=10; in other words, f(A=20) = f(A=10) + 0.2.
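That example can be made concrete with a toy one-feature model (assumed numbers, not TimeSHAP output); with a single feature, the Shapley value is exactly the score difference against the background:

```python
# Toy model making the worked example above concrete (illustrative only).
# With one feature, the Shapley value of A is exactly
# f(instance) - f(background), so the quantitative reading is direct.
def f(A: float) -> float:
    return 0.02 * A  # score grows 0.02 per unit of A

background_A = 10.0  # "background" (uninformative) value of A
instance_A = 20.0

phi_A = f(instance_A) - f(background_A)  # Shapley value of A here: 0.2

# Quantitative reading: the prediction is phi_A higher with A=20 than
# with the background A=10, i.e. f(A=20) == f(A=10) + phi_A.
assert abs(f(instance_A) - (f(background_A) + phi_A)) < 1e-12
```

For your sales attribution, this means each feature's (or event's) SHAP value is directly its share of `prediction - baseline`, in the same units as the prediction.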