Hello @franciscomalveiro ,
1. Regarding the label in our toy dataset, you are correct that it represents the label for the entire sequence. However, TimeSHAP also works in scenarios where a label is required for each timestamp, as in the financial dataset we reported in our paper. Note that, regardless of the dataset or use case, TimeSHAP explains one event of a sequence at a time.
2. Regarding your error with ExplainedLSTM, I was unable to replicate the issue you encountered. However, I did notice that there were some errors in the class definition of ExplainedLSTM. We have fixed these issues and the fix will be implemented in the next version update. Could you please let us know if you continue to encounter the same errors after the update? The correct class definition is as follows:
class ExplainedLSTM(nn.Module):
    def __init__(self,
                 input_size: int,
                 cfg: dict,
                 ):
        super(ExplainedLSTM, self).__init__()  # <-- Change this line
        (...)

    def forward(self, x, hidden_states=None):  # arguments as in the original definition
        (...)
        y = self.classifier_block(output[:, -1, :])  # <-- Change this line
        y = self.output_activation_func(y)
        return y, hidden
3. From what I understand of your issue, it should be possible to combine the two solutions. We created the model wrappers specifically for these types of cases.
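For illustration, here is a rough sketch of how the two could be combined. SigmoidWrapper and my_model are placeholder names; TorchModelWrapper and predict_last_hs are the wrapper utilities used in our notebooks, so double-check the exact imports against your TimeSHAP version:

```python
import torch
import torch.nn as nn
from timeshap.wrappers import TorchModelWrapper  # wrapper used in the notebooks

class SigmoidWrapper(nn.Module):
    """Adds a sigmoid on top of a model trained with BCEWithLogitsLoss."""
    def __init__(self, model: nn.Module):
        super().__init__()
        self.model = model

    def forward(self, x, *hidden):
        # Assumes the wrapped model returns (logits, hidden_state),
        # like the ExplainedRNN/ExplainedLSTM classes in the notebooks.
        logits, hidden_state = self.model(x, *hidden)
        return torch.sigmoid(logits), hidden_state

# my_model: your LSTM/Transformer trained with BCEWithLogitsLoss (placeholder)
# wrapped = TorchModelWrapper(SigmoidWrapper(my_model))
# f_hs = lambda x, y=None: wrapped.predict_last_hs(x, y)
```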
PS.
That is correct. TimeSHAP is capable of handling sequences of different lengths, and we use the pair (all_id, timestamp) to identify them in the dataset.
I hope this answer is helpful. If you have any further questions feel free to ask.
Hello!
Regarding the error with ExplainedLSTM, I have fixed the code as you said, and it is all going smoothly now.
Regarding the combination of my wrapper with yours, I have implemented it and it seems to be working as well.
Meanwhile I have faced the same issue as #38 - it happens when trying to generate global explanations using an average event as baseline:
Score difference between baseline and instance (0.06294184923171997) is too low < 0.1. Baseline score: 0.3232707977294922 | Instance score: 0.2603289484977722. Consider choosing another baseline.
I have tried using average_sequence, but the result is the same. However, when generating some specific local explanations, it goes smoothly. It might be related to some instances in my dataset that generate low score differences. So, as you suggested, in the meantime I may change the source code to see if I can generate those explanations.
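For reference, this is roughly how I am computing the two baselines, following your notebook (the function names come from your docs, but the argument order and keyword names here are from memory and may not match the exact signatures):

```python
from timeshap.utils import calc_avg_event, calc_avg_sequence

# numerical_feats / categorical_feats / model_features are my column lists (placeholders);
# argument order and names are approximate - see the notebook for the exact calls.
average_event = calc_avg_event(d_train, numerical_feats, categorical_feats)
average_sequence = calc_avg_sequence(d_train, numerical_feats, categorical_feats,
                                     model_features, entity_col='all_id')
```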
When following your notebook again to change the baseline to average_sequence, the resulting event plot appeared like this:
This plot is part of the image generated by the following cell of the same notebook:
local_report(f_hs, pos_x_data, pruning_dict, event_dict, feature_dict, cell_dict, average_sequence, entity_uuid=positive_sequence_id, entity_col='all_id')
(I have changed average_event to average_sequence - no other variable was changed.)
I am still finishing the adaptation of the framework to my use case, so more doubts may appear - do you prefer that I post each question individually on its own issue or to continue this thread?
Thanks for your help!
Hello @franciscomalveiro ,
Regarding the score difference, using the average event or the average sequence will not directly solve this issue. It may work for some sequences, but it's important to note that explanations will differ depending on which approach is used, as the background instance will change. As mentioned in issue #38, we are considering two possible approaches and would greatly appreciate your input. Please let us know if either of these options would be suitable for your use case:
Regarding the plot you shared, I can see that it appears unusual. However, I couldn't identify any specific issues at first glance. This could be due to a significant number of events not being pruned, causing the plot to display all of them. To address this, you could try increasing the pruning tolerance (to show fewer events), adjusting the plot size, or simply ignoring this plot and directly analyzing the explanation dataframe.
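For example, a larger tolerance in the pruning dictionary passed to local_report prunes more events and leaves fewer bars in the event plot (assuming the 'tol' key used in the notebooks; the exact value is dataset-dependent, 0.1 here is only illustrative):

```python
# Other arguments as in the earlier notebook cell; only the tolerance changes.
pruning_dict = {'tol': 0.1}

local_report(f_hs, pos_x_data, pruning_dict, event_dict, feature_dict, cell_dict,
             average_sequence, entity_uuid=positive_sequence_id, entity_col='all_id')
```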
If you have any general questions that you believe would benefit others, please create a separate issue. However, if these errors are specific to your case, feel free to continue the discussion in this thread.
Wouldn't it be possible to combine both options (like a variable that can be defined as None)? From what I've read in your answer to #38, you have found the threshold value empirically (0.1 for your dataset) - it is likely that it should be defined differently for other datasets, right? Perhaps a method to find this value depending on the dataset and baseline used would be the best option (and the hardest, I suppose :stuck_out_tongue: )
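For illustration, something along these lines is what I had in mind for inspecting the score gaps on a given dataset before fixing a threshold (purely a sketch; f, sequences and baseline_event are placeholder names, not part of the TimeSHAP API):

```python
import numpy as np

def score_gaps(f, sequences, baseline_event):
    # f: wrapped model returning one score per sequence
    # sequences: list of (1, n_events, n_feats) numpy arrays
    # baseline_event: (1, 1, n_feats) average event
    gaps = []
    for seq in sequences:
        base = np.tile(baseline_event, (1, seq.shape[1], 1))  # repeat the avg event over the sequence
        gaps.append(np.abs(f(seq) - f(base)).item())
    return np.array(gaps)

# A data-driven threshold could then be, e.g., a low percentile of the observed gaps:
# threshold = np.percentile(score_gaps(f, sequences, baseline_event), 10)
```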
Regarding the weird plot, I will look deeper into pruning stats and parameters to check if that is the issue.
When I was trying to change the baseline to debug the issue regarding that threshold, I stumbled upon this description of background:
background : numpy.array or pd.DataFrame
The background event/sequence to use for integrating out features. To determine the impact
of a feature, that feature is set to "missing" and the change in the model output
is observed. Since most models aren't designed to handle arbitrary missing data at test
time, we simulate "missing" by replacing the feature with the values it takes in the
background dataset. So if the background dataset is a simple sample of all zeros, then
we would approximate a feature being missing by setting it to zero.
In TimeSHAP you can use an average event or average sequence.
When using average events, consider using `timeshap.calc_avg_event` method to obtain it.
When using average sequence, considering using `timeshap.calc_avg_sequence` method to obtain it.
Note that when using the average sequence, all sequences of the dataset need to be the same.
I have used the methods you describe to generate average events and sequences; however, I am confused when you state that "all sequences of the dataset need to be the same". If one is trying to explain different sequences, how can all sequences of the dataset be the same? Can one only use this method if the dataset is composed of repetitions of the same sequence?
Regarding the threshold, I have opted for now to skip the test altogether, to inspect the results.
While generating global explanations, I have tried to include pruning, and the following error popped up:
AssertionError: Pruning idx must be smaller than the sequence length. If not all events are pruned
To train certain models, I had to perform sequence padding / truncation, and for that reason some sequences may use the same element in all timesteps (namely sequences which originally were of size one). Could this lead to the pruning of all events?
Could this error be related to not using the threshold?
Hi @franciscomalveiro
Wouldn't it be possible to combine both options (like a variable that can be defined as None)? From what I've read in your answer to https://github.com/feedzai/timeshap/issues/38, you have found the threshold value empirically (0.1 for your dataset) - it is likely that it should be defined differently for other datasets, right? Perhaps a method to find this value depending on the dataset and baseline used would be the best option
Regarding this point, we will consider integrating both options as you have proposed. Additionally, we will discuss internally the possibility of implementing a method to automatically determine the threshold value.
I have used the methods you describe to generate average events and sequences, however, I am confused when you state that "all sequences of the dataset need to be the same". If one is trying to explain different sequences, how can all sequences of the dataset be the same? Can one only use this method if the dataset is composed by repetitions of the same sequence?
There is indeed a typo in our description. It should read: all sequences of the dataset need to be the same length. Our language here may be too strong, as this is more of a soft requirement aimed at ensuring that older events (in terms of sequence position) do not carry less statistical significance. To illustrate this, consider a dataset where the majority of sequences have a length of less than 100 elements, but there are a few outlier sequences with more than 100 elements. In the resulting background sequence, the first 100 elements will represent the average across the majority of the dataset. However, the older events of the average sequence (beyond the 100-element mark) will merely reflect the average of the respective events of the outlier sequences, which may not be as representative or generalizable as desired for the background sequence. Thank you for noting this, we will update the description accordingly.
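As a plain-pandas illustration of this soft requirement (not a TimeSHAP API; d_train, all_id and timestamp follow the notebook, and max_len is an arbitrary cut-off):

```python
# Cap each sequence at its most recent max_len events so that the tail of the
# average sequence is not dominated by a few outlier (very long) sequences.
# d_train: the training DataFrame used in the notebook.
max_len = 100
d_background = (d_train
                .sort_values(['all_id', 'timestamp'])
                .groupby('all_id')
                .tail(max_len))
```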
While generating global explanations, I have tried to include pruning, and the following error popped up: AssertionError: Pruning idx must be smaller than the sequence length. If not all events are pruned
This error is one of the reasons why we introduced the threshold feature, but regardless, it should be resolved even if the threshold is not considered. Could you please provide us with the version of TimeSHAP you are currently using?
I attempted to replicate the error you mentioned, but I was not able to do so. Nevertheless, this error stems from the temp_coalition_pruning method. Specifically, the error can occur when all events are pruned, which happens when the background and the explained sequence are very similar. However, these lines here should have prevented the error.
I hope this answer is helpful, and I will be waiting for more information.
I am using timeshap version 1.0.3.
I have been trying to replicate the error in the notebook you provide, so that it would be easier for you to check, but I have not been successful yet.
I will also take a look at that method, and at the files and arguments I am providing to those functions, to check if the problem resides there.
So the error comes from here: https://github.com/feedzai/timeshap/blob/e155bb2ac9f8b437038507a41e8d3b684c047508/src/timeshap/explainer/kernel/timeshap_kernel.py#L287
because pruning_idx == X.shape[1] (= 4).
pruning_idx was originally 0, and the sequence length (4) was added to it: https://github.com/feedzai/timeshap/blob/e155bb2ac9f8b437038507a41e8d3b684c047508/src/timeshap/explainer/event_level.py#L332-L335
Inspecting the generated pruning_idx dataframe, there are a total of 573 entries (191 sequences * 3 tolerances: -1, 0.05, 0.075), 100 of which have pruning_idx = 0.
I have also inspected the lines you pointed out, namely
https://github.com/feedzai/timeshap/blob/e155bb2ac9f8b437038507a41e8d3b684c047508/src/timeshap/explainer/pruning.py#L190-L197
Note that both conditions expect tolerance to be not None. However, when performing a complete pruning with prune_all,
https://github.com/feedzai/timeshap/blob/e155bb2ac9f8b437038507a41e8d3b684c047508/src/timeshap/explainer/pruning.py#L309-L319
the argument tolerance is indeed None when the function temp_coalition_pruning is called:
https://github.com/feedzai/timeshap/blob/e155bb2ac9f8b437038507a41e8d3b684c047508/src/timeshap/explainer/pruning.py#L407
Since pruning_idx is initialised to 0,
https://github.com/feedzai/timeshap/blob/e155bb2ac9f8b437038507a41e8d3b684c047508/src/timeshap/explainer/pruning.py#L176
this value never changes; the sequence length is then added to it, and the error pops up.
Hi @franciscomalveiro ,
Thank you for providing such a detailed analysis. It seems that the issue is related to the baseline and the explained instance having very close scores. As a result, during the pruning process, TimeSHAP is able to meet the tolerance threshold by replacing the entire sequence with the baseline, leading to the pruning of the entire sequence.
The error being raised in the timeshap_kernel is accurate, as pruning the entire sequence is not a valid option for generating explanations. To address this, I can suggest two potential solutions:
I hope this answer is helpful. If you have any further questions feel free to ask.
I think it is all for now, anything more that comes up, I'll reopen the issue. Thanks a lot for your help!
Hello again!
By following the notebook you provide and trying to adapt TimeSHAP to my use case, I have come across a few doubts.
1. Regarding the format of the dataset.
In the model interface provided, the expected dataset format is described; it is the sort of dataset I am using too.
Returning to the notebook, I started to analyse how the data is arranged in order to use the framework (please correct me if there is something wrong):
- the dataset is split into d_train and d_test;
- d_train is used to train the ExplainedRNN model and d_test to generate explanations (normalised when there is the need for it);
- sequences are identified by the all_id feature;
- events within a sequence are ordered by timestamp;
- each timestamp has an associated label feature.

My doubt resides on this last point: the label feature does not reflect the classification of each timestamp, but rather the classification attributed to the sequence as a whole (the intended use case), right?
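For concreteness, this is roughly the shape of the data I am assuming (column names mirror your notebook; the feature columns are placeholders):

```python
import pandas as pd

# Illustrative only: all_id identifies the sequence, timestamp orders events
# within it, and label is the sequence-level target repeated on every row.
d_train = pd.DataFrame({
    'all_id':    [1, 1, 1, 2, 2],
    'timestamp': [1, 2, 3, 1, 2],
    'feat_1':    [0.3, 0.1, 0.7, 0.2, 0.9],   # placeholder feature
    'feat_2':    [1.0, 0.0, 1.0, 0.0, 0.0],   # placeholder feature
    'label':     [1, 1, 1, 0, 0],
})
```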
2. Regarding one of the provided use cases
In the model interface, there is a reference to an ExplainedLSTM model, so I found it in the API showcase notebook. This model is very similar to the ones I am using (LSTM + Linear layer; TransformerEncoder + Linear layer).
I tried to run the notebook selecting this model, but it failed on one of the cells (sorry to paste the code here; notebook referencing on issues should be easier to do...) with the error:
RuntimeError: For batched 3-D input, hx and cx should also be 3-D but got (2-D, 2-D) tensors
But since it runs with the ExplainedRNN model, it should only be a small issue with the ExplainedLSTM one.

3. Regarding model adaptation
I have developed my models using PyTorch, however there is a small difference with yours: given that I am using torch's BCEWithLogitsLoss, my models lack the application of sigmoid inside the model, as you do with ExplainedRNN. The solution I have been following with other explainability frameworks is to build a Wrapper around my models, where I apply the sigmoid function and convert to numpy.ndarray if necessary. I noticed that you also provide a wrapper for torch models, so I was wondering if it would be possible to integrate the two solutions.

Sorry for the long issue (I was almost willing to write each point in a separate issue) and thanks for reading!
PS: I am also using a variation of my models, where they handle variable-length sequences. Some explainability tools are a bit hard to use in this scenario, but I believe that is not the case with yours (I took a look at #28 already): the timestamp feature in your dataset allows for that use case - different sequences may have different lengths, where each element is identified by the pair (key, timestep) - in your notebooks the pair (all_id, timestamp).
(just wanting to be sure about it :) )