Doubts on provided notebook: dataset format, model configuration

franciscomalveiro commented 1 year ago

Hello again!

By following the notebook you provide and trying to adapt TimeSHAP to my use case, I have come across a few doubts.

1. Regarding the format of the dataset.

In the model interface provided, it is referred that

In order for TimeSHAP to explain a model, an entry point must be provided. This Callable entry point must receive a 3-D numpy array, (#sequences; #sequence length; #features) and return a 2-D numpy array (#sequences; 1) with the corresponding score of each sequence.

It is the sort of dataset I am using too.

Returning back to the notebook, I started to analyse how the data was arranged in order to use the framework (please correct me if there is something wrong):

There are two dataframes containing the manipulated data, d_train and d_test;
You use d_train to train the ExplainedRNN model and d_test to generate explanations (normalised when there is the need to it);
Each sequence is identified by a key, given by the all_id feature;
Each element of each sequence is identified by the feature named timestamp.
Each element of each sequence has got a label feature.

My doubt resides on this last point:

This label feature does not reflect the classification of each timestamp, but rather the classification attributed to the sequence as a whole (the intended use case), right?
- this setting allows having different elements of one sequence classified differently? (not the intended use case, just curious)
- one should propagate the classification of each sequence throughout all its elements?

2. Regarding one of the provided use cases

In the model interface, there is a reference to an ExplainedLSTM model

a tuple of tuples of numpy arrays (usually used when using LSTM's) (class ExplainedLSTM on notebook);;

So I found it on the API showcase notebook. This model is very similar to the ones I am using (LSTM + Linear layer; TransformerEncoder + Linear layer)

I tried to run the notebook selecting this model, but it failed to run on cell (sorry to paste the code here; notebook referencing on issues should be easier to do...)

from timeshap.utils import get_avg_score_with_avg_event
avg_score_over_len = get_avg_score_with_avg_event(f_hs, average_event, top=480)

with error RuntimeError: For batched 3-D input, hx and cx should also be 3-D but got (2-D, 2-D) tensors

But if it runs with the ExplainedRNN model, it should only be a little issue with the ExplainedLSTM one.

3. Regarding model adaptation

I have developed my models using pytorch, however there is a small difference with yours: given that I am using torch's BCEWithLogitsLoss, my models lack the application of sigmoid inside the model, as you do with ExplainedRNN. The solution I have been following with other explainability frameworks is to build a Wrapper around my models, where I apply the sigmoid function and convert to numpy.ndarray if necessary. I noticed that you too provide a wrapper to torch models, so I was wondering if it would be possible to integrate the two solutions.

Sorry for the long issue (I was almost willing to write each point in a separated issue) and thanks for reading!

PS: I am also using a variation of my models, where they handle variable-length sequences. Some explainability tools sometimes are a bit hard to use on this scenario, but I believe yours is not the case I took a look at #28 already):

Using a timestamp feature on your dataset allows for that use case - different sequences may have different lengths, where each element is identified by the pair (key, timestep) - in your notebooks the pair (all_id, timestamp).

(just wanting to be sure about it :) )

JoaoPBSousa commented 1 year ago

Hello @franciscomalveiro ,

1. Regarding the label in our toy dataset, you are correct that it represents the label for the entire sequence. However, TimeSHAP also works in scenarios where a label is required for each timestamp, as in the financial dataset we reported in our paper. Note that, regardless of the dataset or use case, TimeSHAP explains one event of a sequence at a time.

2. Regarding your error with ExplainedLSTM, I was unable to replicate the issue you encountered. However, I did notice that there were some errors in the class definition of ExplainedLSTM. We have fixed these issues and the fix will be implemented in the next version update. Could you please let us know if you continue to encounter the same errors after the update? The correct class definition is as follows:

class ExplainedLSTM(nn.Module):
    def __init__(self,
                 input_size: int,
                 cfg: dict,
                 ):
        super(ExplainedLSTM, self).__init__()  <-- Change this line
        (...)

    def forward():
        y = self.classifier_block(output[:, -1, :])  <-- Change this line
        y = self.output_activation_func(y)
        return y, hidden

3. From what I understand of your issue, it should be possible to combine the two solutions. We created the model wrappers specifically for these types of cases.

PS. That is correct. TimeSHAP is capable of handling sequences of different lengths, and we use the pair (all_id, timestamp) to identify them in the dataset.

I hope this answer is helpful. If you have any further questions feel free to ask.

franciscomalveiro commented 1 year ago

Hello!

Regarding the error with ExplainedLSTM, I have fixed the code as you said, and it is all going smoothly now.

Regarding the combination of my wrapper with yours, I have implemented it and it seems to be working as well.

Meanwhile I have faced the same issue as #38 - it happens when trying to generate global explanations using an average event as baseline: Score difference between baseline and instance (0.06294184923171997) is too low < 0.1.Baseline score: 0.3232707977294922 | Instance score: 0.2603289484977722.Consider choosing another baseline.

I have tried using average_sequence but the result is the same. However, when generating some specific local explanations, it goes smoothly. It might be related to some instances on my dataset that generate low score differences. So, as you suggested, meanwhile I may change the source code to see if I can generate those explanations.

When following your notebook again to try to change the baseline to average_sequence, the resulting event plot appeared like this:

This plot is part of the image generated by the cell on the same notebook: local_report(f_hs, pos_x_data, pruning_dict, event_dict, feature_dict, cell_dict, average_sequence, entity_uuid=positive_sequence_id, entity_col='all_id') (I have changed average_event to average_sequence - no other variable was changed.)

I am still finishing the adaptation of the framework to my use case, so more doubts may appear - do you prefer that I post each question individually on its own issue or to continue this thread?

Thanks for your help!

JoaoPBSousa commented 1 year ago

Hello @franciscomalveiro ,

Regarding the score difference, using the average event or the average sequence will not directly solve this issue. It may work for some sequences, but it's important to note that explanations will differ depending on which approach is used, as the background instance will change. As mentioned in issue #38, we are considering two possible approaches and would greatly appreciate your input. Please let us know if either of these options would be suitable for your use case:

Allowing users to define the threshold value instead of having it fixed at 0.1.
Providing an option to skip the check altogether if the user desires.

Regarding the plot you shared, I can see that the plot appears unusual. However, I couldn't identify any specific issues at first glance. This could be due to a significant number of events not being pruned, causing the plot to display all of them. To address this, you could try increasing the pruning tolerance (to show fewer events), adjusting the plot size, or simply ignore this plot and directly analyze the explanation dataframe.

If you have any general questions that you believe would benefit others, please create a separate issue. However, if these errors are specific to your case, feel free to continue the discussion in this thread.

franciscomalveiro commented 1 year ago

Wouldn't it be possible to combine both options (like a variable that can be defined as None)? As of what I've read from your answer to #38, you have found the threshold value empirically (0.1 for your dataset) - it is likely that it should be defined differently for other datasets, right? Perhaps a method to find this value depending on the dataset and baseline used would be the best option (and the hardest, I suppose :stuck_out_tongue: )

Regarding the weird plot, I will look deeper into pruning stats and parameters to check if that is the issue.

When I was trying to change the baseline to debug the issue regarding that threshold, I have stumbled on this description of background:

background : numpy.array or pd.DataFrame
        The background event/sequence to use for integrating out features. To determine the impact
        of a feature, that feature is set to "missing" and the change in the model output
        is observed. Since most models aren't designed to handle arbitrary missing data at test
        time, we simulate "missing" by replacing the feature with the values it takes in the
        background dataset. So if the background dataset is a simple sample of all zeros, then
        we would approximate a feature being missing by setting it to zero.
        In TimeSHAP you can use an average event or average sequence.
        When using average events, consider using `timeshap.calc_avg_event` method to obtain it.
        When using average sequence, considering using `timeshap.calc_avg_sequence` method to obtain it.
        Note that when using the average sequence, all sequences of the dataset need to be the same.

I have used the methods you describe to generate average events and sequences, however, I am confused when you state that "all sequences of the dataset need to be the same". If one is trying to explain different sequences, how can all sequences of the dataset be the same? Can one only use this method if the dataset is composed by repetitions of the same sequence?

franciscomalveiro commented 1 year ago

Regarding the threshold, I have opted for now to skip the test altogether, to inspect the results.

While generating global explanations, I have tried to include pruning, and the following error popped up: AssertionError: Pruning idx must be smaller than the sequence length. If not all events are pruned

To train certain models, I had to perform sequence padding / truncation, and for that reason, some sequences may use the same element in all timesteps (namely sequences which originally were of size one). May this lead to the pruning of all events?
May this error be related to not using the threshold?

JoaoPBSousa commented 1 year ago

Hi @franciscomalveiro

Wouldn't it be possible to combine both options (like a variable that can be defined as None)? As of what I've read from your answer to https://github.com/feedzai/timeshap/issues/38, you have found the threshold value empirically (0.1 for your dataset) - it is likely that it should be defined differently for other datasets, right? Perhaps a method to find this value depending on the dataset and baseline used would be the best option

Regarding this point, we will consider integrating both options as you have proposed. Additionally, we will discuss internally the possibility of implementing a method to automatically determine the threshold value.

I have used the methods you describe to generate average events and sequences, however, I am confused when you state that "all sequences of the dataset need to be the same". If one is trying to explain different sequences, how can all sequences of the dataset be the same? Can one only use this method if the dataset is composed by repetitions of the same sequence?

There is indeed a typo in our description. It should read: all sequences of the dataset need to be the same length. Our language here may be too strong as this is more a soft requirement to aimed at ensuring that older events (in terms of sequence position) do not carry less statistical significance. To illustrate this, consider a dataset where the majority of sequences have a length of less than 100 elements, but there are a few outlier sequences with more than 100 elements. Considering the background sequence, the first 100 elements will represent the average across the majority of the dataset. However, the older events of the average sequence (beyond the 100-element mark) will merely reflect the average of the respective events of the outlier sequences, which may not be as representative or generalizable as desired for the background sequence. Thank you for noting this, we will update the description accordingly.

While generating global explanations, I have tried to include pruning, and the following error popped up: AssertionError: Pruning idx must be smaller than the sequence length. If not all events are pruned

This error is one of the reasons why we introduced the threshold feature, but regardless, it should be resolved even if the threshold is not considered. Could you please provide us with the version of TimeSHAP you are currently using? I attempted to replicate the error you mentioned, but I was not able to do so. Nevertheless, this error stems from the temp_coalition_pruning method. Specifically, the error can occur when all events are pruned, which happens when the background and the explained sequence are very similar. However, these lines here should have prevented the error.

I hope this answer is helpful, and I will be waiting for more information.

franciscomalveiro commented 1 year ago

I am using timeshap version 1.0.3. I have been trying to replicate the error in the notebook you provide, so that it could be easier to you to check, but I have not been successeful yet. I will take a look at that method also, and to the files and arguments I am providing to those functions to check if the problem resides there.

franciscomalveiro commented 1 year ago

So the error comes from here: https://github.com/feedzai/timeshap/blob/e155bb2ac9f8b437038507a41e8d3b684c047508/src/timeshap/explainer/kernel/timeshap_kernel.py#L287

Because pruning_idx == X.shape[1] = 4.

pruning_idx was originally 0, and the sequence length (4) was added to it. https://github.com/feedzai/timeshap/blob/e155bb2ac9f8b437038507a41e8d3b684c047508/src/timeshap/explainer/event_level.py#L332-L335

Inspecting the generated pruning_idx dataframe, there are a total of 573 (191 sequences * 3 tolerances (-1, 0.05, 0.075)) entries, 100 of which have got a pruning_idx = 0.

I have also inspected the lines you have pointed out, namely https://github.com/feedzai/timeshap/blob/e155bb2ac9f8b437038507a41e8d3b684c047508/src/timeshap/explainer/pruning.py#L190-L197 Note that both condtions expect that tolerance is not None

However, when performing a complete pruning with prune_all, https://github.com/feedzai/timeshap/blob/e155bb2ac9f8b437038507a41e8d3b684c047508/src/timeshap/explainer/pruning.py#L309-L319

the argument tolerance is indeed None when the function temp_coalition_pruning is called:

https://github.com/feedzai/timeshap/blob/e155bb2ac9f8b437038507a41e8d3b684c047508/src/timeshap/explainer/pruning.py#L407 Since pruning_idx is initialised to 0, https://github.com/feedzai/timeshap/blob/e155bb2ac9f8b437038507a41e8d3b684c047508/src/timeshap/explainer/pruning.py#L176

this value never changes, it is further added to it the sequence length, and the error pops up.

JoaoPBSousa commented 1 year ago

Hi @franciscomalveiro ,

Thank you for providing such a detailed analysis. It seems that the issue is related to the close score between the baseline and the explained instance. As a result, during the pruning process, TimeSHAP is able to meet the tolerance threshold by replacing the entire sequence with the baseline, leading to the pruning of the entire sequence.

The error being raised in the timeshap_kernel is accurate, as pruning the entire sequence is not a valid option for generating explanations. To address this, I can suggest two potential solutions:

Change the baseline instance to one that has a higher score difference. However, please note that this change will alter the explanations provided by TimeSHAP.
Skip the pruning process entirely in instances where pruning is not possible. While this can resolve the issue with pruning, it's important to note that interpreting the explanations may become challenging due to the low score difference. Note that this option is the one being used in these lines.

I hope this answer is helpful. If you have any further questions feel free to ask.

franciscomalveiro commented 1 year ago

I think it is all for now, anything more that comes up, I'll reopen the issue. Thanks a lot for your help!

feedzai / timeshap