feedzai / timeshap

TimeSHAP explains Recurrent Neural Network predictions.

[Question/Request] Input parameters of global_report for specific data sequence #18

Closed Mrgr4vity closed 2 years ago

Mrgr4vity commented 2 years ago

Hey, thanks for the great work.

I have some questions about using this library for my own dataset (sorry if my questions might be easy to solve, I have been trying to solve them for three days, but no result).

I have an RNN model and an input dataset with the shape of (9517, 87, 37). It has a type of "np.ndarray" and each of the sequences (9517) has only one label (each sequence is an 87x37 table that has only one label).

First of all, I have a problem with the "calc_avg_event" function because its input is a "pd.DataFrame". In my case, each of the 9517 sequences is different from the others, so converting them all into a single dataframe doesn't seem meaningful. However, I can write custom code that calculates the median and mode for each of the 9517 sequences and then calculates the median and mode again across all of them.

The main problem, though, is the input parameters of the "global_report" function. More specifically, I don't know how to fill the "entity_col", "time_col", and "model_features" parameters of this function based on my dataset.

The example for TF has a different dataset, and I'm confused about whether the library can handle my dataset.

I really appreciate any help you can provide.

JoaoPBSousa commented 2 years ago

Hello @Mrgr4vity,

Regarding the first question, calc_avg_event is supposed to receive all sequences used to train the model and output the "average" event. This "average" event is the average event across the whole dataset and is not tailored to a specific sequence. That means that, for each feature, the average/median/mode is calculated across all 9517 * 87 = 827979 events of the training dataset. Essentially, we calculate the average/median/mode of all features using a pd.DataFrame of shape (827979, 37).
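To illustrate the flattening described above, here is a minimal sketch using only NumPy and pandas. It does not call calc_avg_event itself; it only shows how a 3D array of shape (sequences, events, features) can be collapsed into one row per event and how a per-feature median over all events can be computed. The helper name `flatten_events`, the `feat_i` column names, and the small random array are assumptions made for the example.

```python
import numpy as np
import pandas as pd


def flatten_events(data, feature_names):
    """Collapse (n_sequences, seq_len, n_features) into one row per event."""
    n_seq, seq_len, n_feat = data.shape
    flat = data.reshape(n_seq * seq_len, n_feat)
    return pd.DataFrame(flat, columns=feature_names)


# Small stand-in for the real (9517, 87, 37) array (hypothetical data).
rng = np.random.default_rng(0)
data = rng.normal(size=(5, 4, 3))
feature_names = [f"feat_{i}" for i in range(data.shape[2])]

events = flatten_events(data, feature_names)

# Per-feature median over every event of every sequence, as a one-row frame.
median_event = events.median().to_frame().T
```

With the real dataset, `events` would have shape (827979, 37), which is the kind of dataframe the reply says calc_avg_event expects.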

Regarding the second question, the fields you are referring to have two uses: (1) They allow us to decompose the data dataframe (shape (827979, 37)) into usable columns for TimeSHAP. We take the full dataset, select a specific entity using entity_col, and order its events by time_col to obtain the explained sequence, while model_features lists the columns that are fed to the model. (2) After obtaining the explanations, we add these fields to them so that all explanations can be persisted in a single csv file. In your specific use case, these columns/parameters lose their value, since your data is already split into ordered sequences.

If you want to use TimeSHAP out of the box on your specific dataset, I would advise you to take your input dataset of shape (9517, 87, 37) and create a pandas dataframe with shape (827979, 39) = (9517 * 87, 37 + 2). In this case, you would add two new columns to your dataset: an entity_col holding a sequence_id that identifies each of the 9517 individual sequences, and a time_col holding the index of each event, from 0 to 86 with 0 being the first event, so that each sequence can be ordered if necessary.
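A minimal sketch of that conversion, assuming the data is a NumPy array of shape (sequences, events, features). The helper name `to_long_dataframe` and the column names `sequence_id` and `timestep` are illustrative choices, not names required by TimeSHAP; whichever names are used would then be passed as entity_col and time_col.

```python
import numpy as np
import pandas as pd


def to_long_dataframe(data, feature_names,
                      entity_col="sequence_id", time_col="timestep"):
    """Turn (n_sequences, seq_len, n_features) into one row per event,
    adding a sequence identifier and a within-sequence event index."""
    n_seq, seq_len, n_feat = data.shape
    df = pd.DataFrame(data.reshape(n_seq * seq_len, n_feat),
                      columns=feature_names)
    # Each sequence id is repeated once per event: 0,0,...,1,1,...
    df[entity_col] = np.repeat(np.arange(n_seq), seq_len)
    # Event index restarts at 0 for every sequence: 0,1,...,seq_len-1,0,1,...
    df[time_col] = np.tile(np.arange(seq_len), n_seq)
    return df


# Tiny stand-in for the real (9517, 87, 37) array (hypothetical data).
data = np.arange(2 * 3 * 2, dtype=float).reshape(2, 3, 2)
feature_names = ["feat_0", "feat_1"]  # would be passed as model_features

df = to_long_dataframe(data, feature_names)

# Recover one sequence the way the reply describes: select the entity,
# then order its events by the time column.
seq_1 = (df[df["sequence_id"] == 1]
         .sort_values("timestep")[feature_names]
         .to_numpy())
```

For the real dataset this yields a dataframe of shape (827979, 39), with `feature_names` holding the 37 model feature columns.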

Thank you very much for raising this question. We will discuss a possible change to the interface so that our methods do not require these columns and can work directly on a 3D dataset. If you have any further questions, don't hesitate to contact us.

JoaoPBSousa commented 2 years ago

Closed this issue due to inactivity. If you have any further questions feel free to re-open the issue or create a new one.