DFKI-NLP / thermostat

Collection of NLP model explanations and accompanying analysis tools
Apache License 2.0

LIME token similarity kernel for input.shape[0] > 1 #9

Closed · nfelnlp closed this issue 3 years ago

nfelnlp commented 3 years ago

The assertion in the token_similarity_kernel function of ExplainerLimeBase

assert original_input.shape[0] == perturbed_input.shape[0]  == 1

https://github.com/nfelnlp/thermostat/blob/24177342945e834552a6df956ae59fdf1e69335b/src/thermostat/explainers/lime.py#L47

only works for IMDB so far. An error is thrown for MNLI, so I debugged it and found that the two input shapes can still be equal even though they are not 1. I assume this is because MNLI has two text fields ("premise", "hypothesis") instead of one. The calculation below can still be performed with .shape[0] == 2.

Do you think removing the == 1 at the end of the assertion would be fine?
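
For reference, a minimal sketch of what I have in mind (the signature loosely follows Captum's similarity_func convention, and the similarity computation is only a placeholder, not the actual kernel body in lime.py):

```python
import torch

def token_similarity_kernel(original_input, perturbed_input,
                            perturbed_interpretable_input, **kwargs):
    # Relaxed check: drop the trailing `== 1`, so the two inputs only
    # need to agree along dim 0 (which can be 2 for MNLI-style inputs).
    assert original_input.shape[0] == perturbed_input.shape[0]
    # Placeholder similarity: fraction of token positions that match.
    return torch.sum(original_input == perturbed_input) / original_input.numel()
```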

rbtsbg commented 3 years ago

That should be the batch dimension, which should be set in the configs.

The two MNLI input strings are typically concatenated by the tokenizer, using a [SEP] separator token.

The tensor that is passed to the explainer should be the one returned by data.py, where such things are handled, i.e. the concatenation.
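
To illustrate the concatenation (a minimal example with the Hugging Face tokenizers API; the model name is just an example):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Passing the two MNLI text fields as a pair: the tokenizer joins them
# into a single sequence, separated by the [SEP] token.
encoded = tokenizer("A man is eating.", "Somebody is eating food.")
print(tokenizer.decode(encoded["input_ids"]))
# -> [CLS] a man is eating. [SEP] somebody is eating food. [SEP]
```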

Note that some models do not have the batch dim at position 0. In these cases a workaround needs to be devised.

nfelnlp commented 3 years ago

Ah, of course, thanks. Totally forgot about the batch dimension. So for LIME, the batch dimension should always be 1? I'll write a separate assertion for that, then. What do you suggest for the workaround you mentioned?
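
Something like this, maybe (a hypothetical standalone check; the exact wording and placement are guesses):

```python
# Hypothetical batch-dimension check, run separately from the shape
# equality in token_similarity_kernel; assumes the batch dim is dim 0.
assert original_input.shape[0] == 1, (
    f"LIME explainer jobs require batch size 1, got {original_input.shape[0]}"
)
```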

nfelnlp commented 3 years ago

Both batch_size and internal_batch_size have to be exactly 1 in order to run explainer jobs with LIME.
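
For example, the relevant fields might look like this (a hedged sketch as a Python dict; the layout is an assumption based on this thread, not the exact thermostat config schema):

```python
# Illustrative config fragment: both batch sizes pinned to 1 for LIME.
config = {
    "dataset": {"batch_size": 1},
    "explainer": {"name": "LimeBase", "internal_batch_size": 1},
}
```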