@Eric-Wallace probably has the best answer here. I know it's what he did in previous papers, but I don't remember why.
This paper has an explanation in Section 2: https://arxiv.org/abs/1804.07781.
Basically, one definition of the "importance" of a word is the change in the prediction probability when that word is removed. Taking gradient * embedding gives a first-order approximation of what would happen if you set that word's embedding to the all-zero vector. Zeroing the embedding isn't quite the same as removing the word, but hopefully it's a close approximation to it.
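To make that concrete, here is a minimal sketch (not AllenNLP's actual implementation) of grad * embedding saliency in PyTorch. It assumes you already have a scalar `loss` computed from `embeddings` of shape `(seq_len, emb_dim)` that are part of the autograd graph; the function and variable names are just for illustration.

```python
import torch

def grad_times_embedding_saliency(loss: torch.Tensor,
                                  embeddings: torch.Tensor) -> torch.Tensor:
    """Return one importance score per token."""
    # d(loss)/d(embedding): same shape as the embeddings.
    grads = torch.autograd.grad(loss, embeddings, retain_graph=True)[0]

    # First-order estimate of the change in the loss if a token's
    # embedding e were replaced by the all-zero vector:
    #   loss(0) - loss(e) ≈ grad · (0 - e) = -(grad · e)
    # The per-token dot product collapses the embedding dimension
    # into a single score, which is why grad is multiplied by the
    # embedding instead of being used on its own.
    scores = (grads * embeddings).sum(dim=-1)

    # Normalize the magnitudes into a distribution over tokens,
    # as saliency maps usually are.
    scores = scores.abs()
    return scores / scores.sum()
```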
https://github.com/allenai/allennlp/blob/b85c86cff6f0995002dca6216ba2e3aefe403d11/allennlp/interpret/saliency_interpreters/simple_gradient.py#L39
I can't find an explanation for this line in the paper "AllenNLP Interpret: A Framework for Explaining Predictions of NLP Models."
What is the justification for multiplying the gradient by the embedding instead of using the raw gradient?
Thank you.