Hello, I'm attempting to compute attributions using Integrated Gradients and Occlusion on a 1-layer GRU with BERT embeddings at the input. I notice that the highest attributions are almost always assigned to the last few tokens (mostly padding tokens), which looks like an artifact of the directional nature of RNNs. Did you face a similar issue with your models, and do you have any thoughts on how to resolve it?
From the example texts in your paper, attributions appear to be correctly assigned to sentiment-bearing tokens, but going through the repo to figure out how seems tedious.
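For context, here is a minimal sketch of the setup and of one workaround I'm considering: choosing the IG baseline to be the PAD embedding at every position, so that padding tokens receive exactly zero attribution by construction (since input minus baseline vanishes there). The model and all names (`GRUClassifier`, `integrated_gradients`, `PAD_ID`) are hypothetical toy stand-ins, not code from your repo:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
PAD_ID = 0

class GRUClassifier(nn.Module):
    """Toy stand-in for the real model: embeddings -> 1-layer GRU -> linear head."""
    def __init__(self, vocab=50, dim=16, classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim, padding_idx=PAD_ID)
        self.gru = nn.GRU(dim, dim, batch_first=True)
        self.fc = nn.Linear(dim, classes)

    def forward_from_embeds(self, e):
        _, h = self.gru(e)        # h: (num_layers, B, dim); take the last hidden state
        return self.fc(h[-1])     # (B, classes)

def integrated_gradients(model, ids, target, steps=32):
    # Baseline = PAD embedding at every position. At padding positions,
    # x - baseline == 0, so the attribution there is exactly zero.
    x = model.emb(ids).detach()                              # (B, T, dim)
    baseline = model.emb(torch.full_like(ids, PAD_ID)).detach()
    total = torch.zeros_like(x)
    for alpha in torch.linspace(0.0, 1.0, steps):
        e = (baseline + alpha * (x - baseline)).requires_grad_(True)
        out = model.forward_from_embeds(e)[:, target].sum()
        (grad,) = torch.autograd.grad(out, e)
        total += grad
    attr = (x - baseline) * total / steps                    # (B, T, dim)
    return attr.sum(-1)                                      # per-token attribution (B, T)

model = GRUClassifier().eval()
ids = torch.tensor([[5, 9, 3, PAD_ID, PAD_ID]])              # two trailing pads
attr = integrated_gradients(model, ids, target=1)
print(attr)
```

With an all-zeros baseline the pad positions still differ from the baseline through the GRU's recurrence and can pick up attribution; tying the baseline to the PAD embedding sidesteps that, though it doesn't address attribution leaking onto the last *non-pad* tokens.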