Open josh-marsh opened 3 years ago
Please read my blog post here; maybe it will give you more intuition about how it works, as this is an open-ended question 🤭. We take `w = x[0, text_mask]`, the first row vector, because it is related to the [CLS] token, which holds the "general" information about the whole sentence (at least we believe that is true). What I've done is nothing new (or maybe only a little bit new). This is the so-called "gradient-based attribution". I really recommend reading Yonatan Belinkov's works, especially his excellent survey here.
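As a rough sketch of the gradient-based attribution described above (illustrative shapes and a hypothetical `text_mask`, not the library's actual tensors):

```python
import numpy as np

# Hypothetical shapes: (num_layers, num_heads, seq_len, seq_len)
rng = np.random.default_rng(0)
num_layers, num_heads, seq_len = 2, 4, 6
attention_scores = rng.random((num_layers, num_heads, seq_len, seq_len))
gradients = rng.random((num_layers, num_heads, seq_len, seq_len))

# Element-wise product, then sum over the layer and head axes,
# leaving a single (seq_len, seq_len) token-to-token map.
x = (attention_scores * gradients).sum(axis=(0, 1))

# text_mask selects the ordinary text tokens; here we assume the
# first and last positions are special tokens ([CLS] and [SEP]).
text_mask = np.zeros(seq_len, dtype=bool)
text_mask[1:-1] = True

# Row 0 of the map: how strongly [CLS] relates to each text token.
w = x[0, text_mask]
print(w.shape)  # (4,)
```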
Thank you, this is really useful!

One remaining question I have: why do you use `w = x[0, text_mask]` and not `w = x[text_mask, 0]`, given that the first column is also related to the [CLS] token for each sentence? The two slices have different values, so whether the row or the column related to [CLS] is used affects `w` and hence the visualizations.
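For concreteness, the two slices do differ whenever the combined map is not symmetric; a tiny illustration with made-up numbers:

```python
import numpy as np

# x is generally not symmetric, so row 0 (what [CLS] attends to)
# and column 0 (what attends to [CLS]) pick out different values.
x = np.array([[0.5, 0.2, 0.3],
              [0.1, 0.6, 0.3],
              [0.4, 0.4, 0.2]])
text_mask = np.array([False, True, True])

row = x[0, text_mask]   # first row, text tokens only
col = x[text_mask, 0]   # first column, text tokens only
print(np.allclose(row, col))  # False
```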
Note: if anyone is interested, I found the best way to combine patterns was `pattern_vectors.max(axis=0) / pattern_vectors.max()`.
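A small sketch of that combination step, assuming `pattern_vectors` holds one importance vector per pattern with shape `(num_patterns, num_tokens)`:

```python
import numpy as np

# Hypothetical pattern_vectors: two patterns over three tokens.
pattern_vectors = np.array([[0.2, 0.8, 0.1],
                            [0.5, 0.3, 0.9]])

# Take the strongest signal per token across patterns,
# then rescale so the global maximum is exactly 1.
combined = pattern_vectors.max(axis=0) / pattern_vectors.max()
print(combined)  # approx [0.556, 0.889, 1.0]
```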
Hi,

Firstly, just want to say what a wonderful resource this is! I have several questions about the BasicPatternRecognizer:

1. You use `x = tf.reduce_sum(x, axis=[0, 1], keepdims=True)` to combine the `attention_scores * gradients` for all heads and layers. I think I understand why this works, but I've never seen it done before.
2. In `w = x[0, text_mask]`, what is the `0` doing specifically? Why do we care about the first row, and why do we use it to calculate the importance of a given pattern?

Thank you so much!
Josh
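For reference, the reduction asked about collapses the layer and head axes while keeping them as size-1 dimensions; this NumPy sketch mirrors `tf.reduce_sum(x, axis=[0, 1], keepdims=True)` with illustrative shapes:

```python
import numpy as np

# Illustrative shape: (layers=2, heads=4, seq_len=6, seq_len=6).
x = np.ones((2, 4, 6, 6))

# Sum over layers and heads; keepdims retains them as size-1 axes,
# so downstream indexing like x[0, ...] still works unchanged.
x = x.sum(axis=(0, 1), keepdims=True)
print(x.shape)  # (1, 1, 6, 6)
```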