Thank you for sharing the code.
I'm confused about something, and I'd appreciate it if you could tell me whether my understanding is correct.
Are you using the outputs of all attention heads for the analysis?
The paper you mentioned, 'Roles and Utilization of Attention Heads in Transformer-based Neural Language Models', appears to use only selected heads, but your code seems to use all of them. Is this correct?
After extracting all the features, are they concatenated and used as the input to a single linear binary classifier?
If they are concatenated, I would guess the resulting dimensionality is quite large.
Yes, they are concatenated. But that's fine because we use regularization in our logistic regression, so it works even when the number of features exceeds the number of examples in the training set.
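A minimal sketch of the setup described above, using scikit-learn. All shapes here (number of layers, heads, feature size, and train-set size) are hypothetical placeholders, not the actual values from the code under discussion; the point is that an L2-regularized logistic regression fits without trouble even when the concatenated feature dimension far exceeds the number of training examples.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical dimensions: 12 layers x 12 heads, 64 features per head.
n_examples = 200
n_layers, n_heads, d = 12, 12, 64
n_features = n_layers * n_heads * d  # 9216, far more than n_examples

# Per-head features concatenated into one vector per example.
features = rng.standard_normal((n_examples, n_features))
labels = rng.integers(0, 2, size=n_examples)

# L2 penalty (the default) keeps the problem well-posed despite
# n_features >> n_examples; C is the inverse regularization strength.
clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
clf.fit(features, labels)
print(clf.coef_.shape)  # one weight per concatenated feature
```

Stronger regularization (smaller `C`) is the usual lever if the classifier starts to overfit in this regime.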