
arXiv-2019/06-What Does BERT Look At? An Analysis of BERT's Attention #228


One-sentence summary:

This paper analyzes what the attention in BERT actually attends to. The authors find that BERT's attention heads exhibit common patterns, such as attending to delimiter tokens, attending to specific positional offsets, or attending broadly over the whole sentence. Heads within the same layer also tend to behave similarly. For example, certain heads consistently attend to the direct objects of verbs, determiners of nouns (e.g. "the"), objects of prepositions, and coreferent mentions.

Resources:

Paper info:

Notes:

1 Introduction

Our analysis focuses on the 144 attention heads in BERT.


Both BERT model sizes have a large number of encoder layers (which the paper calls Transformer Blocks) – twelve for the Base version, and twenty four for the Large version. These also have larger feedforward-networks (768 and 1024 hidden units respectively), and more attention heads (12 and 16 respectively) than the default configuration in the reference implementation of the Transformer in the initial paper (6 encoder layers, 512 hidden units, and 8 attention heads).

BERT modifies the default Transformer configuration: BERT-base uses 12 encoder layers with 12 attention heads each, so there are 12 × 12 = 144 attention heads in total.
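As a concrete reference, here is a minimal sketch of extracting all 144 attention maps, assuming the HuggingFace `transformers` library (the model name and example sentence are arbitrary choices, not from the paper):

```python
# Minimal sketch (assumes the HuggingFace transformers library):
# extract the 12 layers x 12 heads = 144 attention maps of BERT-base.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

inputs = tokenizer("The cat sat on the mat.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: tuple of 12 tensors, each (batch, heads, seq_len, seq_len)
attn = torch.stack(outputs.attentions).squeeze(1)  # (12 layers, 12 heads, seq, seq)
print(attn.shape)  # 12 * 12 = 144 attention maps in total
```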

We first explore generally how BERT’s attention heads behave. We find that there are common patterns in their behavior, such as attending to fixed positional offsets or attending broadly over the whole sentence. A surprisingly large amount of BERT’s attention focuses on the deliminator token [SEP], which we argue is used by the model as a sort of no-op. Generally we find that attention heads in the same layer tend to behave similarly.

Overall, heads within the same layer exhibit similar behavior.


Complementary to these approaches, we study the attention maps of a pre-trained model.

The analysis here is mainly about the attention maps.

We next probe each attention head for linguistic phenomena. In particular, we treat each head as a simple no-training-required classifier that, given a word as input, outputs the most-attended-to other word. We then evaluate the ability of the heads to classify various syntactic relations.

Only this small part is about how to compute "importance": each head is treated as a simple, no-training-required classifier that takes a word as input and outputs the other word it attends to most. (For relation extraction, the analogue would be to take a sentence as input and output the attended-to words.)
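A tiny sketch of this "head as a no-training classifier" idea, operating on a single head's attention map (my own illustration, not the authors' code):

```python
# Treat one attention head as a no-training classifier: the "prediction" for
# word i is simply the index of the word it attends to most.
import numpy as np

def most_attended_to(attn_head: np.ndarray, i: int) -> int:
    """attn_head: (seq_len, seq_len), attn_head[i, j] = attention of token i to token j."""
    return int(np.argmax(attn_head[i]))

# Toy usage with a random row-normalized "attention map".
rng = np.random.default_rng(0)
logits = rng.normal(size=(6, 6))
attn_head = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
print(most_attended_to(attn_head, i=2))
```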


These results are intriguing because the behavior of the attention heads emerges purely from self-supervised training on unlabeled data, without explicit supervision for syntax or coreference. (This could be used in an introduction to explain why one would want to introduce syntax information into BERT.)

The authors test each attention head on different tasks. Although no single head performs well across all tasks, certain heads specialize in particular phenomena: for example, there are heads that find the direct objects of verbs, the determiners of nouns, the objects of prepositions, and possessive pronouns. Tested on coreference resolution, BERT's heads also perform quite well. These results are interesting because they show what the self-attention model learns from unlabeled data alone, without any explicit supervision for syntax or coreference.

Our findings show that particular heads specialize to specific aspects of syntax. To get a more overall measure of the attention heads' syntactic ability, we propose an attention-based probing classifier that takes attention maps as input. The classifier achieves 77 UAS at dependency parsing, showing BERT's attention captures a substantial amount about syntax. Several recent works have proposed incorporating syntactic information to improve attention (Eriguchi et al., 2016; Chen et al., 2018; Strubell et al., 2018). Our work suggests that to an extent this kind of syntax-aware attention already exists in BERT, which may be one of the reasons for its success.

This paragraph again indicates that BERT has already learned dependency information. In other words, BERT already contains syntax information, so there may be no need to inject syntactic information explicitly. The papers that do inject syntactic information to improve attention are worth a look: Eriguchi et al., 2016; Chen et al., 2018; Strubell et al., 2018.

Akiko Eriguchi, Kazuma Hashimoto, and Yoshimasa Tsuruoka. 2016. Tree-to-sequence attentional neural machine translation. In ACL.
Kehai Chen, Rui Wang, Masao Utiyama, Eiichiro Sumita, and Tiejun Zhao. 2018. Syntax-directed attention for neural machine translation. In AAAI.
Emma Strubell, Patrick Verga, Daniel Andor, David Weiss, and Andrew McCallum. 2018. Linguistically-informed self-attention for semantic role labeling. In EMNLP.

2 Background: Transformers and BERT


(This could be reused when writing about self-attention.)

We use the “base” sized BERT model, which has 12 layers containing 12 attention heads each. We use `<layer>-<head>` to denote a particular attention head.

3 Surface-Level Patterns in Attention


3.1 Relative Position

We find that most heads put little attention on the current token itself. However, some heads, especially in the earlier layers, specialize in attending to the next or previous token. (Viewed this way, these heads already take care of attending to neighboring tokens, so for relation extraction, explicitly adding relative position features may not be necessary, since BERT has already learned this information.)
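A small sketch of how one might quantify this next/previous-token behavior, assuming an attention tensor of shape (layers, heads, seq_len, seq_len) like the one extracted earlier (the function name and the toy data are my own):

```python
# Average attention each head puts on the token `offset` positions away
# (offset=-1: previous token, offset=+1: next token).
import numpy as np

def offset_attention(attn: np.ndarray, offset: int) -> np.ndarray:
    """attn: (layers, heads, seq_len, seq_len) -> (layers, heads) scores."""
    n_layers, n_heads, seq_len, _ = attn.shape
    idx_from = np.arange(max(0, -offset), min(seq_len, seq_len - offset))
    idx_to = idx_from + offset
    return attn[:, :, idx_from, idx_to].mean(axis=-1)

# Toy usage with random row-normalized maps standing in for BERT's attention.
rng = np.random.default_rng(0)
attn = rng.dirichlet(np.ones(16), size=(12, 12, 16))  # (12, 12, 16, 16)
print(offset_attention(attn, +1).shape)  # (12, 12): "attend to next token" score per head
```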

3.2 Attending to Separator Tokens

Many heads concentrate their attention on a small number of tokens. For example, in layers 6-10, over half of the attention goes to [SEP].

(Figures from the paper: average attention to [CLS], [SEP], and punctuation by layer, and attention to [SEP] depending on whether the current token is itself [SEP].)

In the first figure, attention to [CLS] is high in layers 1-3 and attention to [SEP] is high in layers 5-10; such heavy attention to [SEP] seems a bit anomalous. In the second figure, when the current token is itself [SEP], the attention to [SEP] is even higher.

For example, over half of BERT’s attention in layers 6-10 focuses on [SEP]. To put this in context, since most of our segments are 128 tokens long, the average attention for a token occurring twice in a segment like [SEP] would normally be around 1/64.

I did not fully follow this sentence at first. Presumably the point is: the segments are about 128 tokens long, so under uniform attention each position would receive roughly 1/128 of the attention mass, and a token that appears twice in the segment, like [SEP], would receive about 2/128 = 1/64 on average. The observed share (over half of all attention in layers 6-10) is therefore far above what uniform attention would predict.

This may be because [SEP] and [CLS] are always present and never masked out, while commas and periods are probably the most frequent tokens after "the", which is why attention "focuses" so much on these tokens.

3.3 Focused vs Broad Attention

(Figure 4 from the paper: entropies of the attention distributions.)

Some heads in the lower layers spread their attention broadly over every word. The output of these heads can be viewed as a bag-of-vectors representation of the sentence.

The authors also measure the entropies of the attention distributions coming from the [CLS] token. For most layers, the entropies from [CLS] are close to the ones shown in Figure 4 (presumably meaning close to the per-layer average entropies over all tokens), but in the last layer the attention from [CLS] has very high entropy, indicating broad attention. This finding makes sense because during pre-training the [CLS] token is used as the input for the "next sentence prediction" objective.
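A sketch of the entropy measurement, again assuming an attention tensor of shape (layers, heads, seq_len, seq_len); this is just the standard entropy formula, not the authors' code:

```python
# Entropy of each head's attention distribution: high = broad attention
# spread over many tokens, low = focused on a few tokens.
import numpy as np

def attention_entropy(attn: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """attn: (layers, heads, seq_len, seq_len) -> (layers, heads),
    entropy per query position, averaged over query positions."""
    ent = -(attn * np.log(attn + eps)).sum(axis=-1)
    return ent.mean(axis=-1)

def cls_entropy(attn: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Entropy of the attention *from* [CLS] only (token index 0 in BERT inputs)."""
    cls_row = attn[:, :, 0, :]
    return -(cls_row * np.log(cls_row + eps)).sum(axis=-1)
```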

4 Probing Individual Attention Heads

we investigate individual attention heads to probe what aspects of language they have learned. In particular, we evaluate attention heads on labeled datasets for tasks like dependency parsing. An overview of our results is shown in Figure 5. (Probe what individual heads have learned, evaluated on labeled datasets.)

(Figure 5 from the paper: examples of attention head behaviors.)

4.1 Method

Although we would like to evaluate heads at the word level, BERT uses byte-pair tokenization, which means some words (8% in our data) are split into multiple tokens. We therefore convert token-token attention maps into word-word attention maps: for attention *to* a split-up word, sum the attention weights over its tokens; for attention *from* a split-up word, take the mean of the attention weights over its tokens. (Here "to" refers to the attention a word receives, i.e. the columns of its tokens, which are summed, and "from" refers to the attention a word gives out, i.e. the rows of its tokens, which are averaged; this keeps each word's outgoing attention summing to 1.)
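A sketch of this token-to-word conversion as I read it. Here `word_ids` maps each token position to the index of the word it belongs to; the function itself is my own illustration:

```python
# Convert a token-token attention map to a word-word attention map:
# attention *to* a split word = sum over its tokens' columns,
# attention *from* a split word = mean over its tokens' rows.
import numpy as np

def token_to_word_attention(attn: np.ndarray, word_ids: list[int]) -> np.ndarray:
    """attn: (seq_len, seq_len) token-level map -> (n_words, n_words) word-level map."""
    n_words = max(word_ids) + 1
    seq_len = len(word_ids)
    # 1) merge columns: sum the attention received by all tokens of the same word
    to_words = np.zeros((seq_len, n_words))
    for t, w in enumerate(word_ids):
        to_words[:, w] += attn[:, t]
    # 2) merge rows: average the attention given out by all tokens of the same word
    word_attn = np.zeros((n_words, n_words))
    for t, w in enumerate(word_ids):
        word_attn[w] += to_words[t]
    counts = np.bincount(word_ids, minlength=n_words)
    return word_attn / counts[:, None]
```

Summing over columns but averaging over rows keeps each word's outgoing attention summing to 1.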

We ignore [SEP] and [CLS], although in practice this does not significantly change the accuracies for most heads.

4.2 Dependency Syntax

I exchanged emails with the authors about this part: the evaluation uses the dependent to predict its head. Table 1 shows that no single attention head demonstrates strong overall syntactic ability across relation types. The baseline predicts a fixed offset.

(Table 1 from the paper.)

The paper then reports observations on different dependency relations; Figure 5 illustrates some of these attention behaviors. Although it is somewhat surprising how similar the machine-learned attention weights are to human-annotated syntactic relations, the authors note that particular heads have particular relations they are good at.
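To make this evaluation setup concrete, here is a rough sketch of scoring one head as a syntactic-head predictor, together with the fixed-offset baseline. The function names and the `gold_heads` format are my own assumptions for illustration, not the paper's code:

```python
# Score a word-level attention map of one head at predicting each word's
# syntactic head, and compare against a fixed-offset baseline.
import numpy as np

def head_prediction_accuracy(word_attn: np.ndarray, gold_heads: list[int]) -> float:
    """word_attn: (n_words, n_words); gold_heads[i] = index of word i's syntactic head."""
    pred = word_attn.argmax(axis=-1)
    return float(np.mean([pred[i] == h for i, h in enumerate(gold_heads)]))

def fixed_offset_baseline(gold_heads: list[int], offset: int) -> float:
    """Accuracy of always predicting the word `offset` positions away as the head."""
    n = len(gold_heads)
    hits = [(i + offset) == h for i, h in enumerate(gold_heads) if 0 <= i + offset < n]
    return float(np.mean(hits))
```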

4.3 Coreference Resolution

Skipping this part.

5 Probing Attention Head Combinations

Because individual attention heads are only good at specific aspects of syntax, the model's syntactic "knowledge" is distributed across different heads. The authors therefore propose a way to measure the model's overall syntactic knowledge: a novel family of attention-based probing classifiers, which they apply to dependency parsing. For these classifiers, the BERT attention outputs are treated as fixed, i.e. no gradients are back-propagated into BERT.

The probing classifiers are basically graph-based dependency parsers. Given an input word, the classifier produces a probability distribution over other words in the sentence indicating how likely each other word is to be the syntactic head of the current one. (I.e., predicting the syntactic head.)
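A rough sketch of such an attention-only probe in PyTorch. The exact parameterization below (one learned weight per head per direction, a softmax over candidate syntactic heads, BERT's attention kept frozen) is my own illustrative reading of this description, not the authors' released code:

```python
import torch
import torch.nn as nn

class AttentionOnlyProbe(nn.Module):
    """Graph-based dependency probe that only sees BERT's (frozen) attention maps."""
    def __init__(self, n_layers: int = 12, n_heads: int = 12):
        super().__init__()
        n = n_layers * n_heads
        self.w = nn.Parameter(torch.zeros(n))  # weight on attention i -> j
        self.u = nn.Parameter(torch.zeros(n))  # weight on attention j -> i

    def forward(self, attn: torch.Tensor) -> torch.Tensor:
        """attn: (n_layers, n_heads, seq_len, seq_len), detached so no gradient
        flows back into BERT. Returns (seq_len, seq_len) log-probabilities:
        row i is a distribution over candidate syntactic heads j for word i."""
        n_layers, n_heads, seq_len, _ = attn.shape
        a = attn.reshape(n_layers * n_heads, seq_len, seq_len)
        scores = torch.einsum("k,kij->ij", self.w, a) + torch.einsum("k,kij->ji", self.u, a)
        return torch.log_softmax(scores, dim=-1)
```

The probe would then be trained with cross-entropy against gold syntactic heads while BERT itself stays frozen, matching the point above that the attention output is kept fixed.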

7 Related Work

Michel et al. (2019) similarly show that many of BERT’s attention heads can be pruned. Although our analysis in this paper only found interpretable behaviors in a subset of BERT’s attention heads, these recent works suggest that there might not be much to explain for some attention heads because they have little effect on model performance.

https://arxiv.org/abs/1905.10650