One-sentence summary:

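For the NMT task, this paper analyzes what the individual attention heads in the encoder actually do. Multi-head self-attention is a key component of the Transformer. We find that the most important heads play consistent roles that are often linguistically interpretable. When heads are then pruned, starting with the unimportant ones, we observe that specialized heads are last to be pruned (? what does this sentence mean). Our pruning method removes most of the heads with little impact on translation quality.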

Resources:

pdf
code
paper-with-code

Paper information:

Author: University of Amsterdam
Dataset:
keywords:

Notes:

1 Introduction

Recent attempts to investigate the kinds of information learned by the model’s encoder (Raganato and Tiedemann, 2018 #236; this targets NMT, so there is not much to borrow from it)

Previous analysis of multi-head attention considered the average of attention weights over all heads at a given position or focused only on the maximum attention weights (Voita et al., 2018 #237; Tang et al., 2018 #238), but neither approach accounts for how the contributions of different heads vary, so the role played by an individual head cannot be determined.

We try to answer the following questions:

To what extent does translation quality depend on individual encoder heads?
Do individual encoder heads play consistent and interpretable roles? If so, which are the most important ones for translation quality?
Which types of model attention (encoder self-attention, decoder self-attention or decoder-encoder attention) are most sensitive to the number of attention heads and on which layers?
Can we significantly reduce the number of attention heads while preserving translation quality?

We start by identifying the most important heads in each encoder layer using layer-wise relevance propagation (Ding et al., 2017 #239). (First find the most important heads in each encoder layer.) For the heads judged important, we then determine what roles they play. The roles are:

positional (heads attending to an adjacent token),
syntactic (heads attending to tokens in a specific syntactic dependency relation)
attention to rare words (heads pointing to the least frequent tokens in the sentence)

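The above only covers the important heads. For the remaining heads, we also need to determine whether they play important but overlooked roles or are simply redundant. To that end, the paper introduces a method for pruning heads based on Louizos et al. (2018 #240). Although the number of active heads cannot be used directly as a penalty term in the learning objective (an L0 regularizer), a differentiable relaxation can be used instead. Attention heads are pruned during continued training, starting from the full model and ending when only the heads with clearly identifiable functions remain. These experiments corroborate the layer-wise relevance propagation results; in particular, heads with clearly positional or syntactic functions are pruned last, which shows what information really matters for the translation task.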

2 Transformer Architecture

The Transformer uses multi-head attention in three different ways: encoder self-attention, decoder self-attention and decoder-encoder attention. In this work, we concentrate primarily on encoder self-attention. (This paper mainly focuses on encoder self-attention.)

3 Data and setting

Source language: English
Target languages: Russian, German and French

4 Identifying Important Heads

We define the “confidence” of a head as the average of its maximum attention weight excluding the end of sentence symbol, where average is taken over tokens in a set of sentences used for evaluation (development set). A confident head is one that usually assigns a high proportion of its attention to a single token. Intuitively, we might expect confident heads to be important to the translation task. (How a head is defined as confident: its confidence is the average of its maximum attention weight, not counting EOS. I did not fully understand this. Is the average taken over the same tokens across different sentences in the dev set?)
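The confidence measure above can be sketched in a few lines. This is only one reading of the definition, under assumptions made for this note: the attention of one head is given per sentence as a matrix with one row per source token, the last column is the end-of-sentence symbol, and every row (including the EOS row) is kept. The name `head_confidence` and the toy matrices are hypothetical.

```python
def head_confidence(sentences):
    """Mean, over all tokens, of the largest attention weight outside the EOS column.

    `sentences` is a list of attention matrices for ONE head. Each matrix is a
    list of rows (one per query token); the last column is assumed to be EOS.
    """
    maxima = []
    for matrix in sentences:
        for row in matrix:
            # Drop the EOS column (assumption: EOS is the last key position).
            maxima.append(max(row[:-1]))
    return sum(maxima) / len(maxima)


# Toy dev set with two short sentences (made-up weights).
sent_a = [[0.9, 0.05, 0.05],
          [0.1, 0.8, 0.1],
          [0.2, 0.2, 0.6]]
sent_b = [[0.5, 0.5],
          [0.3, 0.7]]
confidence = head_confidence([sent_a, sent_b])
print(round(confidence, 2))
```

Under this reading, the average runs over every token of every dev sentence, not over occurrences of the same token across sentences.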

Layer-wise relevance propagation (LRP) (Ding et al., 2017 #239) is a method for computing the relative contribution of neurons at one point in a network to neurons at another. Here we propose to use LRP to evaluate the degree to which different heads at each layer contribute to the top-1 logit predicted by the model. Heads whose outputs have a higher relevance value may be judged to be more important to the model’s predictions. (LRP computes the contribution of neurons relative to other neurons. Here it is used to measure, for the heads in each layer, how much each head contributes to the predicted top-1 logit.)

The LRP results are shown in Figures 1a, 2a, 2c. In each layer, LRP ranks a small number of heads as much more important than all others.

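The confidence of each head is shown in Figure 1b. The relevance computed by LRP in Figure 1a follows a similar trend to the confidence in Figure 1b. The one clear exception is head 1-1: LRP ranks it as very important in Figure 1a, but its confidence in Figure 1b is low. This head is analyzed further in Section 5.3.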

5 Characterizing heads

This section investigates whether heads play identifiable roles.

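For the heads that LRP judges important, we further analyze whether they perform one of three functions: (So "function" here means a role; I had read it as a formula. For an RE task, these functions would need to change.)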

positional: the head points to an adjacent token,
syntactic: the head points to tokens in a specific syntactic relation,
rare words: the head points to the least frequent tokens in a sentence.

5.1 Positional heads

If at least 90% of the time (?) a head's maximum attention weight is assigned to a specific relative position (-1 or +1), we consider it "positional". These heads are shown in purple in Figures 1c for English-Russian, 2b for English-German, 2d for English-French and marked with the relative position.
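The 90% rule can be made concrete with a small sketch. It assumes that "90% of the time" means 90% of the tokens in the evaluation sentences, and that the attention of one head is available as one matrix per sentence (rows are query tokens, columns are key tokens). The function name `positional_offset` and the toy heads are hypothetical.

```python
def positional_offset(sentences, threshold=0.9):
    """Return -1 or +1 if the head's maximum attention weight falls on that
    relative position for at least `threshold` of the tokens, else None."""
    counts = {-1: 0, +1: 0}
    total = 0
    for matrix in sentences:
        for i, row in enumerate(matrix):
            # Index of the key token that receives the largest weight.
            j = max(range(len(row)), key=row.__getitem__)
            total += 1
            if j - i in counts:
                counts[j - i] += 1
    for offset, hits in counts.items():
        if total and hits / total >= threshold:
            return offset
    return None


n = 20
# Toy head that always looks at the previous token (token 0 looks at itself).
prev_head = [[1.0 if j == max(i - 1, 0) else 0.0 for j in range(n)] for i in range(n)]
# Toy head that spreads its attention uniformly.
uniform_head = [[1.0 / n] * n for _ in range(n)]

print(positional_offset([prev_head]))
print(positional_offset([uniform_head]))
```

In the toy data, 19 of the 20 tokens of `prev_head` attend to position -1, which clears the threshold, while `uniform_head` has no preferred offset.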

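The more confident heads, and the heads that LRP ranks as important, can largely be classified as "positional". (The figures show them mainly in layers 2 to 5; in particular, heads 1 and 2 of layer 2 are clearly "positional" for all three language pairs.)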

5.2 Syntactic heads

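We hypothesize that, when trained for translation, the Transformer encoder may learn information about syntactic structure on its own. We therefore want to know whether a head attends to the “major syntactic relations” in a sentence (how is this defined?). In our analysis, we look at the following dependency relations: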

nominal subject (nsubj),
direct object (dobj),
adjectival modifier (amod),
adverbial modifier (advmod); the relations examined also include other main verbal arguments.

5.2.1 Methodology

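Compare each head's attention weights, treated as predictions of a specific dependency relation, against the dependency parses predicted by CoreNLP. For each head, compute which token receives its maximum attention weight. (This can be read as head -> (attention weight) -> dependent.)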

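Accuracy here can be understood as how often the head's maximum attention weight lands on the correct dependent. Some dependency relations are frequently observed at fixed relative positions (see Fig. 3), so the baseline always predicts the most frequent relative position for the relation. If a head's accuracy is at least 10% higher than the baseline, we call it syntactic.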
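A minimal sketch of the accuracy check described above, under stated assumptions: gold relation pairs come from a parser and are given as `(sentence_id, source_index, target_index)`, the head's predictions map `(sentence_id, source_index)` to the token with the maximum attention weight, and "10% higher" is read as 10 absolute points. All names (`relation_accuracy`, `positional_baseline`, `is_syntactic`) and the toy data are hypothetical.

```python
from collections import Counter


def relation_accuracy(pairs, predicted):
    """Share of gold pairs whose target is the token the head attends to most."""
    hits = sum(1 for sent, src, tgt in pairs if predicted.get((sent, src)) == tgt)
    return hits / len(pairs)


def positional_baseline(pairs):
    """Accuracy of always guessing the most frequent relative offset."""
    offsets = Counter(tgt - src for _, src, tgt in pairs)
    return offsets.most_common(1)[0][1] / len(pairs)


def is_syntactic(pairs, predicted, margin=0.10):
    # Assumption: the 10% margin is absolute, not relative to the baseline.
    return relation_accuracy(pairs, predicted) >= positional_baseline(pairs) + margin


# Toy gold pairs for one relation across two sentences.
gold = [(0, 1, 0), (0, 3, 2), (1, 2, 0), (1, 4, 3)]
good_head = {(0, 1): 0, (0, 3): 2, (1, 2): 0, (1, 4): 3}
weak_head = {(0, 1): 0, (0, 3): 2, (1, 2): 1, (1, 4): 3}

print(positional_baseline(gold))
print(is_syntactic(gold, good_head), is_syntactic(gold, weak_head))
```

Here three of the four gold pairs sit at offset -1, so the baseline scores 0.75; `weak_head` only matches that baseline and therefore does not count as syntactic.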

5.2.2 Results

Table 1 shows the accuracy of the most accurate heads. The best-performing heads outperform the baseline. Clearly certain heads learn to detect syntactic relations with accuracies significantly higher than the positional baseline. This supports our hypothesis that the encoder does perform some amount of syntactic disambiguation.

Several heads track the same dependency relation; these are the heads shown in green in Figures 1c, 2b, 2d.

However, we cannot draw a simple conclusion about how these learned syntactic attention heads relate to target-language morphology.

5.3 Rare words

In all models (across target languages), we find that head 1-1 is important for identifying rare words.
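One way to check the "rare words" function listed in Section 5 is to measure how often a head's maximum attention weight falls on the least frequent token of the sentence. The sketch below is a simplification made for this note: it considers only the single rarest token per sentence, takes token frequencies from a hypothetical `frequency` dictionary, and uses made-up attention weights.

```python
def rare_word_rate(sentences, frequency):
    """Share of tokens whose maximum attention weight points at the rarest token.

    `sentences` is a list of (tokens, attention_matrix) pairs for one head;
    `frequency` maps a token to its corpus count (unseen tokens count as 0).
    """
    hits = total = 0
    for tokens, matrix in sentences:
        rarest = min(range(len(tokens)), key=lambda j: frequency.get(tokens[j], 0))
        for row in matrix:
            target = max(range(len(row)), key=row.__getitem__)
            hits += int(target == rarest)
            total += 1
    return hits / total


tokens = ["the", "cat", "saw", "the", "axolotl"]
frequency = {"the": 1000, "cat": 50, "saw": 40, "axolotl": 1}
# Toy head: every token puts most of its weight on the last (rarest) token.
rare_head = [[0.1, 0.1, 0.1, 0.1, 0.6] for _ in tokens]
# Toy head: every token attends to itself.
diag_head = [[1.0 if i == j else 0.0 for j in range(5)] for i in range(5)]

print(rare_word_rate([(tokens, rare_head)], frequency))
print(rare_word_rate([(tokens, diag_head)], frequency))
```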

6 Pruning Attention Heads

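Section 5 discussed the properties of the heads judged "important". Section 6 turns to the remaining heads and asks whether they are useful, which is tested by pruning attention heads. The method follows Louizos et al. (2018 #240), but whereas the original method prunes individual neural network weights, here whole components (heads) are pruned.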

6.1 Method

Starting from Equation (3), the output of each head is multiplied by a scalar gate g_i.

Equation (3) turns into

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}_i(g_i \cdot \mathrm{head}_i)\, W^O$$

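The gates g_i are parameters specific to heads and independent of the input. Because we want to switch off the less important heads completely, we apply L0 regularization to g_i. (The L0 norm counts the number of non-zero components of a vector.) The idea is to use the L0 penalty to push the gates of unimportant heads to zero, which disables those heads.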
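To make the gating idea concrete, here is a small sketch with deterministic gates. It assumes each head output is a plain list of floats and leaves out the output projection that follows the concatenation; `gated_concat` and `l0_norm` are names chosen for this note.

```python
def gated_concat(head_outputs, gates):
    """Scale each head's output by its scalar gate, then concatenate."""
    out = []
    for gate, head in zip(gates, head_outputs):
        out.extend(gate * x for x in head)
    return out


def l0_norm(gates):
    """Number of non-zero gates, i.e. the number of heads still active."""
    return sum(1 for gate in gates if gate != 0)


heads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
gates = [1.0, 0.0, 1.0]  # the second head is switched off

print(gated_concat(heads, gates))
print(l0_norm(gates))
```

A gate of exactly zero removes the head's contribution no matter what the input is, which is why the penalty is placed on the gates and not on the attention weights themselves.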

However, the L0 norm is not differentiable, so it cannot be used directly as a regularization term in the objective function. We therefore use a stochastic relaxation: each gate g_i is now a random variable drawn independently from a head-specific distribution. (If each gate is drawn from a head-specific distribution, why "independently"?) Hard Concrete distributions (Louizos et al., 2018) are used here.
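The sketch below shows how a Hard Concrete gate can be sampled and how the relaxed L0 penalty (the sum of the probabilities that each gate is non-zero) can be computed, following the formulation in Louizos et al. (2018). The stretch limits and temperature are the defaults suggested in that paper; the values used for the pruning experiments are not given in these notes, so treat them as assumptions. Each head has its own parameter `log_alpha` (head-specific), and each gate is sampled with its own noise draw (independent).

```python
import math
import random

# Defaults from Louizos et al. (2018); assumed here, not taken from these notes.
GAMMA, ZETA, BETA = -0.1, 1.1, 2.0 / 3.0


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_gate(log_alpha, rng):
    """Draw one gate in [0, 1] from a Hard Concrete distribution."""
    u = rng.uniform(1e-6, 1.0 - 1e-6)
    s = sigmoid((math.log(u) - math.log(1.0 - u) + log_alpha) / BETA)
    stretched = s * (ZETA - GAMMA) + GAMMA  # stretch to (GAMMA, ZETA)
    return min(1.0, max(0.0, stretched))    # clip, giving exact 0s and 1s


def prob_open(log_alpha):
    """Probability that the gate is non-zero."""
    return sigmoid(log_alpha - BETA * math.log(-GAMMA / ZETA))


def expected_l0(log_alphas):
    """Differentiable surrogate for the number of active heads."""
    return sum(prob_open(a) for a in log_alphas)


rng = random.Random(0)
log_alphas = [-5.0, 0.0, 5.0]  # one parameter per head (toy values)
gates = [sample_gate(a, rng) for a in log_alphas]
penalty = expected_l0(log_alphas)
closed = sum(1 for _ in range(1000) if sample_gate(-10.0, rng) == 0.0)
print(gates, penalty, closed)
```

Because the stretched value is clipped, a gate can be exactly 0 or exactly 1 with non-zero probability, which is what lets a head be switched off completely while the penalty stays differentiable in `log_alpha`.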

6.2.1 Quantitative results: BLEU score

Pruning works well: the encoder keeps translation quality close to the original with only a subset of its heads.

6.2.2 Functions of retained heads

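Different colors indicate different functions (for example, which dependency relation a head tracks). Each column corresponds to a particular number of heads remaining after pruning.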

6.3 Pruning all types of attention heads

6.3.1 Quantitative results: BLEU score

Table 2.

While these results show clearly that the majority of attention heads can be removed from the fully trained model without significant loss in translation quality, it is not clear whether a model can be trained from scratch with such a small number of heads.

6.3.2 Heads importance

This part covers not only the encoder heads but also the decoder self-attention and decoder-encoder attention heads.


ACL-2019/06-Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned #235