Suggestion to add an attn_head_size parameter to ViT Transformer Attention

libertatis commented 2 years ago

In the ViT transformer implementation (ViT Transformer Attention), the per-head size attn_head_size of the multi-head attention is computed from the embed_dim and num_heads arguments:

self.attn_head_size = int(embed_dim / self.num_heads)

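I think this implementation has at least two problems:

First, there is no check that embed_dim is divisible by num_heads. When embed_dim is not divisible by num_heads, or when num_heads > embed_dim, the transpose_multihead operation fails: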
```
def transpose_multihead(self, x):
    new_shape = x.shape[:-1] + [self.num_heads, self.attn_head_size]
    x = x.reshape(new_shape)
    x = x.transpose([0, 2, 1, 3])
    return x
```
Second, attn_head_size is constrained by embed_dim and num_heads. When pretraining a model, attn_head_size cannot be set freely, so the code is not flexible enough.
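
The size mismatch behind the first problem can be seen with plain integer arithmetic, without Paddle. This is a minimal sketch: the helper names `derived_head_size` and `head_sizes_are_consistent` are hypothetical and not part of PaddleViT; only the expression `int(embed_dim / num_heads)` comes from the source.

```python
def derived_head_size(embed_dim: int, num_heads: int) -> int:
    """Mirror the original computation: int(embed_dim / num_heads)."""
    return int(embed_dim / num_heads)


def head_sizes_are_consistent(embed_dim: int, num_heads: int) -> bool:
    """Check that num_heads * head_size reproduces embed_dim exactly.

    Hypothetical helper for illustration; not part of PaddleViT.
    """
    head_size = derived_head_size(embed_dim, num_heads)
    return head_size > 0 and num_heads * head_size == embed_dim


for embed_dim, num_heads in [(96, 8), (100, 8), (4, 8)]:
    print(embed_dim, num_heads,
          derived_head_size(embed_dim, num_heads),
          head_sizes_are_consistent(embed_dim, num_heads))
# 96 8 12 True
# 100 8 12 False   (8 * 12 = 96, not 100)
# 4 8 0 False      (head size truncates to 0)
```

In the second and third cases the truncated head size no longer multiplies back to embed_dim, which is the situation the missing check would have caught.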

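The fix for both problems is to add an attn_head_size parameter to the __init__ method of Attention. This does not affect loading existing pretrained models, and it allows attn_head_size to be set freely during pretraining. Because attn_head_size is then independent of the input dimension embed_dim, there is also no need to verify that embed_dim is divisible by num_heads.

Both approaches exist in current mainstream frameworks.

The first approach, computing attn_head_size from the embed_dim and num_heads parameters, is used by:

PaddlePaddle: https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/nn/layer/transformer.py#L109
PyTorch: https://github.com/pytorch/pytorch/blob/master/torch/nn/modules/transformer.py
transformers: https://github.com/huggingface/transformers/blob/master/src/transformers/models/bert/modeling_bert.py#L226

The second approach, passing attn_head_size in as a parameter, is used by:

TensorFlow: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/keras/layers/multi_head_attention.py#L126
TensorFlow Addons: https://github.com/tensorflow/addons/blob/master/tensorflow_addons/layers/multihead_attention.py

I personally strongly recommend the second approach: the API is more flexible to use, and the code reads more naturally and makes more sense. Take, for example, the definition of all_head_size in the original implementation: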

self.all_head_size = self.attn_head_size * self.num_heads

Whenever embed_dim is divisible by num_heads, all_head_size == embed_dim, so there is no need to define it at all. The variable is only used in __init__:

        self.qkv = nn.Linear(embed_dim,
                             self.all_head_size*3,  # weights for q, k, and v
                             weight_attr=w_attr_1,
                             bias_attr=b_attr_1 if qkv_bias else False)

and in forward:

new_shape = z.shape[:-2] + [self.all_head_size]

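The output dimension self.all_head_size*3 of the qkv projection in __init__ can be changed to embed_dim*3. For the self.all_head_size used by new_shape in forward, the dimension of the input x can be read at the start of the method instead, as follows: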

embed_dim = x.shape[-1]
……
new_shape = z.shape[:-2] + [embed_dim]

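That covers my doubts about defining self.all_head_size in the source. There is also the question of whether the Linear layer added at the final output is necessary: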

        self.out = nn.Linear(embed_dim,
                             embed_dim,
                             weight_attr=w_attr_2,
                             bias_attr=b_attr_2)

In forward, there is a comment, reshape, on the line above the final linear projection of the output:

        z = z.reshape(new_shape)
        # reshape
        z = self.out(z)

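The intent is presumably to map the dimension back to the input dimension embed_dim, which makes the subsequent residual connection convenient. But since all_head_size == embed_dim, what is there to reshape? So I think the linear projection of the output is unnecessary here.

However, if we use the second approach and pass attn_head_size in as a parameter instead of computing it from embed_dim and num_heads, the code above reads much more naturally and makes much more sense. The second approach only requires changing a few lines of the source. The implementation is as follows: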

from typing import Tuple, Union

import paddle
import paddle.nn as nn
from paddle import ParamAttr
from paddle import Tensor

class Attention(nn.Layer):
    """ Attention module

    Attention module for ViT, here q, k, v are assumed the same.
    The qkv mappings are stored as one single param.

    Attributes:
        num_heads: number of heads
        attn_head_size: feature dim of single head
        all_head_size: feature dim of all heads
        qkv: a nn.Linear for q, k, v mapping
        scales: 1 / sqrt(single_head_feature_dim)
        out: projection of multi-head attention
        attn_dropout: dropout for attention
        proj_dropout: final dropout before output
        softmax: softmax op for attention
    """
    def __init__(self,
                 embed_dim: int,
                 num_heads: int,
                 attn_head_size: int,
                 qkv_bias: Union[bool, ParamAttr],
                 dropout: float = 0.,
                 attention_dropout: float = 0.):
        super().__init__()
        # Added an attn_head_size parameter: attn_head_size and num_heads are no longer
        # constrained by embed_dim, which makes the API more flexible to use.
        self.num_heads = num_heads
        # self.attn_head_size = int(embed_dim / self.num_heads)
        self.attn_head_size = attn_head_size
        self.all_head_size = self.attn_head_size * self.num_heads  # Attention Layer's hidden_size

        w_attr_1, b_attr_1 = self._init_weights()
        self.qkv = nn.Linear(embed_dim,
                             self.all_head_size*3,  # weights for q, k, and v
                             weight_attr=w_attr_1,
                             bias_attr=b_attr_1 if qkv_bias else False)

        self.scales = self.attn_head_size ** -0.5

        w_attr_2, b_attr_2 = self._init_weights()
        # self.out = nn.Linear(embed_dim,
        #                      embed_dim,
        #                      weight_attr=w_attr_2,
        #                      bias_attr=b_attr_2)
        # Aggregate the multi-head attention output and project it back to the input dimension embed_dim for the residual connection
        self.out = nn.Linear(self.all_head_size,
                             embed_dim,
                             weight_attr=w_attr_2,
                             bias_attr=b_attr_2)

        self.attn_dropout = nn.Dropout(attention_dropout)
        self.proj_dropout = nn.Dropout(dropout)
        self.softmax = nn.Softmax(axis=-1)

    def _init_weights(self) -> Tuple[ParamAttr, ParamAttr]:
        weight_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform())
        bias_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform())
        return weight_attr, bias_attr

    def transpose_multihead(self, x: Tensor) -> Tensor:
        new_shape = x.shape[:-1] + [self.num_heads, self.attn_head_size]
        x = x.reshape(new_shape)
        x = x.transpose([0, 2, 1, 3])
        return x

    def forward(self, x: Tensor) -> Tuple[Tensor, Tensor]:
        qkv = self.qkv(x).chunk(3, axis=-1)
        q, k, v = map(self.transpose_multihead, qkv)

        attn = paddle.matmul(q, k, transpose_y=True)
        attn = attn * self.scales
        attn = self.softmax(attn)
        attn_weights = attn
        attn = self.attn_dropout(attn)

        z = paddle.matmul(attn, v)
        z = z.transpose([0, 2, 1, 3])
        new_shape = z.shape[:-2] + [self.all_head_size]
        z = z.reshape(new_shape)
        # Aggregate the multi-head attention output and project it back to the input dimension embed_dim for the residual connection
        z = self.out(z)
        z = self.proj_dropout(z)
        return z, attn_weights

Test:

def main():
    t = paddle.randn([4, 16, 96])     # [batch_size, num_patches, embed_dim]
    print('input shape = ', t.shape)

    model = Attention(embed_dim=96,
                      num_heads=8,
                      attn_head_size=128,
                      qkv_bias=False,
                      dropout=0.,
                      attention_dropout=0.)

    print(model)

    out, attn_weights = model(t)
    print(out.shape)
    print(attn_weights.shape)

    for name, param in model.named_parameters():
        print(f'param name: {name},\tparam shape: {param.shape} ')

if __name__ == "__main__":
    main()

Output:

input shape =  [4, 16, 96]
Attention(
  (qkv): Linear(in_features=96, out_features=3072, dtype=float32)
  (out): Linear(in_features=1024, out_features=96, dtype=float32)
  (attn_dropout): Dropout(p=0.0, axis=None, mode=upscale_in_train)
  (proj_dropout): Dropout(p=0.0, axis=None, mode=upscale_in_train)
  (softmax): Softmax(axis=-1)
)
[4, 16, 96]
[4, 8, 16, 16]
param name: qkv.weight, param shape: [96, 3072] 
param name: out.weight, param shape: [1024, 96] 
param name: out.bias,   param shape: [96]
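
The shapes printed above follow directly from the constructor arguments. As a rough cross-check, here is a minimal sketch that recomputes them with plain Python; `expected_shapes` is a hypothetical helper, not part of PaddleViT, and it assumes the `[in_features, out_features]` weight layout shown in the output above.

```python
from typing import Dict, List


def expected_shapes(batch_size: int,
                    num_patches: int,
                    embed_dim: int,
                    num_heads: int,
                    attn_head_size: int,
                    qkv_bias: bool) -> Dict[str, List[int]]:
    """Recompute parameter and output shapes of the proposed Attention layer.

    Hypothetical helper for illustration; not part of PaddleViT.
    """
    all_head_size = num_heads * attn_head_size
    shapes = {
        # q, k and v projections are stored as one fused weight
        "qkv.weight": [embed_dim, 3 * all_head_size],
        # the output projection maps all heads back to embed_dim
        "out.weight": [all_head_size, embed_dim],
        "out.bias": [embed_dim],
        "output": [batch_size, num_patches, embed_dim],
        "attn_weights": [batch_size, num_heads, num_patches, num_patches],
    }
    if qkv_bias:
        shapes["qkv.bias"] = [3 * all_head_size]
    return shapes


# Same settings as the test above: embed_dim=96, num_heads=8, attn_head_size=128
for name, shape in expected_shapes(4, 16, 96, 8, 128, qkv_bias=False).items():
    print(name, shape)
```

With these settings all_head_size is 8 * 128 = 1024, so the fused qkv projection has 3 * 1024 = 3072 output features, matching the printed layer summary.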

These are just a few tentative suggestions of mine. I hope the maintainers will evaluate them and consider adopting them~

skpig commented 2 years ago

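Thank you very much for the careful feedback, @libertatis! I recently ran into exactly the same problem when using ViT: if embed_dim is changed, num_heads must be changed along with it, otherwise the Attention computation fails. Here are my own thoughts:

First, in the vast majority of multi-head attention implementations I have seen, embed_dim and all_head_size appear to be treated as equal by default (in the simple case that assumes all_head_size = d_k = d_v). So I think the main problem with this code is that, because the division result is truncated to an integer, embed_dim != all_head_size can occur, which then causes the later code to fail. https://github.com/BR-IDL/PaddleViT/blob/a20f3b7d43b38b7a777e3718067114fffda7075b/image_classification/ViT/transformer.py#L113-L114
In other words, as @libertatis said, in attention_head_size * num_heads = embed_dim only two of the three values can be assigned explicitly, and multiplication is clearly preferable to division. One point should not be overlooked, though: the vast majority of papers specify embed_dim and num_heads, not num_heads and attention_head_size. The concrete change is for @xperzy to decide.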

xperzy commented 2 years ago

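@libertatis Thank you for the detailed investigation and for opening this issue, and thanks to @skpig for joining the discussion. I think both of you are right. I think we can cover both cases: add head_size (the per-head dim) as an input parameter and allow it to be None. If it is None, we use embed_dim // num_heads plus an assert (similar to https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/nn/layer/transformer.py#L109); if it is not None, we compute with the value that was passed in. @libertatis Would you be interested in implementing this and opening a PR?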
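
The fallback rule described in this comment could look roughly like the sketch below. It is only an illustration under assumptions: the function name `resolve_head_size` and the error messages are hypothetical, and how the rule would be wired into `Attention.__init__` is left to the eventual PR.

```python
from typing import Optional


def resolve_head_size(embed_dim: int,
                      num_heads: int,
                      head_size: Optional[int] = None) -> int:
    """Return the per-head dimension to use.

    Hypothetical helper for illustration; not part of PaddleViT.
    If head_size is None, derive it as embed_dim // num_heads and assert
    that the division is exact; otherwise use the value passed in.
    """
    if num_heads <= 0:
        raise ValueError(f"num_heads must be positive, got {num_heads}")
    if head_size is None:
        assert embed_dim % num_heads == 0, (
            f"embed_dim ({embed_dim}) must be divisible by "
            f"num_heads ({num_heads}) when head_size is None")
        return embed_dim // num_heads
    if head_size <= 0:
        raise ValueError(f"head_size must be positive, got {head_size}")
    return head_size


print(resolve_head_size(96, 8))                  # derived: 12
print(resolve_head_size(96, 8, head_size=128))   # explicit: 128
```

Keeping None as the default preserves the behavior papers usually assume (embed_dim and num_heads given), while an explicit head_size decouples the per-head dimension from embed_dim.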

libertatis commented 2 years ago

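Sure ^_^. I have never opened a PR before and am not familiar with the process, so I will read through a tutorial first. Thank you for the reply, Teacher Zhu~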

BR-IDL / PaddleViT

Suggestion to add an attn_head_size parameter to ViT Transformer Attention #74