cmsflash / efficient-attention

An implementation of the efficient attention module.
https://arxiv.org/abs/1812.01243
MIT License

Why no activation function? #8

Closed micklexqg closed 1 year ago

micklexqg commented 1 year ago

In a Transformer, or in vanilla attention, there is usually an activation function, often GELU, but I cannot find one here. Why is that?

cmsflash commented 1 year ago

The efficient attention module is an attention module, not a Transformer module. An attention module typically does not have an activation function since it is already non-linear. In a Transformer module, the GELU activation is typically in the MLP block.

See:

* https://flax.readthedocs.io/en/latest/_modules/flax/linen/attention.html#MultiHeadDotProductAttention

* https://github.com/google-research/scenic/blob/abacf900de5c1873b7c0cfcde616ac9eb22693bc/scenic/projects/baselines/bert/layers.py#L197
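For reference, here is a minimal sketch of plain scaled dot-product attention (not this repo's efficient variant), just to show that the softmax already supplies the non-linearity, so no extra activation is needed inside the attention module:

```python
import torch
import torch.nn.functional as F

def dot_product_attention(q, k, v):
    # q, k, v: (batch, tokens, channels)
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    weights = F.softmax(scores, dim=-1)  # the only non-linearity; no GELU/ReLU inside
    return weights @ v

x = torch.randn(2, 16, 64)
out = dot_product_attention(x, x, x)  # same shape as x
```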

micklexqg commented 1 year ago

> The efficient attention module is an attention module, not a Transformer module. An attention module typically does not have an activation function since it is already non-linear. In a Transformer module, the GELU activation is typically in the MLP block.
>
> See:
>
> * https://flax.readthedocs.io/en/latest/_modules/flax/linen/attention.html#MultiHeadDotProductAttention
>
> * https://github.com/google-research/scenic/blob/abacf900de5c1873b7c0cfcde616ac9eb22693bc/scenic/projects/baselines/bert/layers.py#L197

Thanks! So is it OK to use the efficient attention module (without an activation) as a basic building block, just like convolution + activation in a CNN architecture? Can we drop the MLP block and use only attention modules when designing a network? I am new to Transformers and attention, so this question may be a bit naive.

cmsflash commented 1 year ago

Here is a way to think about it: the efficient attention module, or any non-local attention module, is analogous to a convolutional layer.

If you only need one layer to model long-range dependencies, then you can insert a standalone EA module where you want to do that modeling. However, if you want to build a backbone out of the EA module, then you have to append normalization layers (e.g. BatchNorm, GroupNorm, or LayerNorm) and activation layers (e.g. ReLU, GELU, or Swish) to the module and form repetitive structures like EA->BN->GELU. It is analogous to the Conv->BN->ReLU structure.
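For concreteness, here is a rough PyTorch sketch of such an EA->BN->GELU unit. The import path and the EfficientAttention constructor arguments below are assumptions based on this repo's efficient_attention.py; adjust them if the actual signature differs.

```python
import torch.nn as nn
from efficient_attention import EfficientAttention  # assumed import path

class EABlock(nn.Module):
    """One EA->BN->GELU unit, analogous to Conv->BN->ReLU."""

    def __init__(self, channels, head_count=8):
        super().__init__()
        # Constructor arguments assumed: (in_channels, key_channels, head_count, value_channels).
        self.attention = EfficientAttention(
            in_channels=channels,
            key_channels=channels,
            head_count=head_count,
            value_channels=channels,
        )
        self.norm = nn.BatchNorm2d(channels)  # EA operates on (N, C, H, W) feature maps
        self.act = nn.GELU()

    def forward(self, x):
        return self.act(self.norm(self.attention(x)))
```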

More commonly, you may adopt the bottleneck structure and do MLP->EA->MLP (each layer followed by norm/activation), just like the 1x1 conv->3x3 conv->1x1 conv bottleneck in ResNets. Note that a 1x1 conv is equivalent to an MLP applied per position, and the EA->MLP->MLP ordering in a Transformer is equivalent to MLP->EA->MLP if you shift where you draw the block boundaries.
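A sketch of that bottleneck variant, with 1x1 convolutions playing the role of the per-position MLPs; the EfficientAttention import path and constructor arguments are the same assumptions as above.

```python
import torch.nn as nn
from efficient_attention import EfficientAttention  # assumed import path

class EABottleneck(nn.Module):
    """MLP->EA->MLP bottleneck, mirroring the 1x1->3x3->1x1 ResNet bottleneck."""

    def __init__(self, channels, bottleneck_channels, head_count=8):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, bottleneck_channels, kernel_size=1),  # "MLP" in
            nn.BatchNorm2d(bottleneck_channels),
            nn.GELU(),
            EfficientAttention(
                in_channels=bottleneck_channels,
                key_channels=bottleneck_channels,
                head_count=head_count,
                value_channels=bottleneck_channels,
            ),
            nn.BatchNorm2d(bottleneck_channels),
            nn.GELU(),
            nn.Conv2d(bottleneck_channels, channels, kernel_size=1),  # "MLP" out
            nn.BatchNorm2d(channels),
            nn.GELU(),
        )

    def forward(self, x):
        return self.block(x)
```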

micklexqg commented 1 year ago

@cmsflash, Thanks!