Open flrngel opened 6 years ago
Convolutional Sequence to Sequence Learning
aka Fairseq
https://arxiv.org/pdf/1705.03122.pdf
3. A Convolutional Architecture
3.1. Position Embeddings
p for the learned absolute position vector
e for the input embedding (word embedding + p)
See also
"Positional encoding" from Attention is all you need
3.2. Convolutional Block Structure
(image from https://norman3.github.io/papers/docs/fairseq.html)
In the image above, the kernel width is 3 and the convolution block stack size is 1.
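A minimal sketch of one encoder-side block with GLU and a residual connection, assuming PyTorch; dimensions are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

dim, kernel_width = 256, 3

# The convolution emits 2*dim channels so the GLU can gate them back to dim.
conv = nn.Conv1d(dim, 2 * dim, kernel_width, padding=kernel_width // 2)

x = torch.randn(2, 20, dim)                  # (batch, seq_len, dim)
h = conv(x.transpose(1, 2)).transpose(1, 2)  # Conv1d expects (batch, dim, seq_len)
h = F.glu(h, dim=-1)                         # gated linear unit: A * sigmoid(B)

# Residual connection around the block (the sqrt(0.5) scaling is covered in 3.4).
y = x + h
```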
3.3. Multi-step Attention
Uses a residual connection from the target embedding g_i: the decoder state summary is d_i = W_d h_i + b_d + g_i.
Attention weights are the softmax over dot products of d_i and the encoder outputs z_j.
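A minimal sketch of one decoder attention step, assuming PyTorch; `proj` stands in for the per-layer W_d, b_d, and all shapes are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 256
proj = nn.Linear(dim, dim)  # W_d, b_d

h = torch.randn(2, 10, dim)  # decoder conv output for this layer
g = torch.randn(2, 10, dim)  # target embeddings (residual connection)
z = torch.randn(2, 20, dim)  # encoder outputs z_j
e = torch.randn(2, 20, dim)  # encoder input embeddings

# d_i = W_d h_i + b_d + g_i  (residual connection from g_i)
d = proj(h) + g

# Attention: softmax over dot products d_i . z_j.
scores = torch.bmm(d, z.transpose(1, 2))  # (batch, tgt_len, src_len)
attn = F.softmax(scores, dim=-1)

# Conditional input c_i = sum_j a_ij (z_j + e_j), added back to the decoder state.
c = torch.bmm(attn, z + e)
out = h + c
```

This runs once per decoder layer, which is what makes the attention "multi-step".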
3.4. Normalization Strategy
Sums of a block's input and output are scaled by √0.5 to halve the variance of the sum; this helps stabilize learning.
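A minimal sketch of that scaling, assuming PyTorch and that the two summands have roughly equal variance; shapes are illustrative:

```python
import math
import torch

x = torch.randn(2, 20, 256)          # residual input
block_out = torch.randn(2, 20, 256)  # block output (e.g. after GLU)

# Scaling the sum by sqrt(0.5) halves its variance, keeping activation
# magnitudes roughly constant through a deep stack of blocks.
y = (x + block_out) * math.sqrt(0.5)
```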