OpenNLPLab / cosFormer

[ICLR 2022] Official implementation of cosformer-attention in cosFormer: Rethinking Softmax in Attention
Apache License 2.0

Why does cosformer not work on an XL-based transformer architecture? #10

Open lwaekfjlk opened 2 years ago

lwaekfjlk commented 2 years ago

When I implement cosformer attention in the MultiHeadAttention of Transformer-XL and run it without the extra long-range memory, the ReLU feature map performs worse than ELU. I think this is because the attention and FF Net interact differently: the XL-like transformer uses a different layer-norm placement and residual connections. Why is the ReLU(Q)ReLU(K).T replacement for softmax not robust across different transformer architectures?
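
For reference, this is a minimal sketch of the ReLU-based linear attention I am describing (my own paraphrase, not the repository's code; it is the non-causal case, omits cosFormer's cos-based re-weighting, and assumes `(batch, heads, seq_len, head_dim)` tensors):

```python
import torch
import torch.nn.functional as F

def relu_linear_attention(q, k, v, eps=1e-6):
    """Softmax-free attention: replace softmax(QK^T) with ReLU(Q) ReLU(K)^T,
    computed in linear form so the sequence length is never squared.

    q, k: (batch, heads, seq_len, head_dim)
    v:    (batch, heads, seq_len, value_dim)
    """
    q = F.relu(q)
    k = F.relu(k)
    # Sum over positions first: (head_dim, value_dim) summary of K^T V.
    kv = torch.einsum("bhnd,bhne->bhde", k, v)
    # Row-wise normalizer: phi(q_i) . sum_j phi(k_j), guarded against division by zero.
    z = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + eps)
    # Numerator phi(q_i) (K^T V), scaled by the normalizer.
    return torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)
```

In this form, swapping ReLU for elu(x)+1 only changes the feature map, so I would expect the gap to come from how the surrounding layer norm and residual connections condition the Q/K activations rather than from the attention itself.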