OpenNLPLab / cosFormer

[ICLR 2022] Official implementation of cosformer-attention in cosFormer: Rethinking Softmax in Attention
Apache License 2.0

causal attention not working when q and kv are not of the same length #4

Closed: zero0kiriyu closed this issue 2 years ago

zero0kiriyu commented 2 years ago

Thank you for your great work! I am currently working on a seq2seq task, and I found that the causal attention code only works when src_len and tgt_len are the same. Also, I suggest adopting EPFL's causal linear attention CUDA code to improve the speed of causal attention.

Doraemonzzz commented 2 years ago

Thank you for your suggestion. For the first problem, I think that in causal attention src_len and tgt_len must be the same; this can be verified in the following code: https://github.com/idiap/fast-transformers/blob/master/fast_transformers/causal_product/causal_product_cpu.cpp#L79
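
To make the constraint concrete, here is a minimal sketch (not the repo's exact code) of cumsum-based causal linear attention in plain PyTorch. The prefix sums pair query step t with key steps 1..t, which is why the query and key sequences must have the same length.

```python
# Minimal sketch of causal linear attention via cumulative sums, assuming a
# non-negative feature map (e.g. ReLU) has already been applied to q and k.
import torch


def causal_linear_attention(q, k, v, eps=1e-6):
    """q, k: (batch, heads, seq_len, d_k); v: (batch, heads, seq_len, d_v)."""
    # Outer products k_t v_t^T for every time step, accumulated causally.
    kv = torch.einsum("bhnd,bhnm->bhndm", k, v)          # (b, h, n, d_k, d_v)
    kv_cum = kv.cumsum(dim=2)                            # prefix sums over t
    # Numerator: q_t applied to the prefix sum of kv up to step t.
    num = torch.einsum("bhnd,bhndm->bhnm", q, kv_cum)    # (b, h, n, d_v)
    # Denominator: q_t dotted with the prefix sum of keys up to step t.
    k_cum = k.cumsum(dim=2)                              # (b, h, n, d_k)
    den = torch.einsum("bhnd,bhnd->bhn", q, k_cum)       # (b, h, n)
    return num / (den.unsqueeze(-1) + eps)


# The cumulative sums align query step t with key steps 1..t, so a q of
# length m and a k/v of length n != m has no well-defined causal alignment.
q = torch.relu(torch.randn(2, 4, 16, 32))
k = torch.relu(torch.randn(2, 4, 16, 32))
v = torch.randn(2, 4, 16, 64)
out = causal_linear_attention(q, k, v)   # works: src_len == tgt_len == 16
```

Note that this sketch materializes the full (seq_len, d_k, d_v) prefix-sum tensor, which is memory-heavy for long sequences; dedicated kernels such as the one linked above avoid that.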

For the second problem, we have tested EPFL's causal linear attention, but it does not seem to be faster, so for now we use the torch version.
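
For reference, a rough way to run such a comparison is sketched below. This assumes a CUDA device and that the fast-transformers package (the EPFL code mentioned above) is installed with its compiled extensions; the shapes and iteration count are arbitrary, and both sides compute only the unnormalized causal numerator so the comparison is like-for-like.

```python
# Rough timing sketch: EPFL causal_dot_product kernel vs. a torch cumsum
# baseline. Assumes fast-transformers is installed and a GPU is available.
import time

import torch
from fast_transformers.causal_product import causal_dot_product


def causal_numerator_torch(q, k, v):
    # cumsum-based baseline: materializes the (d_k, d_v) prefix sums
    kv_cum = torch.einsum("bhnd,bhnm->bhndm", k, v).cumsum(dim=2)
    return torch.einsum("bhnd,bhndm->bhnm", q, kv_cum)


def bench(fn, iters=20):
    fn()                          # warm-up
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.time() - start) / iters


b, h, n, d = 4, 8, 1024, 64
q = torch.relu(torch.randn(b, h, n, d, device="cuda"))
k = torch.relu(torch.randn(b, h, n, d, device="cuda"))
v = torch.randn(b, h, n, d, device="cuda")

print("EPFL CUDA kernel:", bench(lambda: causal_dot_product(q, k, v)))
print("torch cumsum    :", bench(lambda: causal_numerator_torch(q, k, v)))
```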