Can RWKV beat Flash Attention?

BlinkDL / RWKV-LM

RWKV is an RNN with transformer-level LLM performance. It can be trained directly like a GPT (parallelizable), so it combines the best of RNNs and transformers: great performance, fast inference, low VRAM use, fast training, "infinite" ctx_len, and free sentence embeddings.

Apache License 2.0

11.99k stars, 825 forks

Can RWKV beat Flash Attention? #235

Status: Open. Opened by yxchng 2 months ago.

yxchng commented 2 months ago

I have been experimenting with RWKV v4 and v4neo, but they use much more memory (about 2x) than my LM that uses Flash Attention. I am not sure what I am doing wrong. Is this expected?

BlinkDL commented 2 months ago

Try v5 first. What's your model size, bsz, and ctxlen?