RWKV is an RNN with transformer-level LLM performance. It can be trained directly like a GPT (parallelizable), so it combines the best of RNNs and transformers: great performance, fast inference, low VRAM use, fast training, "infinite" ctx_len, and free sentence embedding.
https://github.com/BlinkDL/RWKV-LM/blob/666f64591e13c68ed6e602e957c5ca47b25750e3/RWKV-v5/cuda/wkv6state_cuda.cu#L15
This line is missing the batch offset; it should read:
_s += b*H*_N_*_N_ + h*_N_*_N_ + i*_N_;
Probably why this code didn't work for BPTT when we tried it a while back!
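To make the bug concrete, here is a minimal host-side sketch of the offset arithmetic, assuming the state tensor `_s` is laid out row-major as `[B, H, N, N]` (the `B`, `H`, `N` values below are illustrative, not taken from the kernel):

```python
# Illustrative dimensions (not the kernel's actual launch configuration).
B, H, N = 2, 4, 64

def offset_buggy(b, h, i):
    # As on the linked line: no batch term, so every batch
    # reads and writes batch 0's state block.
    return h * N * N + i * N

def offset_fixed(b, h, i):
    # Proposed fix: _s += b*H*_N_*_N_ + h*_N_*_N_ + i*_N_;
    return b * H * N * N + h * N * N + i * N

# The buggy offset is identical regardless of the batch index b:
assert offset_buggy(0, 3, 5) == offset_buggy(1, 3, 5)

# The fixed offset separates batches by a full H*N*N state block:
assert offset_fixed(1, 3, 5) - offset_fixed(0, 3, 5) == H * N * N
```

With the batch term missing, all batches alias the same state memory, which would corrupt the carried state exactly in the multi-batch/BPTT setting described above.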