RWKV is an RNN with transformer-level LLM performance. It can be trained directly like a GPT (parallelizable), combining the best of RNNs and transformers: great performance, fast inference, low VRAM usage, fast training, "infinite" ctx_len, and free sentence embeddings.
Currently, when training RWKV with DeepSpeed, there appears to be an issue where training hangs when DeepSpeed is activated with bf16.

Specifically around this line

This has been tested and found to be resolved in CUDA 12.2.
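For context, the hang is reported when bf16 is enabled in the DeepSpeed config. A minimal sketch of the relevant config section (values other than the `bf16` block are illustrative assumptions, not the project's actual settings):

```json
{
  "bf16": {
    "enabled": true
  },
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 1
}
```

If you hit the hang on an older CUDA toolkit, upgrading to CUDA 12.2 (as noted above) or temporarily disabling the `bf16` block are the workarounds suggested by this report.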