RWKV is an RNN with transformer-level LLM performance. It can be trained directly like a GPT (parallelizable), so it combines the best of RNNs and transformers: great performance, fast inference, low VRAM usage, fast training, "infinite" ctx_len, and free sentence embeddings.
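The RNN/GPT duality above can be sketched with a toy linear recurrence (this is a simplification, not the actual RWKV time-mixing formulas; the decay `w` and the inputs are made-up illustration values). Inference streams token by token with a constant-size state, while training can evaluate the same scan as a closed-form weighted sum over the whole sequence, which is what makes it parallelizable:

```python
# Toy EMA-style recurrence, NOT the real RWKV kernel: it only illustrates
# why RNN-mode inference needs O(1) state per channel (hence low VRAM and
# no hard ctx_len limit), while training can unroll the same scan in
# parallel like a GPT.
def rnn_step(state, x, w=0.9):
    # The state carries everything needed from the past -> constant memory.
    return w * state + (1.0 - w) * x

xs = [0.5, -1.0, 2.0, 0.25]  # a short made-up input sequence

# Sequential "inference" mode: one scalar state, streamed token by token.
state = 0.0
for x in xs:
    state = rnn_step(state, x)

# "Training" mode: the final state as a closed-form weighted sum, which
# is what lets the scan be computed in parallel over the whole sequence.
T, w = len(xs), 0.9
parallel = sum((1.0 - w) * w ** (T - 1 - t) * xs[t] for t in range(T))
assert abs(state - parallel) < 1e-12
```

The real model replaces this scalar decay with learned per-channel time-mixing and a nonlinear readout, but the same recurrence-vs-unrolled-sum equivalence is what gives both the fast RNN inference and the GPT-style parallel training.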
Recent research has shown that more pretraining data can lead to better performance. RWKV has only used the relatively small Pile dataset. Has there been any consideration of using the larger SlimPajama dataset for pretraining and fairly comparing with LLaMA and OpenLLaMA?