Hi.
According to https://github.com/BlinkDL/RWKV-LM/blob/main/RWKV-v4/src/model.py,
it seems that part of time_decay is initialized with positive values (up to 3.0), which would make distant past values more dominant.
Is this intended to help the model learn long-term dependencies?
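
For reference, here is a minimal sketch of the per-channel initialization I am referring to. The names `attn_sz` and `ratio_0_to_1` and the exact expression are my paraphrase of the code, so they may not match the file exactly:

```python
import torch
import torch.nn as nn

# Rough sketch of the time_decay init as I read it in RWKV-v4/src/model.py
# (my paraphrase; attn_sz = number of channels, ratio_0_to_1 ~ layer depth ratio).
attn_sz = 8
ratio_0_to_1 = 1.0  # e.g. deepest layer

decay_speed = torch.ones(attn_sz)
for h in range(attn_sz):
    # ramps from -5 for the first channel up to +3 for the last channel
    decay_speed[h] = -5 + 8 * (h / (attn_sz - 1)) ** (0.7 + 1.3 * ratio_0_to_1)

time_decay = nn.Parameter(decay_speed)
print(time_decay)  # the last few channels come out positive, up to 3.0
```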
I'm a bit confused because ALiBi (https://arxiv.org/pdf/2108.12409.pdf) uses only negative values for the position offset,
and the RWKV preprint also mentions that e^(w_{t,i}) <= 1 near the bottom of page 3.
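
For context, the constraint I have in mind from the preprint is roughly the following (my paraphrase of the notation, where w is the per-channel time_decay vector, so it may not match the paper exactly):

```latex
% My reading of the preprint's page-3 formulation (stated as an assumption):
\[
w_{t,i} = -(t - i)\, w, \qquad w \ge 0 \;\; \text{(elementwise)}
\quad\Longrightarrow\quad
e^{w_{t,i}} \le 1 .
\]
```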