RWKV is an RNN with transformer-level LLM performance. It can be directly trained like a GPT (parallelizable). So it's combining the best of RNN and transformer - great performance, fast inference, saves VRAM, fast training, "infinite" ctx_len, and free sentence embedding.
How to understand the `no` variable in the CUDA code? #234
```cuda
F no = max(o, u + k[ii]);
F A = exp(o - no);
F B = exp(u + k[ii] - no);
```
There are many parts of the code that use the `no` variable like this. I couldn't understand it, since it does not appear in the equations in the paper. Why subtract `no`? Can you kindly explain?