BlinkDL / RWKV-LM

RWKV is an RNN with transformer-level LLM performance. It can be directly trained like a GPT (parallelizable). So it combines the best of RNN and transformer: great performance, fast inference, low VRAM use, fast training, "infinite" ctx_len, and free sentence embedding.
Apache License 2.0

Question about the normalizer in cuda kernel #91

Closed · typoverflow closed this issue 1 year ago

typoverflow commented 1 year ago

Hi, I have one tiny question about the CUDA kernel. In the code, aa and bb are running sums. To avoid overflow, you factor out exp(-p) (with p the running maximum exponent) both when computing y[ii] and when updating aa/bb. I can see why this causes no problem when computing y[ii], since the common factor cancels in the ratio. However, I don't understand why it is safe to apply it when updating aa and bb, since the rescaling makes the stored values differ from the true sums, which seems to bias the update. Is there anything I've missed?

https://github.com/BlinkDL/RWKV-LM/blob/11b606f8725279e002744e104cfe1d2dff5a3d7f/RWKV-v4neo/cuda/wkv_cuda.cu#L29-L39
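
For context, here is a minimal CPU sketch (plain C++, not the actual kernel) of the recurrence in question, run side by side with a naive, overflow-prone version. In the rescaled variant, pp tracks the running maximum exponent and aa/bb are stored pre-multiplied by exp(-pp); the two printed outputs match, showing that the shared factor cancels in the ratio. The values of w, u, k, and v are made up for the demonstration:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    // Made-up per-channel inputs; w is the (negative) decay, u the "bonus".
    const std::vector<double> k = {1.0, 5.0, -2.0, 3.0};
    const std::vector<double> v = {0.5, -1.0, 2.0, 0.25};
    const double w = -0.1, u = 0.3;

    double A = 0, B = 0;                // naive true running sums
    double aa = 0, bb = 0, pp = -1e38;  // rescaled state, pp = running max exponent

    for (size_t i = 0; i < k.size(); i++) {
        const double kk = k[i], vv = v[i];

        // Naive output: ratio of true sums, plus the exp(u + k) bonus term.
        const double y_naive = (A + std::exp(u + kk) * vv) / (B + std::exp(u + kk));

        // Rescaled output: factor exp(-p) out of numerator and denominator;
        // it cancels in the ratio, so y_stable == y_naive.
        double ww = u + kk;
        double p  = std::max(pp, ww);
        double e1 = std::exp(pp - p);
        double e2 = std::exp(ww - p);
        const double y_stable = (e1 * aa + e2 * vv) / (e1 * bb + e2);

        // Naive state update: decay old sums, add current token.
        A = std::exp(w) * A + std::exp(kk) * vv;
        B = std::exp(w) * B + std::exp(kk);

        // Rescaled state update: the same quantities divided by exp(p),
        // so the invariant A == aa * exp(pp) is preserved exactly.
        ww = w + pp;
        p  = std::max(ww, kk);
        e1 = std::exp(ww - p);
        e2 = std::exp(kk - p);
        aa = e1 * aa + e2 * vv;
        bb = e1 * bb + e2;
        pp = p;

        std::printf("step %zu: naive y = %.12f  stable y = %.12f\n",
                    i, y_naive, y_stable);
    }
    return 0;
}
```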

BlinkDL commented 1 year ago

The gradient computation in kernel_backward() already takes this into account.
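
For later readers, here is one way to see why the rescaled update is exact rather than biased (a sketch using A_t and B_t to denote the true, unscaled running sums; this notation is mine, not the kernel's). The kernel never stores A_t directly; it stores the pair (aa, pp) with aa = e^{-pp} A, and every update preserves that invariant:

```math
A_t = e^{w} A_{t-1} + e^{k_t} v_t
    = e^{p_t}\left(e^{(w+p_{t-1})-p_t}\,a_{t-1} + e^{k_t-p_t}\,v_t\right)
    = e^{p_t}\,a_t,
\qquad p_t = \max(w+p_{t-1},\,k_t)
```

and identically B_t = e^{p_t} b_t. So (a_t, b_t, p_t) is an exact, overflow-safe reparameterization of (A_t, B_t), and y[ii] = A/B = aa/bb is unchanged (the u + k bonus term in y[ii] is handled by the same cancellation), which is presumably what kernel_backward() accounts for as well.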