BlinkDL / RWKV-LM

RWKV is an RNN with transformer-level LLM performance. It can be directly trained like a GPT (parallelizable). So it's combining the best of RNN and transformer - great performance, fast inference, saves VRAM, fast training, "infinite" ctx_len, and free sentence embedding.

Abnormal values in mixing coefficients of token shift #188

Open Triang-jyed-driung opened 1 year ago

Triang-jyed-driung commented 1 year ago

I posted this issue on Discord a week ago, but no one has replied yet, and I don't know exactly what is happening. The point is that some of the mixing coefficients in token shift are abnormally large. The RWKV paper says:

The token shift or time-shift mixing, or (diagonal arrows in Figure 3), also contributes to the model’s adaptation to sequential data. By linearly interpolating between the current input and the previous time step input, the model naturally aggregates and gates information in the input channels. 

which means that token shift is an interpolation (rather than an extrapolation) between the current token and the previous token, so the mixing coefficients should stay in [0, 1]. But some of the coefficients are abnormally large. These are from the RWKV-4-World-CHNtuned-0.1B model: [screenshots of the coefficient tensors omitted]. Some values go as high as 17 while others go as low as -17, yet in theory they are interpolation weights and should fall in [0, 1]. This behavior might eventually lead to gradient explosion, resulting in numerical instability.
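For reference, here is a minimal sketch of how such an inspection could be reproduced, assuming the checkpoint is a plain PyTorch state dict whose token-shift coefficients are stored under parameter names containing "time_mix" (as in the RWKV-4 training code; the file path and exact names here are assumptions, not from the thread):

```python
# Sketch: scan an RWKV-4 checkpoint for token-shift mixing coefficients
# that fall outside [0, 1]. The mix is applied as
#     xk = x * time_mix_k + x_shifted * (1 - time_mix_k)
# so values outside [0, 1] turn the interpolation into an extrapolation.
import torch

ckpt_path = "RWKV-4-World-CHNtuned-0.1B.pth"  # hypothetical local path
state = torch.load(ckpt_path, map_location="cpu")

for name, param in state.items():
    if "time_mix" in name:
        param = param.float()
        lo, hi = param.min().item(), param.max().item()
        if lo < 0.0 or hi > 1.0:
            print(f"{name}: min={lo:.3f} max={hi:.3f}  <-- outside [0, 1]")
```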

Also, I noticed that this token-shift trick is not commonly seen in other models such as LSTM or GPT. Is it another of Bo Peng's inventions?

BlinkDL commented 12 months ago

Hi, yes, TokenShift was invented by me.

Values larger than 1 can work as a "sharpening filter". No, it won't cause numerical instability.
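One possible reading of "sharpening filter" (an editorial sketch, not the author's explanation): since the mix is mu * x_t + (1 - mu) * x_prev, it can be rewritten as x_t + (mu - 1) * (x_t - x_prev), so a coefficient above 1 amplifies the change between consecutive tokens instead of averaging them, much like unsharp masking in image processing:

```python
# Sketch: effect of the mixing coefficient mu on the token-shift output.
#     mixed = mu * x_t + (1 - mu) * x_prev
#           = x_t + (mu - 1) * (x_t - x_prev)
# mu in (0, 1) interpolates; mu > 1 extrapolates away from x_prev.
import torch

x_prev = torch.tensor([1.0, 1.0, 1.0])
x_t    = torch.tensor([1.0, 2.0, 0.5])

for mu in (0.5, 1.0, 2.0):
    mixed = mu * x_t + (1 - mu) * x_prev
    print(f"mu={mu}: {mixed.tolist()}")
# mu=0.5 averages the two tokens, mu=1.0 passes x_t through unchanged,
# mu=2.0 pushes the output further from x_prev ("sharpening" the change).
```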

VatsaDev commented 11 months ago

What do you mean by "sharpening filter"? What does that mean for the inputs?

Sh1n1ma commented 6 months ago
[screenshot: training loss curve]

I suspect I've encountered a similar issue during training, but it requires further investigation. Above is my training loss curve. (Note: I simply replaced the transformer block in the MAE task with a VisionRWKV block.)