BlinkDL / RWKV-LM

RWKV is an RNN with transformer-level LLM performance. It can be directly trained like a GPT (parallelizable). So it's combining the best of RNN and transformer - great performance, fast inference, saves VRAM, fast training, "infinite" ctx_len, and free sentence embedding.

Abnormal values in mixing coefficients of token shift #188

Open Triang-jyed-driung opened 1 year ago

Triang-jyed-driung commented 1 year ago

I posted this issue on Discord a week ago, but no one has replied yet, and I don't know exactly what is happening. The point is that some of the mixing coefficients in token shift are abnormally large. The RWKV paper says:

The token shift or time-shift mixing, or (diagonal arrows in Figure 3), also contributes to the model’s adaptation to sequential data. By linearly interpolating between the current input and the previous time step input, the model naturally aggregates and gates information in the input channels. 

which means that token shift is an interpolation (rather than an extrapolation) between the current token and the previous token, so the mixing coefficients should stay in [0, 1]. But some of the coefficients are abnormally large. These are from the RWKV-4-World-CHNtuned-0.1B model: [screenshots of the coefficient tensors omitted]. Some values go as high as 17 while others go as low as -17, yet in theory they are interpolation weights and should fall in [0, 1]. This behavior might eventually lead to gradient explosion, resulting in numerical instability.
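For reference, here is a minimal sketch of how such an inspection could be reproduced, assuming the checkpoint is a plain PyTorch state dict whose token-shift coefficients are stored under parameter names containing "time_mix" (as in the RWKV-4 training code; the file path and exact names here are assumptions, not from the thread):

```python
# Sketch: scan an RWKV-4 checkpoint for token-shift mixing coefficients
# that fall outside [0, 1]. The mix is applied as
#     xk = x * time_mix_k + x_shifted * (1 - time_mix_k)
# so values outside [0, 1] turn the interpolation into an extrapolation.
import torch

ckpt_path = "RWKV-4-World-CHNtuned-0.1B.pth"  # hypothetical local path
state = torch.load(ckpt_path, map_location="cpu")

for name, param in state.items():
    if "time_mix" in name:
        param = param.float()
        lo, hi = param.min().item(), param.max().item()
        if lo < 0.0 or hi > 1.0:
            print(f"{name}: min={lo:.3f} max={hi:.3f}  <-- outside [0, 1]")
```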

Also, I noticed that this token-shift trick is not commonly seen in other models such as LSTM or GPT. Is it another of Bo Peng's inventions?

BlinkDL commented 12 months ago

Hi, yes, TokenShift was invented by me.

Values larger than 1 can work as a "sharpening filter". No, it won't cause numerical instability.
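One possible reading of "sharpening filter" (an editorial sketch, not the author's explanation): since the mix is mu * x_t + (1 - mu) * x_prev, it can be rewritten as x_t + (mu - 1) * (x_t - x_prev), so a coefficient above 1 amplifies the change between consecutive tokens instead of averaging them, much like unsharp masking in image processing:

```python
# Sketch: effect of the mixing coefficient mu on the token-shift output.
#     mixed = mu * x_t + (1 - mu) * x_prev
#           = x_t + (mu - 1) * (x_t - x_prev)
# mu in (0, 1) interpolates; mu > 1 extrapolates away from x_prev.
import torch

x_prev = torch.tensor([1.0, 1.0, 1.0])
x_t    = torch.tensor([1.0, 2.0, 0.5])

for mu in (0.5, 1.0, 2.0):
    mixed = mu * x_t + (1 - mu) * x_prev
    print(f"mu={mu}: {mixed.tolist()}")
# mu=0.5 averages the two tokens, mu=1.0 passes x_t through unchanged,
# mu=2.0 pushes the output further from x_prev ("sharpening" the change).
```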

VatsaDev commented 11 months ago

What do you mean by "sharpening filter"? What does that mean for the inputs?

Sh1n1ma commented 6 months ago
[screenshot: training loss curve]

I suspect I've encountered a similar issue during training, but it requires further investigation. Above is my training loss curve. (Note: I simply replaced the transformer block in the MAE task with a VisionRWKV block.)