Open Triang-jyed-driung opened 1 year ago
Hi, yes, TokenShift was invented by me.
Values larger than 1 can act as a "sharpen filter". No, it won't cause numerical instability.
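For intuition, here is a minimal NumPy sketch (assuming the usual `mu * x_t + (1 - mu) * x_{t-1}` mixing, with a zero-padded first token) showing why `mu` in [0, 1] interpolates while `mu > 1` extrapolates, amplifying the change between consecutive tokens like a sharpen filter:

```python
import numpy as np

def token_shift(x, mu):
    # Mix each token with its predecessor: mu * x_t + (1 - mu) * x_{t-1}.
    # The first token is mixed with zeros, as there is no predecessor.
    x_prev = np.concatenate([np.zeros((1, x.shape[1])), x[:-1]], axis=0)
    return mu * x + (1 - mu) * x_prev

x = np.array([[1.0], [2.0]])  # two tokens, one channel

# mu in [0, 1]: interpolation; output lies between x_{t-1} and x_t.
interp = token_shift(x, 0.25)  # second token: 0.25*2 + 0.75*1 = 1.25

# mu > 1: extrapolation; the step from x_{t-1} to x_t is amplified,
# pushing the output past x_t ("sharpening" the change).
extrap = token_shift(x, 1.5)   # second token: 1.5*2 - 0.5*1 = 2.5
```

With `mu = 1.5` the second token becomes `x_t + 0.5 * (x_t - x_{t-1})`, i.e. the current token plus an amplified difference term, which is the same idea as unsharp masking in image processing.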
What do you mean by "sharpen filter"? What does that mean for the inputs?
I suspect I've encountered a similar issue during training, but it requires further investigation. Above is my training loss. (Note: I simply replaced the transformer block in the MAE task with a VisionRWKV block.)
I posted this issue on Discord a week ago, but no one has replied yet, and I don't know exactly what is happening. The point is that some mixing coefficients in token shift are abnormally large. The RWKV paper defines token shift as, e.g., `k_t = W_k * (mu_k * x_t + (1 - mu_k) * x_{t-1})`,
which means that token shift is an interpolation (rather than an extrapolation) between the current token and the previous token, so the mixing coefficients should stay in [0, 1]. But some of the coefficients are abnormally large. This is from the RWKV-4-World-CHNtuned-0.1B model: some values go as high as 17, while some go as low as -17, even though theoretically they are interpolation weights and should fall in [0, 1]. This behavior might eventually lead to gradient explosion, resulting in numerical instability.
Also, I noticed that this token shift trick is not commonly seen in other models, such as LSTM or GPT. Is it another of Bo Peng's inventions?