RWKV is an RNN with transformer-level LLM performance. It can be directly trained like a GPT (parallelizable), so it combines the best of RNNs and transformers: great performance, fast inference, VRAM savings, fast training, "infinite" ctx_len, and free sentence embedding.
Zero-division error when args.n_layer = 1, caused by ratio_0_to_1. Can I set ratio_0_to_1 = 0 when n_layer = 1? #243
Can you intuitively explain what ratio_0_to_1 is doing in RWKV_Tmix_x060? https://github.com/BlinkDL/RWKV-LM/blob/main/RWKV-v5/src/model.py#L290

I find that ratio_0_to_1 is defined by: ratio_0_to_1 = layer_id / (args.n_layer - 1)

It is then used to define multiple things for time_mix and time_decay.

However, my issue is that I want to set args.n_layer = 1, which leads to a zero-division error. Does it make sense to hardcode ratio_0_to_1 = 0 when args.n_layer = 1?
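For illustration, here is a minimal sketch of the ramp and the proposed guard. The guard (falling back to 0.0 when there is only one layer) is my assumption of a possible fix, not code from the repository:

```python
# Sketch of ratio_0_to_1 and a guarded variant (assumption, not the repo's code).
# ratio_0_to_1 ramps linearly from 0.0 (first layer) to 1.0 (last layer) and is
# used to interpolate per-layer initializations for time_mix and time_decay.

def ratio_0_to_1(layer_id: int, n_layer: int) -> float:
    if n_layer > 1:
        return layer_id / (n_layer - 1)  # 0.0 ... 1.0 across layers
    return 0.0  # single layer: treat it as the "first" layer to avoid /0

for n_layer in (1, 4):
    print([ratio_0_to_1(i, n_layer) for i in range(n_layer)])
    # n_layer=1 -> [0.0]; n_layer=4 -> [0.0, 0.333..., 0.666..., 1.0]
```

With a single layer the ramp is degenerate, so any constant in [0, 1] would be consistent with the formula; 0.0 simply picks the first-layer initialization.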