Hi.
According to https://github.com/BlinkDL/RWKV-LM/blob/main/RWKV-v4/src/model.py,
it seems that part of time_decay is initialized with positive values (up to 3.0), which would make distant past values more dominant.
Is this intended to help the model learn long-term dependencies?
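
For reference, here is a minimal sketch of the per-channel initialization I am referring to. The names `attn_sz` and `ratio_0_to_1` and the exact expression are my paraphrase of the code, so they may not match the file exactly:

```python
import torch
import torch.nn as nn

# Rough sketch of the time_decay init as I read it in RWKV-v4/src/model.py
# (my paraphrase; attn_sz = number of channels, ratio_0_to_1 ~ layer depth ratio).
attn_sz = 8
ratio_0_to_1 = 1.0  # e.g. deepest layer

decay_speed = torch.ones(attn_sz)
for h in range(attn_sz):
    # ramps from -5 for the first channel up to +3 for the last channel
    decay_speed[h] = -5 + 8 * (h / (attn_sz - 1)) ** (0.7 + 1.3 * ratio_0_to_1)

time_decay = nn.Parameter(decay_speed)
print(time_decay)  # the last few channels come out positive, up to 3.0
```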
I'm a bit confused because ALiBi (https://arxiv.org/pdf/2108.12409.pdf) uses only negative values for the position offset,
and the RWKV preprint also mentions that e^(w_{t,i}) <= 1 near the bottom of page 3.
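
For context, the constraint I have in mind from the preprint is roughly the following (my paraphrase of the notation, where w is the per-channel time_decay vector, so it may not match the paper exactly):

```latex
% My reading of the preprint's page-3 formulation (stated as an assumption):
\[
w_{t,i} = -(t - i)\, w, \qquad w \ge 0 \;\; \text{(elementwise)}
\quad\Longrightarrow\quad
e^{w_{t,i}} \le 1 .
\]
```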