
RWKV: Reinventing RNNs for the Transformer Era #43

Open 5g4s opened 10 months ago

5g4s commented 10 months ago

https://arxiv.org/abs/2305.13048

5g4s commented 10 months ago

The RWKV architecture consists of a series of stacked residual blocks, each formed by a time-mixing and a channel-mixing sub-block with recurrent structure.
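
A minimal PyTorch-style sketch of this stacked residual structure, to make the block layout concrete. The sub-block bodies here are simplified stand-ins (assumptions for illustration), not the paper's actual WKV and channel-mix formulas.

```python
import torch
import torch.nn as nn

class TimeMix(nn.Module):
    """Placeholder for the attention-like time-mixing sub-block."""
    def __init__(self, d):
        super().__init__()
        self.proj = nn.Linear(d, d, bias=False)
    def forward(self, x):  # x: (batch, time, d)
        return self.proj(x)

class ChannelMix(nn.Module):
    """Placeholder for the FFN-like channel-mixing sub-block."""
    def __init__(self, d):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))
    def forward(self, x):
        return self.ff(x)

class Block(nn.Module):
    """One residual block: time-mixing then channel-mixing, each with a pre-LayerNorm residual path."""
    def __init__(self, d):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.tmix, self.cmix = TimeMix(d), ChannelMix(d)
    def forward(self, x):
        x = x + self.tmix(self.ln1(x))
        x = x + self.cmix(self.ln2(x))
        return x

# Stack the residual blocks, e.g. a 4-layer model of width 64.
model = nn.Sequential(*[Block(64) for _ in range(4)])
out = model(torch.randn(2, 16, 64))  # (batch=2, seq=16, d=64)
```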

5g4s commented 10 months ago

The recurrence is formulated as a linear interpolation between the current input and the input at the previous time step, which can be adjusted independently for every linear projection of the input embedding (R, K, V in time-mixing; R and K in channel-mixing). As a result, RWKV can be parallelized along the time dimension during training.

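A rough sketch of this token-shift interpolation, assuming one learned per-channel mixing coefficient per projection (the names `mu_r`, `mu_k`, `mu_v` are illustrative, not taken from a specific codebase). Because obtaining the previous time step is just a slice along the time axis, all positions can be computed at once, which is why training parallelizes over the sequence.

```python
import torch
import torch.nn as nn

class TokenShiftMix(nn.Module):
    def __init__(self, d: int):
        super().__init__()
        # One interpolation weight per channel, adjusted independently per projection.
        self.mu_r = nn.Parameter(torch.rand(d))
        self.mu_k = nn.Parameter(torch.rand(d))
        self.mu_v = nn.Parameter(torch.rand(d))
        self.Wr = nn.Linear(d, d, bias=False)
        self.Wk = nn.Linear(d, d, bias=False)
        self.Wv = nn.Linear(d, d, bias=False)

    def forward(self, x):  # x: (batch, time, d)
        # x_{t-1}: shift the sequence right by one step, zero-padding at t=0.
        x_prev = torch.cat([torch.zeros_like(x[:, :1]), x[:, :-1]], dim=1)
        r = self.Wr(self.mu_r * x + (1 - self.mu_r) * x_prev)
        k = self.Wk(self.mu_k * x + (1 - self.mu_k) * x_prev)
        v = self.Wv(self.mu_v * x + (1 - self.mu_v) * x_prev)
        return r, k, v

rkv = TokenShiftMix(64)
r, k, v = rkv(torch.randn(2, 16, 64))  # all time steps computed in parallel
```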