Open 5g4s opened 10 months ago
The RWKV architecture is composed of a series of stacked residual blocks, each formed by a time-mixing and a channel-mixing sub-block with recurrent structure.
The recurrence is formulated as a linear interpolation between the current input and the input at the previous time step, and this interpolation can be adjusted independently for every linear projection of the input embedding (R, K, V in time-mixing; R, K in channel-mixing). This formulation allows RWKV to be parallelized along the time dimension during training.
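The per-projection interpolation ("token shift") can be sketched as follows. This is a minimal illustration, not RWKV's actual implementation: the function name `token_shift` and the NumPy setup are mine, and `mu` stands in for the learned per-channel mixing weight (e.g. the coefficient used before the R, K, or V projection).

```python
import numpy as np

def token_shift(x, mu):
    """Linear interpolation between each input and the previous time step.

    x:  (T, d) array, one embedding per time step.
    mu: (d,) per-channel mixing weight in [0, 1]; each projection (R, K, V)
        would use its own mu in the actual architecture.
    """
    # Shift the sequence by one step; the first position has no predecessor,
    # so it mixes with a zero vector.
    x_prev = np.concatenate([np.zeros((1, x.shape[1])), x[:-1]], axis=0)
    return mu * x + (1.0 - mu) * x_prev
```

Because `x_prev` is just a shifted copy of the input, this mixing is computed for all time steps at once, which is what makes training parallelizable along the time dimension.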
https://arxiv.org/abs/2305.13048