RWKV is an RNN with transformer-level LLM performance. It can be directly trained like a GPT (parallelizable). So it combines the best of RNNs and transformers: great performance, fast inference, low VRAM use, fast training, "infinite" ctx_len, and free sentence embedding.
Apache License 2.0
Is token padding or attention_mask needed and supported for RWKV? #154
I read the RWKV code and found that there are no padding IDs for the input.
How should the RWKV model handle padding tokens during training?
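For context, here is a minimal sketch of one common workaround (not taken from the RWKV codebase): right-pad each sequence in a batch and set the padded label positions to -100 so PyTorch's `cross_entropy` ignores them, since RWKV takes no `attention_mask`. The `PAD_ID` value and the `model(idx)` call are assumptions for illustration only.

```python
# Sketch: right-pad sequences and mask padded positions out of the loss.
# PAD_ID and the model(idx) signature are assumptions, not RWKV's actual API.
import torch
import torch.nn.functional as F

PAD_ID = 0  # assumed padding token id

def make_batch(seqs, pad_id=PAD_ID):
    """Right-pad lists of token ids to a common length; labels are -100 at pad positions."""
    max_len = max(len(s) for s in seqs)
    idx = torch.full((len(seqs), max_len), pad_id, dtype=torch.long)
    labels = torch.full((len(seqs), max_len), -100, dtype=torch.long)  # -100 = ignored by cross_entropy
    for i, s in enumerate(seqs):
        t = torch.tensor(s, dtype=torch.long)
        idx[i, : len(s)] = t
        labels[i, : len(s) - 1] = t[1:]  # next-token targets for the real tokens only
    return idx, labels

def loss_fn(logits, labels):
    """Cross-entropy over all positions; padded targets (-100) contribute nothing."""
    return F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1), ignore_index=-100)

# Usage (hypothetical model call):
# logits = model(idx)          # (batch, seq_len, vocab)
# loss = loss_fn(logits, labels)
```

Note that this only removes padding from the loss; because RWKV is recurrent, its state still processes the pad tokens, which is why some users prefer packing sequences to a fixed length instead of padding.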