BlinkDL / RWKV-LM

RWKV is an RNN with transformer-level LLM performance. It can be directly trained like a GPT (parallelizable), so it combines the best of RNNs and transformers: great performance, fast inference, low VRAM use, fast training, "infinite" ctx_len, and free sentence embeddings.
Apache License 2.0

Finetuning RWKV-5-World-1B5-v2 model #225


ArchanaNarayanan843 commented 7 months ago

How do I finetune the RWKV-5-World-1B5-v2 model?

BlinkDL commented 7 months ago

--n_layer 32 --n_embd 2560 for 3B
--n_layer 24 --n_embd 2048 for 1.5B
--n_layer 24 --n_embd 1024 for 0.4B
--n_layer 12 --n_embd 768 for 0.1B

For finetuning, when your batch size is very small, I suggest a learning rate of 1e-5 for 3B, 1.5e-5 for 1.5B, 2e-5 for 0.4B, and 3e-5 for 0.1B.
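
For concreteness, here is a minimal sketch of what a 1.5B finetuning run could look like. The flag names follow this repo's demo training scripts; the checkpoint filename, data path, and the ctx_len/epoch/batch settings are placeholders to adjust for your own setup, not tested values:

```bash
# Sketch: finetune RWKV-5-World-1B5-v2; run from the repo's training directory.
# --load_model: base checkpoint to finetune (placeholder filename).
# --data_file / --data_type: your tokenized dataset in binidx format (placeholder path).
# --vocab_size 65536: the RWKV World tokenizer vocabulary.
# --n_layer 24 --n_embd 2048: dimensions for the 1.5B model, per the table above.
# --lr_init / --lr_final: 1.5e-5, the suggested small-batch finetuning rate for 1.5B.
python train.py \
  --load_model "RWKV-5-World-1B5-v2.pth" \
  --proj_dir "out-finetune" \
  --data_file "data/my_dataset" \
  --data_type "binidx" \
  --vocab_size 65536 \
  --ctx_len 4096 \
  --epoch_steps 1000 --epoch_count 10 --epoch_begin 0 --epoch_save 1 \
  --micro_bsz 4 \
  --n_layer 24 --n_embd 2048 \
  --lr_init 1.5e-5 --lr_final 1.5e-5 --warmup_steps 10 \
  --beta1 0.9 --beta2 0.999 --adam_eps 1e-8 \
  --accelerator gpu --devices 1 --precision bf16 \
  --strategy deepspeed_stage_2 --grad_cp 1
```

Note that --n_layer and --n_embd must match the checkpoint you load (24 layers, 2048 embedding for 1.5B); mismatched dimensions will fail at load time.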