RWKV / rwkv.cpp

INT4/INT5/INT8 and FP16 inference on CPU for RWKV language model
MIT License

Training, in .cpp, on one machine? #84

Open SCRIER-org opened 1 year ago

SCRIER-org commented 1 year ago

This is a really great package. I'm not yet understanding the training mathematics, however. In order to get a system that integrates with legacy C++ code and runs fast, ideally faster than a python bridge, how easy would it be to slap together a baby trainer demo? Something similar to the (pick one) character-based / word-based tiny-shakespeare / OpenWebText training examples in https://github.com/karpathy/nanoGPT/tree/master/data? This would be really useful and great if it could be done.

saharNooby commented 1 year ago

ggml technically supports training, and it may be possible to support it in rwkv.cpp. Implementing it would require a sequence mode implementation, which is not done yet. I think that without sequence mode (that is, using naive RNN mode) training would be too slow.

I myself have no such plans, but contributions are welcome.

SCRIER-org commented 1 year ago

Thanks for your reply.

Confusingly, the paper claims it's able to train in "time-parallel mode" (see sec. 4.2), but then mentions it needs a "serial scan" to update the attention scores wkv. Unclear how that works, then.

The abstract claims the model can be "formulated as a Transformer" "which parallelizes computations during training". I interpret this as perhaps meaning that you train the model using a standard Transformer trainer, then run inference afterward using the RWKV RNN system. Evidence for this: the v4/verify.py script invokes both RWKV_RNN and RWKV_GPT (perhaps on the same saved model?), and the v4/trainer.py run() routine calls the GPT(GPTConfig) model to train. I'm not seeing any serial scans in the example trainer routines. No real method seems laid out; it looks like it just sets up pytorch_lightning.Trainer and invokes .fit. Not making much sense yet.

Question: Does RWKV use the same numeric model parameters as a version trained with a GPT backprop, simply modifying how they're used, or does it require its own RNN version of training?

How do you find the running speed of your .cpp version of RWKV inference, vs. the python version?

Thank you again.

LoganDark commented 1 year ago

The abstract claims the model can be "formulated as a Transformer" "which parallelizes computations during training". I interpret this as perhaps meaning that you train the model using a standard Transformer trainer, then run inference afterward using the RWKV RNN system

That is correct.

RWKV is trained as a transformer model, and then run as an RNN.

Question: Does RWKV use the same numeric model parameters as a version trained with a GPT backprop, simply modifying how they're used, or does it require its own RNN version of training?

GPT mode vs RNN mode are simply two different ways of running inference with the same parameters. That is, every RWKV RNN can be run as a GPT and vice versa. It just depends on which algorithm you run on the weights.
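To make this concrete, here is a minimal illustrative sketch (not code from this repo) of a simplified single-channel RWKV-v4 wkv attention, computed two ways from the same learned parameters `w` (decay) and `u` (current-token bonus): once as a serial RNN scan carrying two state scalars, and once as a per-token weighted sum over the past. The function names are made up for illustration, and the numerical-stability max-trick used in real implementations is omitted; both paths produce identical outputs.

```python
import numpy as np

def wkv_rnn(w, u, k, v):
    """RNN mode: serial scan, one token at a time, carrying two state scalars."""
    num, den, out = 0.0, 0.0, []
    for kt, vt in zip(k, v):
        # output mixes accumulated history with the current token (bonus u)
        out.append((num + np.exp(u + kt) * vt) / (den + np.exp(u + kt)))
        # decay the history by exp(-w), then fold in the current token
        num = np.exp(-w) * num + np.exp(kt) * vt
        den = np.exp(-w) * den + np.exp(kt)
    return np.array(out)

def wkv_seq(w, u, k, v):
    """Sequence mode: each output is an independent weighted sum over tokens
    0..t, so all positions can be computed in parallel during training."""
    T, out = len(k), []
    for t in range(T):
        # past tokens decay exponentially with distance; current token gets bonus u
        wts = np.array([np.exp(-(t - 1 - i) * w + k[i]) for i in range(t)]
                       + [np.exp(u + k[t])])
        out.append(wts @ np.append(v[:t], v[t]) / wts.sum())
    return np.array(out)

rng = np.random.default_rng(0)
w, u = 0.5, 0.3
k, v = rng.normal(size=8), rng.normal(size=8)
assert np.allclose(wkv_rnn(w, u, k, v), wkv_seq(w, u, k, v))
```

The serial form needs only O(1) state per channel (good for inference); the sequence form has no dependence between output positions, which is what makes Transformer-style parallel training possible.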

How do you find the running speed of your .cpp version of RWKV inference, vs. the python version?

The home page includes "token latency", i.e. time in ms per token. I can't see any comparison with Python, though.

saharNooby commented 1 year ago

How do you find the running speed of your .cpp version of RWKV inference, vs. the python version?

rwkv.cpp FP32 (the slowest mode, no reason to use it over FP16) runs at roughly the same speed as PyTorch on CPU. Under the hood, PyTorch does all computation on CPU in float32.

Note however that PyTorch on GPU will be significantly faster, given that you can fit the whole model in VRAM.

SCRIER-org commented 1 year ago

@LoganDark, @saharNooby Wow. Very useful. Then I should be able to use a standard trainer, such as the baby-llama one, I hope.

How do you reconcile this with the perception that "it needs a serial scan to update attention scores wkv" (4.2), also training "would require sequence mode implementation"? Maybe the serial scan is only for running inference, and not for training, so this statement only applies to RNN inference operation?

Please excuse my ignorance; I'm still in the process of wrapping my head around the deep dive.

LoganDark commented 1 year ago

How do you reconcile this with the perception that "it needs a serial scan to update attention scores wkv" (4.2)

This is for RNN mode. The RNN is run one token at a time, so it needs a serial scan to update its state.

also training "would require sequence mode implementation"?

Sequence mode is the non-RNN implementation of RWKV that is currently used for training (it is also sometimes called "transformer mode" or "GPT mode"). rwkv.cpp does not currently implement it, but might soon.
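As a rough illustration of why sequence mode suits training (this is a hypothetical sketch, not rwkv.cpp or official RWKV code): all T wkv outputs for a prompt can be produced at once from a T×T lower-triangular decay mask, instead of stepping an RNN state T times. This is a simplified single-channel version with no numerical-stability tricks; `w` and `u` are the learned decay and current-token-bonus parameters.

```python
import numpy as np

def wkv_sequence(w, u, k, v):
    """Compute all wkv outputs for a length-T sequence in one masked pass."""
    T = len(k)
    t = np.arange(T)
    # M[t, i] = log-weight that token i contributes to output position t
    M = -(t[:, None] - 1 - t[None, :]) * w + k[None, :]  # past tokens: decayed by distance
    M[t, t] = u + k                                      # current token: bonus u instead
    M[t[:, None] < t[None, :]] = -np.inf                 # mask out future tokens
    W = np.exp(M)                                        # exp(-inf) -> 0, so the mask holds
    return (W @ v) / W.sum(axis=1)
```

Because every row of the mask is independent, this maps to large batched matrix ops on GPU, which is what "parallelizes computations during training" refers to; the serial scan is only needed when generating one token at a time.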

SCRIER-org commented 1 year ago

@LoganDark Useful. Thank you.