LostRuins / koboldcpp

Run GGUF models easily with a KoboldAI UI. One File. Zero Install.
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

RWKV question #94

Closed: Enferlain closed this issue 1 year ago

Enferlain commented 1 year ago

Am I doing this correctly?

D:\textgen\kobold>.\koboldcpp.exe --useclblast 0 0 --smartcontext
Welcome to KoboldCpp - Version 1.10
Attempting to use CLBlast library for faster prompt ingestion. A compatible clblast.dll will be required.
Initializing dynamic library: koboldcpp_clblast.dll
For command line arguments, please refer to --help
Otherwise, please manually select ggml file:
Loading model: D:\textgen\oobabooga-windows\text-generation-webui\models\rwkv-4_raven-ggml\ggml-rwkv-4_raven-14b-v9-Eng99%-20230412-ctx8192-Q4_1_0.bin
[Parts: 1, Threads: 15, SmartContext: True]

---
Identified as RWKV model: (ver 300)
Attempting to Load...
---

RWKV Init: State Buffer:4096512, Logit Buffer:201620
Reading vocab from C:\Users\Imi\AppData\Local\Temp\_MEI435042/rwkv_vocab.embd
RWKV Vocab: 50255
Load Model OK: True
Embedded Kobold Lite loaded.
Starting Kobold HTTP Server on port 5001

It didn't show anything about the model being ggml, and 'Processing Prompt' is only going up by about 1 token per second:

Processing Prompt (57 / 1502 tokens)127.0.0.1 - - [20/Apr/2023 00:57:05] "GET /api/v1/info/version HTTP/1.1" 200 -
127.0.0.1 - - [20/Apr/2023 00:57:05] "GET /api/v1/model HTTP/1.1" 200 -
Processing Prompt (274 / 1502 tokens)127.0.0.1 - - [20/Apr/2023 00:58:36] "GET /api/v1/info/version HTTP/1.1" 200 -
127.0.0.1 - - [20/Apr/2023 00:58:36] "GET /api/v1/model HTTP/1.1" 200 -
Processing Prompt (488 / 1502 tokens)127.0.0.1 - - [20/Apr/2023 01:00:07] "GET /api/v1/info/version HTTP/1.1" 200 -
127.0.0.1 - - [20/Apr/2023 01:00:07] "GET /api/v1/model HTTP/1.1" 200 -
Processing Prompt (662 / 1502 tokens)
LostRuins commented 1 year ago

Yes, you seem to be doing it correctly, but something (my guess is tavernAI) is spamming the endpoint repeatedly, which results in the multiple lines you see. Using the embedded Lite client should avoid that issue.

Otherwise generation should begin once processing is done. 14B Raven is quite a large model and may be a bit slow - I recommend trying a smaller one first.

Enferlain commented 1 year ago

I thought it would be similar to 13B models, but it seems like being RWKV means none of the speed-up optimizations work on it, so it's significantly slower than any other model I've tried so far. The other ggml models already work fine and are very fast; I just wanted to test whether this one did anything special, but it doesn't look like it, tbh. Other than the bigger context size, it's too much of a tradeoff when it comes to speed. It made my CPU work way harder, and took 10+ minutes to go through the same context the other ggml models get through in about a minute.

LostRuins commented 1 year ago

Yes, the RWKV implementation is not very optimized right now, but I am tracking rwkv.cpp for improvements. If they reach a significant breakthrough, I will merge it here.

saharNooby commented 1 year ago

@Enferlain Hi! What do you mean by "none of the speed up optimizations work on it"? Would be great to know what optimizations we can try to apply to rwkv.cpp.

10 minutes to process a 1.5K prompt looks plausible to me (400 ms per token). It should be a one-time cost though, and the application must then cache the state (and, preferably, intermediate states to be able to roll back/edit the prompt; and save these states to disk).
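
For illustration, here is a minimal sketch of that caching idea in Python. Everything in it is hypothetical: `eval_token` stands in for a single RWKV RNN step (it is not the rwkv.cpp API), and the point is only that the state after the prompt can be kept between requests, with occasional checkpoints so the prompt can be rolled back or edited.

```python
# Hypothetical sketch of the state caching described above; eval_token()
# is a dummy stand-in for a real single-token RWKV step, not rwkv.cpp's API.
from typing import Any, Dict, List, Optional, Tuple

def eval_token(token: int, state: Optional[Any]) -> Tuple[List[float], Any]:
    """Dummy RNN step: returns (logits, new_state). A real one would run the model."""
    new_state = (state or 0) + token      # pretend the state is just a running sum
    return [float(new_state)], new_state  # pretend logits

class CachedRWKVSession:
    def __init__(self, checkpoint_every: int = 64):
        self.tokens: List[int] = []            # tokens already folded into the state
        self.state: Optional[Any] = None       # RWKV state after self.tokens
        self.checkpoints: Dict[int, Any] = {}  # position -> state, for rollback/editing
        self.checkpoint_every = checkpoint_every

    def feed(self, new_tokens: List[int]):
        """Advance the state by new_tokens only; old tokens are never re-evaluated."""
        logits = None
        for tok in new_tokens:
            logits, self.state = eval_token(tok, self.state)
            self.tokens.append(tok)
            if len(self.tokens) % self.checkpoint_every == 0:
                # assumes eval_token returns a fresh state object each step
                self.checkpoints[len(self.tokens)] = self.state
        return logits

    def rollback(self, position: int):
        """Drop back to an earlier checkpoint so the prompt can be edited from there."""
        self.state = self.checkpoints[position]
        self.tokens = self.tokens[:position]
        self.checkpoints = {p: s for p, s in self.checkpoints.items() if p <= position}

session = CachedRWKVSession()
session.feed([1, 2, 3])   # pay the prompt cost once
session.feed([4, 5])      # later replies only pay for the newly added tokens
```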

LostRuins commented 1 year ago

@saharNooby I believe what Enferlain is referring to is the use of OpenBLAS to greatly speed up prompt ingestion by processing multiple tokens in a single batch as one massive matrix multiplication.

As RWKV requires the previous state to determine the next state, this will require a different approach. BlinkDL mentioned before that it might be possible to parallelize in the Channel dimension; have you considered that?
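
For context, a small NumPy illustration of that point (not koboldcpp or rwkv.cpp code; the shapes are invented): the T separate matrix-vector products of token-by-token evaluation collapse into one large matrix-matrix product, which a BLAS library can execute far more efficiently than the equivalent loop.

```python
# Illustration only: batched prompt ingestion as one big matmul.
# Shapes are invented for the example; this is not RWKV's actual math.
import numpy as np

d_model, T = 1024, 256                                     # hidden size, prompt length
W = np.random.randn(d_model, d_model).astype(np.float32)   # one weight matrix
x = np.random.randn(T, d_model).astype(np.float32)         # all prompt-token activations

# Token-by-token (RNN-style) evaluation: T small matrix-vector products.
out_loop = np.stack([x[t] @ W for t in range(T)])

# Batched ("GPT-mode") ingestion: one large matrix-matrix product,
# which BLAS can block and vectorize across the whole prompt.
out_batch = x @ W

print(np.abs(out_loop - out_batch).max())  # only tiny float rounding differences
```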

saharNooby commented 1 year ago

@LostRuins I see, batch processing of the prompt. No, I had not considered this. Intuitively, I don't feel we would win much performance on CPU, but that's just an intuition and I may be very wrong... For example, the memory access pattern would be quite different and more local, which could indeed increase speed.

Basically, what "GPT mode" of RWKV does is process all tokens in the batch layer by layer. Instead of for each token: for each layer: ... as in RNN mode, you go for each layer: for each token: .... Then, in the att block, the channels for each token are independent, so they can be computed in parallel. If you have hundreds of channels (which we do) and hundreds of cores (which we don't, since it's a CPU :( ), this becomes much faster than RNN mode.
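
To make the loop reordering concrete, here is a structure-only toy (the per-token step is a dummy, not the real RWKV layer math). Both orders produce identical numbers, because a layer's output at a given position depends only on the previous layer's outputs at positions up to that one; only the iteration order and the opportunity to vectorize over channels change.

```python
# Toy illustration of RNN-mode vs "GPT-mode" loop order; step() is a dummy,
# not the real RWKV layer computation.
import numpy as np

n_layers, n_tokens, n_channels = 4, 8, 16
tokens = np.random.randn(n_tokens, n_channels)

def step(x_t, state_t):
    """Dummy per-layer, per-token recurrence: returns (output, new_state)."""
    y = x_t + state_t
    return y, 0.5 * y

# RNN mode: for each token -> for each layer.
state = np.zeros((n_layers, n_channels))
rnn_out = np.empty_like(tokens)
for t in range(n_tokens):
    x = tokens[t]
    for l in range(n_layers):
        x, new_state = step(x, state[l])
        state[l] = new_state
    rnn_out[t] = x

# "GPT mode": for each layer -> for each token. Each layer sees the whole
# prompt at once; within a layer the recurrence is independent across
# channels, so NumPy vectorizes it over the channel axis here.
hidden = tokens.copy()
for l in range(n_layers):
    s = np.zeros(n_channels)
    for t in range(n_tokens):
        hidden[t], s = step(hidden[t], s)

assert np.allclose(rnn_out, hidden)  # same numbers, just a different loop order
```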

It looks like rwkv.cpp would need to be rewritten to support batch processing, which I'm not sure I'm ready to do now.

Enferlain commented 1 year ago

> 10 minutes to process a 1.5K prompt looks plausible to me (400 ms per token). It should be a one-time cost though, and the application must then cache the state (and, preferably, intermediate states to be able to roll back/edit the prompt; and save these states to disk).

That would make sense with caching, but I'm running it through koboldcpp, as of this update: "Now supports RWKV models WITHOUT pytorch or tokenizers! Yep, just GGML!"

I just tested it again, and it wants to reprocess the 1.5k tokens for every reply it generates, which is much slower than the other models. I'm not sure if it's something in the koboldcpp implementation or if this is how it's supposed to work.

LostRuins commented 1 year ago

@Enferlain koboldcpp uses the same eval function as rwkv.cpp, so your results are likely to be similar. We are not using a different backend. We just have some optimizations to reuse the context in certain scenarios when the new context is a continuation of the old one.
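
As a rough sketch of that kind of context reuse (a generic illustration, not koboldcpp's actual code): compare the incoming token sequence against the one whose final state is cached, and only evaluate the part that extends it; anything else falls back to a full re-evaluation.

```python
# Generic sketch of continuation-based context reuse; eval_sequence() is a
# hypothetical helper standing in for real RWKV evaluation, and this is not
# koboldcpp's actual logic.
from typing import Any, List, Optional

def eval_sequence(tokens: List[int], state: Optional[Any]) -> Any:
    """Placeholder: fold `tokens` into `state` one step at a time."""
    for tok in tokens:
        state = (state or 0) + tok  # dummy stand-in for the real RNN step
    return state

cached_tokens: List[int] = []
cached_state: Optional[Any] = None

def process_context(new_tokens: List[int]) -> Any:
    """Reuse the cached state when new_tokens is a continuation of cached_tokens."""
    global cached_tokens, cached_state
    n = len(cached_tokens)
    if cached_state is not None and new_tokens[:n] == cached_tokens:
        state = eval_sequence(new_tokens[n:], cached_state)  # only the appended tokens
    else:
        state = eval_sequence(new_tokens, None)              # no reuse possible: start over
    cached_tokens, cached_state = list(new_tokens), state
    return state
```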