ggerganov / llama.cpp


llama : add RWKV models support #846

Open multimediaconverter opened 1 year ago

multimediaconverter commented 1 year ago

RWKV is a (100% RNN) language model, and the only RNN (as of now) that can match transformers in quality and scaling, while being faster and saving memory.

Info: https://github.com/BlinkDL/ChatRWKV

RWKV is a novel large language model architecture, with the largest model in the family having 14B parameters. In contrast to Transformers with O(n^2) attention, RWKV requires only the state from the previous step to calculate logits. This makes RWKV very CPU-friendly at large context lengths.
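For illustration, a minimal sketch of what this means in practice, assuming a hypothetical rwkv_forward(token, state) function that returns (logits, new_state): generation only ever carries a fixed-size state, never a growing cache.

```python
# Minimal sketch (hypothetical rwkv_forward API): per-token cost and memory
# stay constant because only the previous state is carried forward.
def generate(rwkv_forward, initial_state, prompt_tokens, n_new_tokens, sample):
    state, logits = initial_state, None
    for tok in prompt_tokens:              # feed the prompt token by token
        logits, state = rwkv_forward(tok, state)
    out = []
    for _ in range(n_new_tokens):
        tok = sample(logits)               # pick the next token from the logits
        out.append(tok)
        logits, state = rwkv_forward(tok, state)
    return out
```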

Experimental GGML port: https://github.com/saharNooby/rwkv.cpp

The latest "Raven"-series Alpaca-style-tuned RWKV 14B & 7B models are very good. Online demo: https://huggingface.co/spaces/BlinkDL/Raven-RWKV-7B Download: https://huggingface.co/BlinkDL/rwkv-4-raven


Edit by @ggerganov:

Adding @BlinkDL's comment below to OP for visibility:

v4 inference: https://github.com/BlinkDL/ChatRWKV/blob/main/RWKV_in_150_lines.py

v5 inference: https://github.com/BlinkDL/ChatRWKV/blob/main/RWKV_v5_demo.py

fast v4 & v5.2 inference: https://github.com/BlinkDL/ChatRWKV/blob/main/rwkv_pip_package/src/rwkv/model.py

v5.2 1.5B demo (great for its size): https://huggingface.co/spaces/BlinkDL/ChatRWKV-gradio

v5.2 1.5B benchmarks: https://twitter.com/BlinkDL_AI/status/1717543614434402661

a few remarks:

  • rwkv models have RNN-style "one" mode, and GPT-style "seq" mode
  • i am actually using exp(-exp(w))
  • seems it's good to precompute embedding+emb_layernorm in bf16
  • when using fp16, i am doing /2 every 6 layers, to avoid overflow
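To make the remarks above concrete, here is a simplified numpy sketch of the v4 time-mixing step, in the spirit of the 150-lines / 100-lines reference implementations but omitting their numerical stabilization; parameter names are illustrative, and the exp(-exp(w)) decay and the fp16 rescaling note are marked in comments.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Simplified RWKV v4 time-mixing for one token (a single RNN-mode step,
# no numerical stabilization). `decay` and `bonus` are the learned w and u.
def time_mixing(x, last_x, last_num, last_den,
                decay, bonus, mix_k, mix_v, mix_r, Wk, Wv, Wr, Wout):
    k = Wk @ (x * mix_k + last_x * (1 - mix_k))
    v = Wv @ (x * mix_v + last_x * (1 - mix_v))
    r = Wr @ (x * mix_r + last_x * (1 - mix_r))

    # WKV: decayed average of past values, with a "bonus" for the current token
    wkv = (last_num + np.exp(bonus + k) * v) / (last_den + np.exp(bonus + k))

    w = np.exp(-np.exp(decay))        # the exp(-exp(w)) form mentioned above
    num = w * last_num + np.exp(k) * v
    den = w * last_den + np.exp(k)

    return Wout @ (sigmoid(r) * wkv), (x, num, den)

# The fp16 remark above refers to halving the hidden activations every 6 layers
# (compensated elsewhere) so the running sums never overflow the fp16 range.
```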
Green-Sky commented 1 year ago

closing this in favor of https://github.com/ggerganov/ggml/issues/21

also https://github.com/saharNooby/rwkv.cpp seems to be it.

someone13574 commented 8 months ago

Now that support for other models is being added directly to llama.cpp, could RWKV support be reconsidered? Adding it here would mean RWKV gets all the benefits that llama.cpp has over a separate, RWKV-only project.

ggerganov commented 8 months ago

We should try to add it. It will probably be the most different from all the other arches we support, as it is RNN-based, so it will be a good exercise to see how easily it would fit into the existing framework.

BlinkDL commented 8 months ago

@ggerganov Please check these :)

v4 inference: https://github.com/BlinkDL/ChatRWKV/blob/main/RWKV_in_150_lines.py

v5 inference: https://github.com/BlinkDL/ChatRWKV/blob/main/RWKV_v5_demo.py

fast v4 & v5.2 inference: https://github.com/BlinkDL/ChatRWKV/blob/main/rwkv_pip_package/src/rwkv/model.py

v5.2 1.5B demo (great for its size): https://huggingface.co/spaces/BlinkDL/ChatRWKV-gradio

v5.2 1.5B benchmarks: https://twitter.com/BlinkDL_AI/status/1717543614434402661

a few remarks:

  • rwkv models have RNN-style "one" mode, and GPT-style "seq" mode
  • i am actually using exp(-exp(w))
  • seems it's good to precompute embedding+emb_layernorm in bf16
  • when using fp16, i am doing /2 every 6 layers, to avoid overflow

KerfuffleV2 commented 8 months ago

Not sure if it helps, but I have a GGML-based Rust implementation here: https://github.com/KerfuffleV2/smolrsrwkv/blob/main/smolrwkv/src/ggml/graph.rs (that's just v4 inference)

This is actually the reason I made my first contribution to the project: getting the map ops (now superseded) in, to work around what GGML didn't support. I think that's mostly still the case, so the majority of these will probably still need custom mapping: https://github.com/KerfuffleV2/smolrsrwkv/blob/main/smolrwkv/src/ggml/map_ops.rs (the one_minus one is mainly just an optimization).

saharNooby commented 8 months ago

Hi all! Maintainer of rwkv.cpp here.

Indeed, having a separate repository for RWKV leads to ggml version lag, a lack of computation backends (which I can't commit to supporting with my limited time), and other issues.

That said, I like the compactness and simplicity of the rwkv.cpp repository; huge repos like llama.cpp with 10K+ line C++ files scare me, though this is a subjective preference. I would not be able to commit to supporting an RWKV implementation in the llama.cpp repo.

In the end, users will decide :)


On a more practical note:

If support for RWKV is added to llama.cpp, I also suggest implementing a conversion script for handling model files in the rwkv.cpp format. The format is documented here. There are models hosted on Hugging Face in this format -- for example, here. kobold.cpp also supports this format.

Furthermore, if support for both RWKV v4 and RWKV v5 is implemented in llama.cpp, including conversion from the rwkv.cpp format, and there is a reasonable commitment from the maintainers of llama.cpp to fix bugs and add new versions of RWKV, I will be OK with marking rwkv.cpp as deprecated, adding a link to llama.cpp, and stopping maintenance of the repo.

Until then, my plan is to continue supporting rwkv.cpp, including adding RWKV v5 support sometime later.

I won't be able to help with migrating rwkv.cpp code to llama.cpp, but of course anyone is free to use rwkv.cpp as a reference (or even copy-paste code -- not sure how licensing works).

ggerganov commented 8 months ago

Hi @saharNooby - great work with rwkv.cpp

I'm mainly interested to see what llama.cpp would need in order to add support for a new arch that is more different from what we are used to. It turned out that all the LLMs we support so far are pretty much 99% the same thing, with a bias here and a norm there. So I'm not sure how well the framework would accommodate a model that is fundamentally different, assuming RWKV is one (I haven't even looked into the details, so I don't really know if this statement is true).

I'm looking forward to contributions as I doubt I will have the time to implement it myself. So we will have to see if RWKV support will end up in llama.cpp at all. In any case, it's too early and definitely do not deprecate rwkv.cpp at this point.

Alternatively, we should also look for other LLM architectures that would present some sort of a challenge and try to integrate them as well, in the same spirit to understand what llama.cpp needs to be more general-purpose.

saharNooby commented 8 months ago

what llama.cpp would need in order to add support for a new arch that is more different from what we are used to

Regarding ggml: for a long time rwkv.cpp has used vanilla ggml, and only recently was ggml forked and a crutch added to support very large cgraphs: Increase GGML_MAX_NODES from 4096 to 80000. But it looks like you've recently removed this node limit altogether. Overall, I don't expect any changes will be required in ggml to support RWKV.

Regarding the llama.cpp file: it looks like I get what you mean -- supporting a new architecture in that file and the surrounding infra (scripts, etc.) can indeed be difficult. Can't comment on that :)

that is fundamentally different, assuming RWKV is one

The only difference is that attention was replaced with WKV, which can be computed in a recurrent manner. Everything else -- layer structure, MLP, embed/unembed -- is the same as in Transformers. Some early versions of RWKV even use the popular 20B_tokenizer, although later ones use the custom World tokenizer, which would need to be implemented (it's simple and does not even require Unicode normalization).
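For reference, the World tokenizer is a greedy longest-match tokenizer over a byte-level vocabulary (the reference implementation in ChatRWKV uses a trie). A minimal sketch, assuming a `vocab` dict mapping token bytes to ids and an assumed `max_token_len` bound:

```python
# Greedy longest-match encoding over raw UTF-8 bytes -- no Unicode
# normalization needed. `vocab` maps token bytes -> token id.
def encode(text: str, vocab: dict, max_token_len: int = 128) -> list:
    data = text.encode("utf-8")
    ids, i = [], 0
    while i < len(data):
        for n in range(min(max_token_len, len(data) - i), 0, -1):
            piece = data[i:i + n]
            if piece in vocab:
                ids.append(vocab[piece])
                i += n
                break
        else:
            # the real vocab contains every single byte, so this never triggers there
            raise ValueError(f"no token for byte {data[i]!r}")
    return ids
```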

definitely do not deprecate rwkv.cpp at this point

Yep!

BlinkDL commented 8 months ago

I'm mainly interested to see what llama.cpp would need in order to add support for a new arch that is more different from what we are used to. It turned out that all the LLMs we support so far are pretty much 99% the same thing, with a bias here and a norm there. So I'm not sure how well the framework would accommodate a model that is fundamentally different, assuming RWKV is one (I haven't even looked into the details, so I don't really know if this statement is true).

the real difference is RWKV (and other "linear attention" models) uses a fixed-size state instead of a growing kv cache :)

so it's like:

output, new_state = model.forward(input, current_state)

and you can clone & save states, to make a "state cache" for various inputs to accelerate inference.
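A sketch of that state-cache idea, reusing the hypothetical model.forward(input, current_state) signature above: the state after a shared prefix (e.g. a long system prompt) is computed once, then cloned for every request that starts with it.

```python
import copy

state_cache = {}  # prompt prefix -> (logits, state) after processing that prefix

def prefill(model, tokens, state=None):
    logits = None
    for tok in tokens:
        logits, state = model.forward(tok, state)
    return logits, state

def run_with_cached_prefix(model, prefix_tokens, suffix_tokens):
    key = tuple(prefix_tokens)
    if key not in state_cache:
        state_cache[key] = prefill(model, prefix_tokens)   # pay the prefix cost once
    _, state = state_cache[key]
    # clone so the cached state is not mutated by this particular request
    return prefill(model, suffix_tokens, copy.deepcopy(state))
```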

BlinkDL commented 8 months ago

RWKV v4 in 100 lines (using numpy): https://johanwind.github.io/2023/03/23/rwkv_details.html

another blogpost: https://fullstackdeeplearning.com/blog/posts/rwkv-explainer/

v4 details: https://ben.bolte.cc/rwkv-model

RWKV zoom talk (TUE, NOV 7 · 9:30 AM CST): https://www.meetup.com/silicon-valley-generative-ai/events/296395124/

RWKV sf meet (Saturday, Nov 11 1:00pm PT): https://partiful.com/e/bi6lGCvZXCzZQNN5FjXW

Cyberhan123 commented 8 months ago

I'm excited to see rwkv's progress, I love this model.

KerfuffleV2 commented 8 months ago

Is there a way to make RWKV's state stuff fit in with the current concept of sequences and KV cache manipulation? Can you do parallel generation with multiple independent sequences?

KerfuffleV2 commented 8 months ago

If it's helpful, I asked some questions in the RWKV discord:


[2:06 AM] Kerfuffle: This might be a pretty dumb question, but just thinking about how RWKV could fit into llama.cpp. Probably the biggest thing is figuring out how it can work with llama.cpp's idea of batches and sequences and parallel generation. When doing generation, the API lets you add items to the batch, each one has: token id, sequence id, and position in the sequence. Then you call decode and it can run decode on all the items in the batch in parallel. The API also includes KV cache manipulation stuff, so for example you can undo generation of the last N tokens and that kind of thing. So now the actual question: Can you evaluate multiple independent sequences in parallel with RWKV? And also, can you edit the state kind of like the KV cache stuff when you are able to do something like remove some previously generated tokens from it?

[3:12 AM] Tomeno: you can run rwkv in parallel, but you can't edit the state like that - what you can do though is save and roll back to previous versions of the state cheaply

[3:20 AM] Kerfuffle: Thanks for the answer. Is there a way to save/roll back the state just for specific sequences when doing parallel generation?

[3:30 AM] Tomeno: well, i should say, save and load the state - the state is a "compressed" version of the entire context/sequence up to that point

[3:45 AM] Tomeno: so no, once it's processed, you can't separate the tokens that went into it

[3:46 AM] Tomeno: what you could do is something like save the state after every reply of a chatbot, and then you could load any point in that conversation back up and continue from there

[3:47 AM] Tomeno: or save a number of states to disk and load them back up at any time, no matter how long the input sequence was, the state is about the same size

[3:52 AM] Kerfuffle: Thanks again. I guess the main issue is keeping the state of sequences separate which I guess actually isn't possible.

[3:53 AM] Kerfuffle: Seems like it would be really hard to fit RWKV into llama.cpp as an alternative model architecture.

[4:17 AM] Kerfuffle: I feel like there's got to be a way to do separate sequences in general otherwise it's a HUGE strike against RWKV. Just for example, suppose I have an RWKV model that works as well as ChatGPT. I want to set up a website where people can query it. A service like that requires submitting queries in huge batches, doing a completely separate decode for each individual user just wouldn't work.

[4:20 AM] Tomeno: oh wait, i misunderstood what you meant

[4:20 AM] Tomeno: when you process multiple sequences in parallel, each of them has its own associated state

[4:21 AM] Tomeno: put very simply, the input to rwkv is state + next token

[4:23 AM] Kerfuffle: Ah, okay, good. Yeah, I have a vague idea of how it probably works then.

[4:23 AM] Tomeno: i thought when you wrote "roll back the state for specific sequences" you meant, like, take out a set of tokens from the context

[4:23 AM] Kerfuffle: You could just let each sequence have its own state and somehow do the calculation so the correct state is involved for each sequence.

[4:23 AM] Kerfuffle: You were correct. :) I was actually asking about both things.

[4:24 AM] Kerfuffle: I'm just generally trying to figure out how practical it is (or practical within my capabilities) to try to add RWKV support to llama.cpp

[4:24 AM] Tomeno: there were some demos of parallel inference posted recently though i have no idea how to find it

[4:25 AM] Kerfuffle: Well, the first step is knowing it's even possible, so that definitely helps.

[4:26 AM] Mathmagician: I think web-rwkv lets you inference multiple sequences in parallel


This is the web-rwkv implementation that was mentioned: https://github.com/cryscan/web-rwkv/

From that conversation, it seems like parallel generation wouldn't be too much of a problem. However, KV editing operations like rewinding seem like they would be extremely difficult. Tomeno mentioned saving the RWKV sequence state per token, which may be possible, but I'm guessing the per-token state is going to be too large to really make that practical. So I think the only way it could really work with llama.cpp's KV cache manipulation ops is to only allow completely clearing a sequence and nothing else.
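For what it's worth, a sketch of how per-sequence state might look from the API side (hypothetical names, not llama.cpp's actual API): each sequence id owns a fixed-size state, parallel decoding just advances each one, and the only cheap cache-like operation is clearing (or snapshotting/restoring) a whole sequence.

```python
# Hypothetical per-sequence state store for a recurrent model. There is no
# per-token cache to edit: a sequence's state can only be replaced wholesale.
class RecurrentStateStore:
    def __init__(self, make_initial_state):
        self.make_initial_state = make_initial_state
        self.states = {}                      # seq_id -> state

    def get(self, seq_id):
        return self.states.setdefault(seq_id, self.make_initial_state())

    def set(self, seq_id, state):
        self.states[seq_id] = state

    def clear(self, seq_id):                  # the "remove all tokens" equivalent
        self.states.pop(seq_id, None)

def decode_batch(model, store, batch):
    # batch: list of (seq_id, token); each sequence advances with its own state
    logits_per_seq = {}
    for seq_id, token in batch:
        logits, new_state = model.forward(token, store.get(seq_id))
        store.set(seq_id, new_state)
        logits_per_seq[seq_id] = logits
    return logits_per_seq
```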

On an unrelated note, a WebGPU backend seems like an interesting idea... web-rwkv uses WebGPU as its GPU backend. It actually ran pretty fast for me when I tried the example, and it probably would be possible to interface with the Rust wgpu crate from C++.

BlinkDL commented 8 months ago

you can save the RWKV state every n tokens, and you can save those states to RAM / disk.

KerfuffleV2 commented 8 months ago

you can save the RWKV state every n tokens, and you can save those states to RAM / disk.

I'm looking at it from the perspective of how it can be integrated into llama.cpp's existing architecture. How big is the state? For 3B World5 is it 2560x2560?

BlinkDL commented 8 months ago

(2+64)*2560 numbers for each block

32*(2+64)*2560 numbers for the full model
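Plugging in the 3B dimensions from this exchange (32 blocks, hidden size 2560, with 64 presumably being the head size), the full state works out to a few million numbers, i.e. on the order of 10-20 MiB per independent sequence:

```python
n_layer, n_embd, head_size = 32, 2560, 64     # RWKV v5 3B, per the numbers above

per_block = (2 + head_size) * n_embd          # 168,960 numbers per block
total     = n_layer * per_block               # 5,406,720 numbers for the full model

print(f"fp16: {total * 2 / 2**20:.1f} MiB, fp32: {total * 4 / 2**20:.1f} MiB")
# -> roughly 10.3 MiB (fp16) or 20.6 MiB (fp32) per sequence state
```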

19h commented 5 months ago

There's been renewed progress in the RWKV space with Eagle-7b: https://blog.rwkv.com/p/eagle-7b-soaring-past-transformers.

sorasoras commented 4 months ago

RWKV support should be reconsidered for llama.cpp, given the recent merge of Mamba SSM.

compilade commented 4 months ago

RWKV support should be reconsidered for llama.cpp, given the recent merge of Mamba SSM.

If nobody else does it, I'll have time to work on RWKV in llama.cpp starting in May (in a month and a half).

Mamba took me a bit more than a month to implement in llama.cpp (but basic inference (with --batch-size 1) had been working after the first week). I expect RWKV will be slightly easier to implement since part of the work has already been thought through (KV cache API compatibility with recurrent models). It would be nice to make simultaneous state processing with recurrent models not require a custom ggml operator for each state type, though. I'll think about ways to make it simpler when I get to it.

If anyone reading this is interested in working on this before I have more time, feel free to go ahead.

LaylBongers commented 4 months ago

I've taken up the task of implementing support for the RWKV v5 architecture. I've had some issues getting the included Python conversion code adapted for RWKV, however, and of course that is the first step to getting RWKV working. I've been working on a conversion tool this week that I'll likely publish soon, after which I'll start implementing the architecture within llama.cpp. I'll keep everyone up to date as I work on it.

hiepxanh commented 4 months ago

Great to know that 🥰🥰🥰

BlinkDL commented 3 months ago

please try the much stronger v6.0 World-2.1 model :) the design is similar to v5. 1b6 is done, 3b and 7b soon

https://huggingface.co/spaces/BlinkDL/RWKV-Gradio-1

https://twitter.com/BlinkDL_AI/status/1773503808221712722

@LaylBongers

The difference between v6 and v5: https://github.com/BlinkDL/ChatRWKV/blob/main/RWKV_v6_demo.py vs https://github.com/BlinkDL/ChatRWKV/blob/main/RWKV_v5_demo.py

LaylBongers commented 3 months ago

Over Easter we have a long weekend here, but I figured I'd give a few updates on my work on this:

On RWKV v6, I hadn't seen that demo yet! It looks straightforward to add both once one of the two is working.

LaylBongers commented 3 months ago

I got the tokenizer in and functional so far, working with the "tokenize" example. I'm considering submitting the tokenizer by itself as a small PR, to reduce review load, any thoughts on this?

ggerganov commented 3 months ago

Either way would be fine - the tokenizer alone might not be useful for anything other than RWKV, so there is no point in merging it alone

LaylBongers commented 3 months ago

I'm hitting some issues with the KV cache initialization, so I'm taking this moment to give an update on the work done so far.

WIP code is available here: https://github.com/RWKV/llama.cpp. Right now it contains just the tokenizer and an attempt at placeholder model loading and graph initialization.

This can be tested using a partially generated GGUF, created with gguf-swiss: https://huggingface.co/LaylBongers/temp-rwkvgguf-partial/tree/main

Currently I'm trying to track down an initialization issue:

ggml_backend_alloc_ctx_tensors_from_buft: all tensors in the context are already allocated
llama_kv_cache_init: failed to allocate buffer for kv cache
llama_new_context_with_model: llama_kv_cache_init() failed for self-attention cache
compilade commented 3 months ago

I'm hitting some issues with the KV cache initialization

The KV cache for recurrent models is sized from the GGUF metadata keys {model}.ssm.state_size, {model}.ssm.inner_size, and {model}.ssm.kernel_size. These get read into hparams.ssm_d_state, hparams.ssm_d_inner and hparams.ssm_d_conv, respectively.

The following are used to size the kv_self.k_l and kv_self.v_l tensors for recurrent models:

https://github.com/ggerganov/llama.cpp/blob/0d56246f4b9764158525d894b96606f6163c53a8/llama.cpp#L1865-L1875

If RWKV uses 2 different recurrent states (e.g. one for time mix and the other for channel mix, though I'm not yet sure how they are used), it might be useful to add a new metadata key for the stride of the convolution and make it 0 for RWKV (possibly called {model}.ssm.conv_stride). Otherwise, if only a single recurrent state is required, it should be enough to only use {model}.ssm.state_size and {model}.ssm.inner_size and the v_l tensors. I'd like to make it less Mamba-centric, and re-using metadata keys across RWKV and Mamba could achieve this, though it might make hybrids of the two harder in the future (though such hybrids don't seem likely, I think?).
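To illustrate the sizing described above (the exact expressions in llama.cpp are Mamba-specific, so treat this purely as an assumption of how RWKV could map onto the same metadata keys):

```python
# Rough sketch of per-layer recurrent state sizing from Mamba-style GGUF keys.
# Assumption: the conv-like ("k_l") part scales with kernel_size and can be
# dropped (set to 0) for RWKV, leaving only a d_state x d_inner ("v_l") state.
def recurrent_state_numbers(d_state, d_inner, d_conv):
    conv_state = (d_conv - 1) * d_inner if d_conv > 1 else 0
    ssm_state  = d_state * d_inner
    return conv_state, ssm_state

# e.g. an RWKV-style mapping might use d_conv = 0 and encode the whole
# per-layer state in d_state * d_inner.
```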

Re-using k_l and v_l for recurrent states isn't ideal and will be changed soon-ish (work-in-progress at https://github.com/ggerganov/llama.cpp/compare/master...compilade/refactor-kv-cache, which will be advancing further once I find more free time) to support hybrid recurrent Transformer models, and so recurrent models will be identified by their use of the relevant metadata keys for the recurrent state size. Parallel sequence management for recurrent models is also slightly simpler in that branch. This is a preview of what is coming next month.

LaylBongers commented 2 months ago

Another update; thanks for the notes! I've resolved the initial crash issues on initialization, though mostly with hacky temporary placeholders (like re-using ssm scope keys). I'll put up a new version of the temporary GGUF file on Monday. The remainder of the work is now to fill in the rest of the network graph, link it up with the KV cache hack for tracking state, and then start handling all the individual hacks one by one.

BlinkDL commented 1 month ago

More references: https://github.com/BlinkDL/RWKV-LM/blob/main/RWKV-v5/rwkv_v6_demo.py https://github.com/BlinkDL/ChatRWKV/blob/main/RWKV_v6_demo_cuda_bf16.py

BlinkDL commented 1 month ago

I got the tokenizer in and functional so far, working with the "tokenize" example. I'm considering submitting the tokenizer by itself as a small PR, to reduce review load, any thoughts on this?

please check the unit tests in https://github.com/BlinkDL/ChatRWKV/blob/main/tokenizer/rwkv_tokenizer.py (vs. the reference tokenizer), and please verify the binary length of each token (it must equal the number at the end of each line)
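A minimal sketch of that check, assuming the layout of rwkv_vocab_v20230424.txt (one token per line: index, a Python-literal token, and its byte length at the end of the line):

```python
import ast

# Verify every token's byte length against the number at the end of its line.
def check_vocab(path="rwkv_vocab_v20230424.txt"):
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            tok = ast.literal_eval(line[line.index(' '):line.rindex(' ')])
            tok = tok.encode("utf-8") if isinstance(tok, str) else tok
            assert isinstance(tok, bytes)
            assert len(tok) == int(line[line.rindex(' '):]), f"length mismatch: {line!r}"
```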

BlinkDL commented 1 week ago

https://github.com/RWKV/rwkv.cpp supports v6 now