RWKV / rwkv.cpp

INT4/INT5/INT8 and FP16 inference on CPU for RWKV language model
MIT License

rwkv.cpp server #17

Open · abetlen opened this issue 1 year ago

abetlen commented 1 year ago

Hi @saharNooby, first off, amazing work on this repo. I've been looking for a CPU implementation of RWKV to experiment with the pre-trained models (I don't have a large GPU).

I've put together a basic port of my OpenAI-compatible webserver from llama-cpp-python and tested it on Linux with your library and the RWKV Raven 3B model in f16, q4_0, and q4_1 (pictured below). Going to try some larger models this weekend to test performance / quality. The cool thing about exposing the model through this server is that it opens the project up to be connected to any OpenAI client (langchain, chat UIs, multi-language client libraries).
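
To make the "any OpenAI client" point concrete, here is a minimal usage sketch assuming the pre-1.0 `openai` Python package; the base URL, API key, and model name below are placeholders, not values from this setup:

```python
# Sketch only: point the official openai client at a locally running
# OpenAI-compatible server. Host, port, and model name are assumptions.
import openai

openai.api_base = "http://localhost:8000/v1"
openai.api_key = "sk-not-needed-for-a-local-server"

completion = openai.Completion.create(
    model="rwkv-raven-3b-q4_1",
    prompt="Write a hello world program in Python.",
    max_tokens=64,
    temperature=0.8,
)
print(completion["choices"][0]["text"])
```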

Let me know if you want me to put a PR to merge this in somewhere and if so the best place to put it.

Cheers, and again great work!

[Screenshot: the server running RWKV Raven 3B in f16, q4_0, and q4_1]

saharNooby commented 1 year ago

Hi @abetlen, thanks!

Having an OpenAI-compatible inference server is indeed great and will definitely increase the usability of rwkv.cpp. I've written one for myself, but it's too crude to be open-sourced. If your server is contained in a single file and has almost no dependencies (aside from flask), I think it can be put straight into the rwkv directory in this repo.

If the server requires more structure, then I'm not sure... Having it inside a subdirectory may not work: last time I checked, Python does not like referencing .py files in subdirectories. Maybe in that case rwkv.cpp should become a pip package, you could create a separate repo for the server, and I'd link it in the README here. I'm not sure how to create a pip package, though, given that we need to build a C library for it.

Please also don't forget to check out the new Q4_1_O format, which will be merged in this PR soon. Q4_0 and Q4_1, inherited from llama.cpp, heavily reduce the quality of RWKV, so a new format was needed -- it preserves outlier weights better and does not quantize activations. Details are in this issue. I'll also add a simple perplexity measuring script soon, so everyone can verify the results.

I can also suggest implementing a state cache, like this. The idea is to cache states by the hash of the prompt string, so during the next inference call we don't need to go through the whole prompt again -- I can't overstate how slow long chatbot conversations on CPU are without a cache. You decide, though; the cache is an additional complication after all :)
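
To illustrate the caching idea (this is just a sketch, not code from this repo; `model.eval` is a hypothetical per-token evaluation call standing in for the real API):

```python
# Rough sketch of a prompt state cache keyed by the prompt's hash.
# `model.eval(token, state)` is a hypothetical API that returns
# (logits, new_state) after processing one token.
import hashlib

state_cache = {}  # sha256(prompt) -> (state, logits)

def eval_prompt(model, prompt, tokens):
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    if key in state_cache:
        # The prompt was already processed: reuse its state instead of
        # re-evaluating every token, which is very slow on CPU.
        return state_cache[key]

    state, logits = None, None
    for token in tokens:
        logits, state = model.eval(token, state)

    state_cache[key] = (state, logits)
    return state, logits
```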

abetlen commented 1 year ago

Thanks for the reply @saharNooby

I'll see about putting it into a single file; right now it depends on 3 packages: fastapi (framework), sse_starlette (server-sent events), and uvicorn (server). And thank you for sharing that cache implementation, I'll be sure to integrate it (it's actually something I'm working on for the llama.cpp server too)!

I think a pip package would be very useful; if you need any help putting that together I'd be happy to assist. I have one for my llama.cpp Python bindings, and the approach I took for the server is to distribute it as an optional extra (i.e. `pip install llama-cpp-python[server]`).

To handle the C library dependency I ended up using scikit-build, which supports building native shared libraries. That way, when users do a pip install it builds from source on their system, ensuring the proper optimisations are selected. Let me know and I can put together a PR to get you going in that regard; I'd gladly share any bugfixes between the projects.
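
For reference, a bare-bones sketch of what that could look like here; the package name, version, and the `server` extra are made up for illustration and would need to match this repo's actual CMake build:

```python
# Hypothetical setup.py using scikit-build; all names below are assumptions.
from skbuild import setup

setup(
    name="rwkv-cpp",
    version="0.0.1",
    description="Python bindings for rwkv.cpp",
    packages=["rwkv"],
    # scikit-build drives CMake, so the shared library is compiled from
    # source on the user's machine during `pip install`.
    cmake_install_dir="rwkv",
    extras_require={
        # `pip install rwkv-cpp[server]` would pull in the server deps.
        "server": ["fastapi", "uvicorn", "sse_starlette"],
    },
)
```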

PS: Will definitely check out that new quantization format, thanks!

ss-zheng commented 1 year ago

Have we merged this change yet?

saharNooby commented 1 year ago

@ss-zheng As far as I know, adding the server requires #21 to be merged, and it is not merged yet.

But there are already OpenAI-compatible REST servers that support rwkv.cpp, like https://github.com/go-skynet/LocalAI

ss-zheng commented 1 year ago

Great, thanks for pointing me to it!

alienatorZ commented 1 year ago

I think it would be great to have a plain Python server rather than LocalAI's Docker build process.

edbarz9 commented 1 year ago

> If your server is contained in a single file and has almost no dependencies (aside from flask), I think it can be put straight into the rwkv directory in this repo.

So, I made this fork on my git server (https://git.brz9.dev/ed/rwkv.cpp) with an extra `flask_server.py` in `rwkv/`.

It can be run with `python rwkv/flask_server.py --model <path_to_model> --port 5349`

Then it can be queried with:

`curl -X POST -H "Content-Type: application/json" -d '{"prompt":"Write a hello world program in python", "temperature":0.8, "top_p":0.2, "max_length":250}' http://127.0.0.1:5349/chat`

This is still a work in progress, but it would be a small addition to the codebase.
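
For anyone curious, the endpoint boils down to something like the sketch below; `generate` is a placeholder for the actual rwkv.cpp tokenize/eval/sample loop, not the real code from the fork:

```python
# Stripped-down sketch of the /chat endpoint; `generate` is a stub standing
# in for the real rwkv.cpp sampling loop.
from flask import Flask, jsonify, request

app = Flask(__name__)

def generate(prompt, temperature, top_p, max_length):
    # Placeholder: the real implementation tokenizes the prompt, runs the
    # model token by token, and samples with temperature / top_p.
    return f"(model output for: {prompt!r})"

@app.route("/chat", methods=["POST"])
def chat():
    body = request.get_json()
    completion = generate(
        body["prompt"],
        body.get("temperature", 0.8),
        body.get("top_p", 0.2),
        body.get("max_length", 250),
    )
    return jsonify({"completion": completion})

if __name__ == "__main__":
    app.run(host="127.0.0.1", port=5349)
```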

chymian commented 1 year ago

@saharNooby thank you for your great work. Is there any chance of getting @abetlen's server/API implemented? IMHO that would be a great gain: an easy OpenAI-compatible API without the burden of the LocalAI build process.