abetlen opened this issue 1 year ago
Hi @abetlen, thanks!
Having an OpenAI-compatible inference server is indeed great and will definitely increase the usability of rwkv.cpp. I've written one for myself, but it's too crude to be open-sourced. If your server is contained in a single file and has almost no dependencies (aside from flask), I think it can be put straight into the rwkv directory in this repo.
If the server requires more structure, then I'm not sure... Having it inside a subdirectory may not work: last time I checked, Python does not like referencing .py files in subdirectories. Maybe in that case rwkv.cpp should become a pip package, you could create a separate repo for the server, and I would link it in the README here. I'm not sure how to create a pip package though, given that we need to build a C library for it.
Please also don't forget to check out the new Q4_1_O format, which will be merged in this PR soon. Q4_0 and Q4_1, inherited from llama.cpp, heavily reduce the quality of RWKV, so a new format was needed -- it preserves outlier weights better and does not quantize activations. Details are in this issue. I'll also add a simple perplexity measuring script soon, so everyone can verify the results.
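Purely to illustrate the outlier idea described above (this is not the actual Q4_1_O on-disk layout; block size, storage, and dequantization details here are made up for the sketch): each block keeps its single largest-magnitude weight in full precision and quantizes the rest with a Q4_1-style per-block offset and scale.

```python
import numpy as np

def quantize_block_outlier_aware(block: np.ndarray, bits: int = 4):
    """Toy sketch of outlier-aware block quantization.

    The largest-magnitude weight ("outlier") is stored unquantized; the
    remaining weights get an affine (min + scale) code with `bits` bits.
    """
    outlier_idx = int(np.argmax(np.abs(block)))
    outlier_val = float(block[outlier_idx])

    rest = np.delete(block, outlier_idx)
    lo, hi = rest.min(), rest.max()
    scale = (hi - lo) / (2**bits - 1) or 1.0
    codes = np.round((rest - lo) / scale).astype(np.uint8)

    return codes, lo, scale, outlier_idx, outlier_val

def dequantize_block(codes, lo, scale, outlier_idx, outlier_val):
    """Reverse the toy quantization, restoring the outlier at its position."""
    rest = codes.astype(np.float32) * scale + lo
    return np.insert(rest, outlier_idx, outlier_val)
```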
I can also suggest implementing a state cache, like this. The idea is to cache states by a hash of the prompt string, so that on the next inference call we don't need to go through the whole prompt again -- I can't overstate how slow long conversations with chatbots on CPU are without a cache. It's your call though, the cache is an additional complication after all :)
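A minimal sketch of that idea (the names here, including the `eval_tokens` callback, are placeholders and not part of rwkv.cpp's API): remember the model state produced after processing a prompt, keyed by a hash of the prompt text, so a repeat request can skip re-processing it.

```python
import hashlib

class StateCache:
    """Maps a hash of the prompt string to the RWKV state produced by it."""

    def __init__(self, max_entries: int = 16):
        self.cache = {}            # prompt hash -> model state
        self.max_entries = max_entries

    @staticmethod
    def _key(prompt: str) -> str:
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def get(self, prompt: str):
        return self.cache.get(self._key(prompt))

    def put(self, prompt: str, state) -> None:
        if len(self.cache) >= self.max_entries:
            self.cache.pop(next(iter(self.cache)))  # drop the oldest entry
        self.cache[self._key(prompt)] = state


def run_prompt(prompt: str, eval_tokens, cache: StateCache):
    """Reuse the cached state for a previously seen prompt, otherwise
    process the prompt once and remember the resulting state."""
    state = cache.get(prompt)
    if state is None:
        state = eval_tokens(prompt, state=None)
        cache.put(prompt, state)
    return state
```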
Thanks for the reply @saharNooby
I'll see about putting it into a single file; right now it depends on 3 packages: fastapi (framework), sse_starlette (to handle server-sent events), and uvicorn (server). And thank you for sharing that cache implementation, I'll be sure to integrate it (it's something I'm working on for the llama.cpp server as well)!
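To make those dependencies concrete, here is a stripped-down sketch of what such a server looks like with those three packages (the endpoint shape and the token generator are placeholders, not the actual port):

```python
from fastapi import FastAPI
from pydantic import BaseModel
from sse_starlette.sse import EventSourceResponse
import uvicorn

app = FastAPI()

class CompletionRequest(BaseModel):
    prompt: str
    max_tokens: int = 128
    temperature: float = 0.8
    stream: bool = False

def generate_tokens(req: CompletionRequest):
    # Placeholder: the real server would run the RWKV model here.
    yield from ["Hello", ",", " world"]

@app.post("/v1/completions")
def completions(req: CompletionRequest):
    if req.stream:
        # Server-sent events: one chunk per generated token.
        return EventSourceResponse({"data": tok} for tok in generate_tokens(req))
    return {"choices": [{"text": "".join(generate_tokens(req))}]}

if __name__ == "__main__":
    uvicorn.run(app, host="127.0.0.1", port=8000)
```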
I think a pip package would be very useful; if you need any help putting that together I'd be happy to assist. I have one for my llama.cpp Python bindings, and the approach I took for the server is to distribute it as an optional extra (i.e. pip install llama-cpp-python[server]).
To handle the C library dependency I ended up using scikit-build, which has support for building native shared libraries. That way, when users do a pip install, it builds from source on their system, ensuring the proper optimisations are selected. Let me know and I can put together a PR to get you going in that regard, and I'd gladly share any bugfixes between the projects.
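For reference, the rough shape of that scikit-build approach (package name, dependency list, and the `[server]` extra below are illustrative; it assumes a CMakeLists.txt at the repo root that builds the shared library):

```python
# setup.py -- sketch of a scikit-build based package for the bindings.
from skbuild import setup

setup(
    name="rwkv-cpp-python",          # hypothetical package name
    version="0.1.0",
    packages=["rwkv"],
    cmake_install_dir="rwkv",        # place the built shared library next to the Python code
    install_requires=["numpy", "tokenizers"],
    extras_require={
        # `pip install rwkv-cpp-python[server]` pulls in the web server deps
        "server": ["fastapi", "sse_starlette", "uvicorn"],
    },
)
```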
PS: Will definitely check out that new quantization format, thanks!
Have we merged this change yet?
@ss-zheng As far as I know, adding the server requires #21 to be merged, and it is not merged yet.
But there are already OpenAI-compatible REST servers that support rwkv.cpp, like https://github.com/go-skynet/LocalAI
Great, thanks for pointing me to it!
I think it would be great to have a plain Python server rather than LocalAI's Docker build process.
So, I made this fork on my git server https://git.brz9.dev/ed/rwkv.cpp with this extra flask_server.py in rwkv/.
It can be run with $ python rwkv/flask_server.py --model <path_to_model> --port 5349
It can then be queried with:
curl -X POST -H "Content-Type: application/json" -d '{"prompt":"Write a hello world program in python", "temperature":0.8, "top_p":0.2, "max_length":250}' http://127.0.0.1:5349/chat
This is still a work in progress, but it would be a small addition to the codebase.
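For anyone curious, the core of such a Flask endpoint is roughly the following (a sketch only; `generate` stands in for the actual sampling loop in flask_server.py, and the response format is assumed):

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

def generate(prompt: str, temperature: float, top_p: float, max_length: int) -> str:
    # Placeholder for the actual RWKV sampling loop.
    raise NotImplementedError

@app.route("/chat", methods=["POST"])
def chat():
    body = request.get_json()
    text = generate(
        prompt=body["prompt"],
        temperature=body.get("temperature", 0.8),
        top_p=body.get("top_p", 0.5),
        max_length=body.get("max_length", 250),
    )
    return jsonify({"response": text})

if __name__ == "__main__":
    app.run(port=5349)
```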
@saharNooby thank you for your great work. Is there any chance of getting @abetlen's server/API implemented? IMHO that would be a great gain for an easy OpenAI-compatible API, without the extra burden of the LocalAI build process.
Hi @saharNooby, first off, amazing work on this repo. I've been looking for a CPU implementation of RWKV to experiment with the pre-trained models (I don't have a large GPU).
I've put together a basic port of my OpenAI-compatible webserver from llama-cpp-python and tested it on Linux with your library and the RWKV Raven 3B model in f16, q4_0, and q4_1 (pictured below). I'm going to try some larger models this weekend to test performance / quality. The cool thing about exposing the model through this server is that it opens the project up to be connected to any OpenAI client (langchain, chat UIs, multi-language client libraries).
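For example (assuming the server exposes an OpenAI-style /v1 endpoint on localhost; the port and model name below are placeholders), an off-the-shelf OpenAI client can simply be pointed at it using the pre-1.0 openai Python library:

```python
import openai

openai.api_key = "sk-not-needed"              # a local server typically ignores the key
openai.api_base = "http://127.0.0.1:8000/v1"  # point the client at the local server

resp = openai.Completion.create(
    model="rwkv-raven-3b",                    # placeholder model name
    prompt="Write a hello world program in Python.",
    max_tokens=128,
)
print(resp["choices"][0]["text"])
```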
Let me know if you'd like me to open a PR to merge this in somewhere, and if so, the best place to put it.
Cheers, and again great work!