LostRuins / koboldcpp

Run GGUF models easily with a KoboldAI UI. One File. Zero Install.
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

Kobold UI's OpenAI-compatible API not working with RunPod (Serverless vLLM) #973

Open morbidCode opened 3 months ago

morbidCode commented 3 months ago

First, I'd like to say that the RunPod image you created for koboldcpp is amazing; I'm currently using it. But for larger models, container pods are just too expensive for me, so I am experimenting with serverless.

I found out there is a thing called "Serverless vLLM" where you can deploy models and get an OpenAI-compatible API (this is billed per second, so you only pay for what you use, and it is much cheaper).

As an experiment, I configured the openchat/openchat-3.5-0106 model using this tutorial (https://docs.runpod.io/serverless/workers/vllm/get-started), and it is working fine. This is just the API, though; there is no UI because it is serverless.

But then, I found out koboldai.net can handle OpenAI APIs. So I did the following:

And here I got an error.

TypeError: Failed to fetch

I tried manually inputting "openchat/openchat-3.5-0106" on the "use custom" tab, but I got the same error when I started generating.

I'm not sure how you fetch the models, but I know that OpenAI has an endpoint that lets you list the available models, while RunPod/vLLM doesn't have one, AFAIK. Otherwise, I think this should work.
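(For context, here is a minimal sketch of how a browser frontend typically discovers models on an OpenAI-compatible server; the base URL and key are placeholders, not Lite's actual code. Both a missing /v1/models route and a CORS rejection surface in the browser as the same `TypeError: Failed to fetch`.)

```ts
// Hypothetical sketch: listing models on an OpenAI-compatible server from a
// browser UI. BASE_URL and API_KEY are placeholders.
const BASE_URL = "https://my-serverless-endpoint.example/v1"; // placeholder
const API_KEY = "sk-...";                                      // placeholder

async function listModels(): Promise<string[]> {
  // If the server has no /models route, or rejects the browser via CORS,
  // fetch() throws "TypeError: Failed to fetch" before any HTTP status is seen.
  const res = await fetch(`${BASE_URL}/models`, {
    headers: { Authorization: `Bearer ${API_KEY}` },
  });
  if (!res.ok) throw new Error(`HTTP ${res.status}`);
  const body = await res.json(); // OpenAI shape: { data: [{ id: "..." }, ...] }
  return body.data.map((m: { id: string }) => m.id);
}
```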

Is there a way to make this work? Say, if we know the model beforehand, could we just pass the model parameter in the request body and not fetch the model list first? IMO this could be perfect: we could deploy large models much more cheaply thanks to serverless, while getting the benefits of KoboldAI (not sure about other frontends like KoboldAI United or SillyTavern). I will be happy to help on the JS side.
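(As a rough illustration of that idea, a hedged sketch of sending a chat completion with the model name hard-coded, so no model-listing call is needed; the URL and key are placeholders, not an actual deployment.)

```ts
// Hypothetical sketch: skip the model-list fetch and send the known model
// name directly in the request body. URL, key, and settings are placeholders.
async function chatOnce(prompt: string): Promise<string> {
  const res = await fetch(
    "https://my-serverless-endpoint.example/v1/chat/completions", // placeholder
    {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: "Bearer sk-...", // placeholder key
      },
      body: JSON.stringify({
        model: "openchat/openchat-3.5-0106", // known in advance, never fetched
        messages: [{ role: "user", content: prompt }],
        max_tokens: 256,
        temperature: 0.7,
      }),
    },
  );
  if (!res.ok) throw new Error(`HTTP ${res.status}`);
  const body = await res.json();
  return body.choices[0].message.content; // standard OpenAI response shape
}
```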

Thanks!

LostRuins commented 3 months ago

If the endpoint is indeed OpenAI-compatible, it should definitely work. When you load it, try without the /v1 at the end. If it still doesn't work, you can also try the "chat completions API", like this:

image

morbidCode commented 3 months ago

Hi @LostRuins, my vLLM pod is broken now, so I can't try your suggestions. But I think this vLLM approach might not really be a good idea, since it is only OpenAI-compatible, so we can't apply parameters such as min_p, mirostat, etc. I think it would be nice to have an endpoint that actually accepts the koboldcpp request body.

Have you thought of making the koboldcpp image serverless instead? Like we add the model and pass in arguments as always, but deploy just the API. Then we could use it on koboldai.net by selecting the "KoboldAI Remote API" dropdown. And since it would really be a Kobold server, I think streaming could work out of the box (just like my request here: https://github.com/LostRuins/koboldcpp/issues/966).

Note that I am always talking about the Kobold Lite UI here, because KoboldAI United is not screen-reader friendly, so I don't know whether what I am saying about the Lite UI holds there or not. What do you think?
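(To make the sampler point concrete, a hedged sketch of the kind of body a native KoboldAI-style generate request can carry; the field names are my reading of the koboldcpp API and should be treated as assumptions. The point is that samplers like min_p and mirostat have no equivalent field in the OpenAI chat-completions schema.)

```ts
// Hedged sketch of a native KoboldAI-style generate request. Field names are
// assumptions based on my reading of the koboldcpp API, not a verified spec.
const koboldRequest = {
  prompt: "### Instruction:\nWrite a haiku about servers.\n### Response:\n",
  max_context_length: 4096,
  max_length: 200,
  temperature: 0.8,
  top_p: 0.95,
  top_k: 0,
  min_p: 0.05,        // no equivalent in the OpenAI schema
  rep_pen: 1.1,
  mirostat: 2,        // no equivalent in the OpenAI schema
  mirostat_tau: 5.0,
  mirostat_eta: 0.1,
  stop_sequence: ["### Instruction:"],
};

// POSTed to the KoboldAI generate endpoint (URL is a placeholder):
// await fetch("https://my-kobold-host.example/api/v1/generate", {
//   method: "POST",
//   headers: { "Content-Type": "application/json" },
//   body: JSON.stringify(koboldRequest),
// });
```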

morbidCode commented 3 months ago

Ah, is this even possible? I have experience with AWS Lambda, and it is not possible to save state there unless you use something like S3. Even then, the koboldcpp server would need to start on every run ...

LostRuins commented 3 months ago

Oh, the KoboldAI endpoint is available as well. I just listed the OpenAI one since it's more popular. You can connect to the KoboldAI one from the provider dropdown box.

image

henk717 commented 2 months ago

@morbidCode I have thought of it and discussed it with the RunPod team before, but it's not worth the effort for me. Their API is pretty programming-hostile, since it forces us to adhere to their own API standards, so I can't map the proper KoboldAI API onto a serverless endpoint. If they ever introduce proper serverless templates that serve a web proxy instead and detect when someone is trying to access the URL, then I'll immediately jump on that with a koboldcpp template, since it would then seamlessly support everything we are doing.
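(For readers unfamiliar with the constraint being described: as I understand it, RunPod serverless endpoints expect every request wrapped in the platform's own envelope and sent to the platform's own routes, roughly as in the hedged sketch below, rather than exposing arbitrary HTTP paths such as /api/v1/generate. The endpoint ID and key are placeholders, and the exact conventions are an assumption.)

```ts
// Hedged sketch of the RunPod serverless calling convention as I understand
// it: payloads are wrapped in an "input" envelope and posted to the
// platform's /runsync route, so a client expecting the KoboldAI routes
// cannot talk to the worker directly. ENDPOINT_ID and key are placeholders.
const ENDPOINT_ID = "abc123"; // placeholder
const RUNPOD_KEY = "rp_...";  // placeholder

async function runServerlessJob(payload: unknown) {
  const res = await fetch(`https://api.runpod.ai/v2/${ENDPOINT_ID}/runsync`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${RUNPOD_KEY}`,
    },
    // The platform dictates this envelope; the worker only ever sees "input".
    body: JSON.stringify({ input: payload }),
  });
  return res.json(); // job result, in the platform's own response shape
}
```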

I attempted the serverless API and it's blocked by CORS. I'll have a word with the RunPod team on that one.
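(For what it's worth, a hedged sketch of the CORS response headers a browser-based client like Lite would need the serverless API to return before cross-origin fetch() calls can succeed; the values are illustrative, not RunPod's actual configuration.)

```ts
// Illustrative CORS headers the API would need to return (on both the
// OPTIONS preflight and the real response) for a browser client to connect.
// Values are examples, not RunPod's actual configuration.
const corsHeaders = {
  "Access-Control-Allow-Origin": "https://koboldai.net", // or "*"
  "Access-Control-Allow-Methods": "GET, POST, OPTIONS",
  "Access-Control-Allow-Headers": "Content-Type, Authorization",
};
// Without these, the browser blocks the request and the UI only ever sees
// "TypeError: Failed to fetch".
```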