If it helps the implementation, using the lmstudio config, the koboldcpp terminal gives this response when sending a prompt:
Processing Prompt [BLAS] (509 / 509 tokens)
Generating (10 / 100 tokens) (EOS token triggered!)
ContextLimit: 519/2048, Processing:11.37s (22.3ms/T), Generation:1.19s (118.9ms/T), Total:12.56s (0.80T/s)
Output: Hello! How can I help you today?
I couldn't connect to koboldcpp. I thought support for such a convenient backend would be built in right away. How can I connect?
Koboldcpp says its API is OpenAI compatible. But if I configure LocalAI or LM Studio endpoints to point to Koboldcpp, I get the same truncation experience as the OP. Maybe it is a configuration issue in Koboldcpp?
One motivation I can add for Koboldcpp support, other than it being a really convenient and configurable LLM engine, is that it is the only way to get hardware acceleration for older AMD cards that are not officially supported by ROCm (I have an RX 6600).
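For anyone who wants to test the OpenAI-compatible endpoint directly, something along these lines can be used as a starting point. This is a minimal sketch, assuming koboldcpp's default base URL on port 5001 and the Python `requests` library; the model name is only a placeholder, since koboldcpp serves whichever model it has loaded.

# Minimal sketch: non-streaming chat completion against koboldcpp's
# OpenAI-compatible API. Assumes the default port 5001; the model name
# below is a placeholder, not something koboldcpp requires.
import requests

BASE_URL = "http://localhost:5001/v1"

payload = {
    "model": "koboldcpp",  # placeholder name
    "messages": [{"role": "user", "content": "hello"}],
    "temperature": 0.7,
    "stream": False,
}

resp = requests.post(f"{BASE_URL}/chat/completions", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])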
If it's OpenAI compatible, can't the Generic OpenAI connector (last LLM connector) work here?
Thank you for adding the Koboldcpp connection options. However, can we re-open the issue? The original truncation issue still persists with the latest version of AnythingLLM using the new Koboldcpp connector:
While in the Koboldcpp server logs, I see that the whole message is generated:
...
Input: {"model": "koboldcpp/Meta-Llama-3-8B-Instruct-Q5_K_M", "stream": true, "messages": [{"role": "system", "content": "Given the following conversation, relevant context, and a follow up question, reply with an answer to the current question the user is asking. Return only your response to the question given the above information following the users instructions as needed."}, {"role": "user", "content": "hello"}], "temperature": 0.7}
Processing Prompt [BLAS] (61 / 61 tokens)
Generating (100 / 100 tokens)
CtxLimit: 161/8192, Process:0.01s (0.2ms/T = 4066.67T/s), Generate:2.88s (28.8ms/T = 34.76T/s), Total:2.89s (34.58T/s)
Output: Hello! How can I assist you today? What's on your mind?
...
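For reference, the stream itself can be checked outside AnythingLLM with something like the following. It is a rough sketch of an OpenAI-style SSE consumer mirroring the "stream": true request in the log above; it is not AnythingLLM's actual code, and it assumes the default port 5001 and the `requests` library.

# Rough sketch: consume koboldcpp's OpenAI-compatible SSE stream and
# accumulate the deltas, to confirm the full message arrives over the
# wire when streaming is enabled. Not AnythingLLM's implementation.
import json
import requests

BASE_URL = "http://localhost:5001/v1"

payload = {
    "model": "koboldcpp/Meta-Llama-3-8B-Instruct-Q5_K_M",
    "stream": True,
    "messages": [{"role": "user", "content": "hello"}],
    "temperature": 0.7,
}

chunks = []
with requests.post(f"{BASE_URL}/chat/completions", json=payload, stream=True, timeout=120) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        data = line[len(b"data: "):]
        if data == b"[DONE]":
            break
        delta = json.loads(data)["choices"][0].get("delta", {})
        chunks.append(delta.get("content") or "")

print("".join(chunks))  # should match the Output line in the koboldcpp log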
@shatfield4
@zacanbot Can you give me any more information on how to replicate this bug? I have downloaded the same Llama3 model you are using and the streaming is working fine for me and showing the entire message inside AnythingLLM. Are you running the latest version of KoboldCPP? Did you change any config settings inside KoboldCPP?
I just updated to the latest version (1.64) and it seems to be working correctly now! Thanks for digging into this. Appreciated 👍
For some reason it's not working for me again. I just downloaded a new version of AnythingLLM. Koboldcpp version 1.65 doesn't let me select a model; there's just an empty window. Apparently something has broken again.
Then this is likely because whatever you have put in as the baseURL is not correct. Does http://localhost:5001/v1/models even return data in the browser?
cc @shatfield4
Yes, the browser opens the link http://localhost:5001/v1/models, and I can also pull the value out in Python using the API. Should the base URL be "http://localhost:5001/v1"? That path is written in the tooltip.
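(Roughly this kind of check, for reference: a small sketch assuming the default port 5001 and the `requests` library.)

# Small sketch: verify that the OpenAI-compatible models endpoint
# returns data, and list the model IDs it reports.
import requests

BASE_URL = "http://localhost:5001/v1"

resp = requests.get(f"{BASE_URL}/models", timeout=10)
resp.raise_for_status()
for model in resp.json().get("data", []):
    print(model.get("id"))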
Exact same issue for me. Koboldcpp has a few different API options and none of them are loading with AnythingLLM, but other clients, including koboldai, koboldlite, and SillyTavern, can all use it without issue.
I managed to work around this problem like this: instead of http://localhost:5001/v1
I set http://127.0.0.1/v1
and everything worked.
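If that is a name-resolution quirk (e.g. localhost resolving to an IPv6 address the server isn't bound to), both addresses can be probed with a short script. This is only a sketch, assuming the same port 5001 used earlier in the thread and the `requests` library.

# Sketch: probe candidate base URLs to see which one the koboldcpp
# server actually answers on. Port 5001 is assumed from earlier posts.
import requests

CANDIDATES = [
    "http://localhost:5001/v1",
    "http://127.0.0.1:5001/v1",
]

for base in CANDIDATES:
    try:
        resp = requests.get(f"{base}/models", timeout=5)
        print(base, "->", resp.status_code)
    except requests.RequestException as exc:
        print(base, "->", exc)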
Hi, would it be possible to support koboldcpp? It is faster and loads more models than LM Studio, and it has better compatibility with Linux.
In fact, it connects when using the LM Studio option and entering the koboldcpp address, but responses in the chat get truncated at the first or second word, even though the response is fully generated in the koboldcpp terminal.