-
If I have, say, max tokens set to 200, why does the model fill the entire 200 tokens instead of cutting the reply off when it finishes writing an answer? There is an option for single line mode in both koboldcpp…
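For context, whether generation stops early depends on the model emitting its end-of-sequence token or matching a stop sequence; otherwise the sampler keeps producing tokens until the budget is spent. A minimal sketch of the idea using the llama-cpp-python bindings (shown for illustration, not koboldcpp's own API; the model path and prompt are hypothetical):
```
from llama_cpp import Llama

llm = Llama(model_path="./model.gguf")

# max_tokens is only an upper bound: generation ends earlier if the model
# emits its EOS token or produces one of the stop sequences below.
out = llm(
    "Q: What is the capital of France?\nA:",
    max_tokens=200,
    stop=["\n"],  # cut the reply at the first newline, like single line mode
)
print(out["choices"][0]["text"])
```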
-
> **Warning**. Complete **all** the fields below. Otherwise your bug report will be **ignored**!
**Have you searched for similar [bugs](https://github.com/SillyTavern/SillyTavern/issues?q=)?**
Yes…
-
### Feature request
I started this issue in TGI, but it applies to all inference code that has some form of repetition penalty, so I will paste my feature-request notes from there here as well; find the original here:…
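For reference, the repetition penalty most engines implement follows the CTRL-paper scheme, with the positive/negative logit split used by llama.cpp: logits of tokens already present in the context are divided (if positive) or multiplied (if negative) by the penalty factor. A hedged sketch; the function name and NumPy representation are illustrative, not TGI's actual API:
```
import numpy as np

def apply_repetition_penalty(logits, prev_tokens, penalty=1.1):
    # Penalize every token id that already appears in the context.
    logits = logits.copy()
    for tok in set(prev_tokens):
        if logits[tok] > 0:
            logits[tok] /= penalty  # shrink positive logits toward zero
        else:
            logits[tok] *= penalty  # push negative logits further down
    return logits
```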
-
A new format that is intended to be the successor to GGML is nearly ready. Here is a copy of a summary from Reddit:
> * No more breaking changes.
> * Support for non-llama models. (falcon, rwkv, bl…
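As a point of reference, the new format (GGUF) is self-describing, and the gguf Python package published from the llama.cpp repo can inspect such files. A minimal sketch, assuming that package is installed and with a hypothetical file name:
```
from gguf import GGUFReader

reader = GGUFReader("model.gguf")

# Key/value metadata (architecture, tokenizer, hyperparameters, ...)
for name in reader.fields:
    print(name)

# Tensor directory with names and shapes
for tensor in reader.tensors:
    print(tensor.name, tensor.shape)
```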
-
# Prerequisites
Please answer the following questions for yourself before submitting an issue.
- [✅] I am running the latest code. Development is very rapid so there are no tagged versions as of…
-
Hi
Would love to see vLLM supported in a future release.
More here:
https://github.com/vllm-project/vllm
Thanks !
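For a sense of what such an integration would expose, here is a minimal sketch of vLLM's offline inference API (the model name and prompt are just examples):
```
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
params = SamplingParams(temperature=0.8, max_tokens=64)

outputs = llm.generate(["Hello, my name is"], params)
for out in outputs:
    print(out.outputs[0].text)
```
The appeal is the continuous batching and PagedAttention on the serving side, which is what the project link above describes.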
-
# Prerequisites
Please answer the following questions for yourself before submitting an issue.
- [X] I am running the latest code. Development is very rapid so there are no tagged versions as of…
-
Hi, folks. Here's the code I'm using:
```
from ctransformers import AutoModelForCausalLM
llm = AutoModelForCausalLM.from_pretrained('/mnt/mydisk/AI/text/models/llama_cpp/gpt4all-lora-quantized-…
-
I'm frequently getting this error while attempting to use the Linux version of Koboldcpp, cloned an hour or so ago. I can successfully connect to the endpoint and enter a prompt. I can also w…
-
Building the latest commit with CUBLAS seems to succeed, but trying to load a model results in the following error:
```
python koboldcpp.py --gpulayers 32 --model /home/alpha/Storage/AIModel…