Implement optional loading of two models, which is required for speculative decoding.
Also allow switching the active model during normal generation: the client can choose whether the current request should be:
Generated speculatively, using both the main and the draft model;
Generated using only the main (large) model, without drafting;
Generated using only the small (draft) model, as if it were a separate one.
The context cache is independent for those two models while speculative sampling is disabled in Lite, but gets synchronized on the next speculative request.
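To make the per-request choice concrete, here is a rough sketch of what the client side could look like, assuming a hypothetical "speculation" field added to the usual /api/v1/generate payload (the field name and its values are placeholders for this proposal, not an existing API):

```python
import requests

# Hypothetical request to a local koboldcpp instance; "speculation" is a
# placeholder name for the proposed per-request model selector.
payload = {
    "prompt": "Once upon a time",
    "max_length": 64,
    "temperature": 0.0,
    # "default" -> draft with the small model if speculative decoding is enabled
    # "main"    -> generate with the large model only, no drafting
    # "draft"   -> generate with the small model only, as a standalone model
    "speculation": "default",
}
response = requests.post("http://localhost:5001/api/v1/generate", json=payload)
print(response.json())
```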
Background
I had been using the MIQU model (https://huggingface.co/miqudev/miqu-1-70b) for quite a long time, starting many months ago. It is a 70B model and of course it won't fit in my 3060 GPU with 12 GB of VRAM.
That model was better than anything else I had ever tried! I didn't want to run it heavily quantized to 2 bits, because I didn't want to sacrifice its quality, especially since I have 128 GB of DDR4 RAM.
I could get about 1 token/second (or slightly more while the context is short) by activating CuBLAS with 0 offloaded layers.
But later llama.cpp was updated with a new quantization algorithm that hurts performance for older models (such models would need to be requantized, which is not an option for this stolen/unofficial miqu model). Anyway, it was not that bad, and I continued to play with miqu.
But recently the Mistral Large 2 model came out (https://huggingface.co/bartowski/Mistral-Large-Instruct-2407-GGUF), which has 123B parameters!
For me it is superior to miqu at every possible task, and it is even less censored.
Unfortunately, such a huge model runs at 0.3 tokens/second starting from an empty context, and it gets even slower over time…
I tried different ways to speed it up, but CuBLAS with 0 offloaded layers is still the best (and I cannot roll back to an older koboldcpp version because of GGUF format changes, to check whether the previous CUDA kernels might have been faster).
A Q4 quant instead of Q5 gives a slight improvement: 0.4 tokens/second (+0.1 compared to Q5).
After searching for information about which model could serve as a draft model for speculative sampling with Mistral Large 2, I decided to try Mistral 7B Instruct v0.3 (https://huggingface.co/bartowski/Mistral-7B-Instruct-v0.3-GGUF).
Strangely enough, llama.cpp has some redundant vocabulary checks (https://github.com/ggerganov/llama.cpp/blob/f018acba22095b8995bf6c5ef815b16a3ce4cf1b/examples/speculative/speculative.cpp#L119-L136). I had to recompile from source with those asserts commented out to make it accept Mistral 7B Instruct v0.3 (as Q5_K_M) as the draft model for Mistral Large Instruct 2407 (as Q4_K_S). I also had to build with full CUDA support for a fair comparison.
The final speedup was huge! With 3-5 drafted tokens I got roughly double the speed: 0.85-1.0 tokens/second!
I think it is worth having this in koboldcpp as well.
What exactly I propose
You need a separate model loader on a dedicated tab in the config GUI, where the user can set the device and the layer-offload strategy for the draft model. (I don't know the technical details, for example how much of the configuration has to match between the two models for speculative sampling to work; but I imagine you would want independent control over anything that is allowed to differ for the draft model.)
All critical parameters should be read from the main model, such as the context size (at worst the drafting degrades, but it will not break the result, provided speculative sampling is implemented correctly).
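To illustrate, here is a minimal sketch of what the launcher side could look like, mirroring the existing --model/--gpulayers style; the draft-related flag names below are hypothetical, not real koboldcpp options:

```python
import argparse

parser = argparse.ArgumentParser(description="sketch of a dual-model loader")
parser.add_argument("--model", help="main (large) model file")
parser.add_argument("--gpulayers", type=int, default=0,
                    help="layers of the main model to offload to the GPU")
# Hypothetical flags for the proposed draft model: it gets its own device and
# offload strategy, independent of the main model.
parser.add_argument("--draftmodel", help="optional small model used for drafting")
parser.add_argument("--draftgpulayers", type=int, default=0,
                    help="layers of the draft model to offload to the GPU")
parser.add_argument("--draftmax", type=int, default=8,
                    help="upper bound on the number of drafted tokens per step")
args = parser.parse_args()

# Critical parameters such as the context size would still be taken from the
# main model, so a mismatched draft model can only degrade drafting quality,
# never corrupt the final output.
```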
Add a new dropdown to the Lite sampling tab, essentially asking which model to use: default/speculative, large/main, or small/draft. Default means "allow drafting if enabled", while the other options tell koboldcpp not to use speculative decoding for this request.
When the client asks not to use speculation, koboldcpp proceeds with normal generation, but with the chosen model (whether that is the main or the draft one).
It looks like you would need to maintain two independent context caches and synchronize them only when speculative drafting is requested. Meaning, if one "user" generates with only the main model, another user can later generate with the draft model without destroying the first user's context cache.
Default to speculative sampling for unknown clients that do not pass the new field over the API, so that they benefit from the improved speed anyway.
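A rough sketch of the cache bookkeeping this implies (all names are illustrative; nothing here corresponds to actual koboldcpp internals): each model keeps its own cached token prefix, and the draft cache only has to catch up with the main one when a speculative request arrives:

```python
class DualContextCache:
    """Illustrative bookkeeping for two per-model context caches."""

    def __init__(self):
        self.main_tokens = []   # prompt tokens held in the main model's cache
        self.draft_tokens = []  # prompt tokens held in the draft model's cache

    def prepare(self, mode, prompt_tokens):
        """Decide which caches must be (re)filled for this request.

        mode is one of "default" (speculative), "main", "draft"; an unknown
        client that omits the field is treated as "default".
        """
        work = {}
        if mode in ("default", "main"):
            work["main"] = self._reuse(self.main_tokens, prompt_tokens)
            self.main_tokens = list(prompt_tokens)
        if mode in ("default", "draft"):
            # Only a speculative ("default") request forces the draft cache to
            # be synchronized with the same prompt as the main cache.
            work["draft"] = self._reuse(self.draft_tokens, prompt_tokens)
            self.draft_tokens = list(prompt_tokens)
        return work

    @staticmethod
    def _reuse(cached, prompt):
        """How much of the cached prefix survives; the rest is reprocessed."""
        n = 0
        while n < len(cached) and n < len(prompt) and cached[n] == prompt[n]:
            n += 1
        return {"kept": n, "reprocess": len(prompt) - n}
```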
Think of some heuristics for the number of drafted tokens. For example, if the draft model keeps agreeing with the main model for several steps, koboldcpp can increase the draft length (up to a value specified in the server config, defaulting to e.g. 8); and conversely decrease it whenever the last couple of drafted tokens keep getting discarded (down to another specified value, defaulting to the minimal possible 2).
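Something along these lines would already do for the adaptive draft length; the thresholds and bounds are arbitrary placeholders for values that would live in the server config:

```python
def adjust_draft_length(current, accepted, drafted, min_draft=2, max_draft=8):
    """Grow the draft while the small model keeps agreeing with the large one,
    shrink it when drafted tokens keep being thrown away.

    current:  draft length used on the last step
    accepted: how many of those drafted tokens the main model accepted
    drafted:  how many tokens were drafted on that step
    """
    if accepted == drafted:
        # Full agreement: try drafting one more token next time.
        return min(current + 1, max_draft)
    if drafted - accepted >= 2:
        # The last two or more drafted tokens were discarded: back off.
        return max(current - 1, min_draft)
    return current
```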
Try to accept any model as a draft, even if it is not strictly compatible. If current llama.cpp makes it impossible to also use a declared draft model as a normal standalone model, then some upstream changes would be necessary too.
Probably ContextShift would not work; maybe some other features would not work either. Since this mode is completely optional, it won't hurt anyone who doesn't need it.
Sampling
Only greedy sampling (temp=0 or top_k=1) is straightforward to implement for speculative decoding. Though some algorithms exist that allow stochastic sampling over several token probabilities (I'm not quite sure how it is implemented in llama.cpp: do they recursively generate a tree of the most probable tokens? Do they merely approximate the output probabilities, sacrificing fidelity to the main model's actual logits?)
Here I suggest living with whatever is implemented in llama.cpp; a minimal sketch of the greedy variant follows the list below. Even if only greedy sampling worked correctly, this would still be a huge improvement, because:
Large models are very confident in their tokens: the recent update that exposes the logits list shows that Mistral Large (and miqu) returns near-100% probabilities most of the time, even with "sane" sampling parameters (top_p<=0.9, temp<=0.9), so de facto it already tends to behave as if we were sampling greedily. This is easy to confirm by retrying at any point: the model will basically say the same thing each time.
Other than roleplay, sometimes a user might want to "have a local ChatGPT" and ask it questions (rather than "come up with a story"). In those cases it is perfectly normal to sample with zero temperature for tasks like text translation, summarization, source code generation and answering factual questions.
With a runtime setting to switch to the draft model, the user can simply switch and retry several times whenever the story becomes boring! Since the draft model is both fast and smart (it has to be, to be useful as a drafter for the large one), its own answers would not be that bad.
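For reference, here is that minimal sketch of the greedy accept/verify loop the proposal relies on, written against toy stand-in models rather than real llama.cpp calls; a real implementation would score all drafted tokens with the main model in a single batched forward pass instead of one call per token:

```python
def speculative_step(main_next, draft_next, context, n_draft):
    """One greedy speculative decoding step.

    main_next(tokens)  -> the main model's greedy next token for `tokens`
    draft_next(tokens) -> the draft model's greedy next token for `tokens`
    Returns the tokens actually produced by the main model this step, so the
    output is identical to plain greedy decoding with the main model alone.
    """
    # 1. Let the cheap draft model propose n_draft tokens.
    drafted, ctx = [], list(context)
    for _ in range(n_draft):
        t = draft_next(ctx)
        drafted.append(t)
        ctx.append(t)

    # 2. Verify with the main model: keep the longest agreeing prefix, then
    #    append the main model's own token at the first disagreement (or one
    #    bonus token if the whole draft matched).
    accepted, ctx = [], list(context)
    for t in drafted:
        m = main_next(ctx)
        accepted.append(m)
        if m != t:
            return accepted
        ctx.append(m)
    accepted.append(main_next(ctx))
    return accepted
```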
Use cases!
Speculative decoding to improve speed without compromising output quality, provided the user has spare VRAM/RAM. Basically, you add a suitable drafter model and transparently get a noticeable speedup!
Comparison of two unrelated models, with the ability to switch between them as if they were loaded in two instances of koboldcpp, but more conveniently. You write a story and then change the model to see how differently it would continue (especially useful for testing finetuned versions by comparing their logits in the middle of a good story).
Preserving the context cache when running two stories simultaneously in two Lite tabs, again as if they were served by two separate servers.
Performing small tasks like memory summarization or image prompt expansion with the draft model while running a large story (with or without speculation), because the small model should be quick enough to reprocess the whole context, which is not the case for the large model.
I see another improvement that would technically become possible once everything above is implemented: the ability to use two context caches while still running a single model:
The user selects something like "use the draft slot only as a separate context cache" when starting the server.
The speculative mechanism is not instantiated.
Instead, its separate context cache is assigned to the same main model.
In the default drafting mode, koboldcpp behaves as if the user chose to generate with the main model without speculation.
When Lite asks specifically for the draft model, only its context cache slot is used, without destroying the main one.
This would allow running two stories side by side with one model in one instance of koboldcpp, which is cheaper than running two instances on different ports.
If this idea becomes popular, you might generalize it to an arbitrary number of contexts, totally separating this logic from the speculative stuff.
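A tiny sketch of how that degenerate mode could route requests (again, purely illustrative names): both slots are backed by the same loaded model and only the cached prefixes differ:

```python
# Illustrative state when "draft slot as a separate cache" is enabled: both
# slots share one loaded model; only their cached prefixes are independent.
slots = {
    "main":  {"model": "main-model-handle", "cached_tokens": []},
    "draft": {"model": "main-model-handle", "cached_tokens": []},  # same model, second cache
}

def pick_context_slot(mode):
    # No drafting is instantiated: "default" degrades to plain main-model
    # generation, while an explicit "draft" request only touches the second
    # cache slot, leaving the first story's cache intact.
    return slots["draft" if mode == "draft" else "main"]
```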
There are two things that need to be done: loading 2 (or more) models and contexts at the same time, and speculative decoding using 2 models.
If you implement several models, then adding speculative decoding on top becomes rather easy.
Conversely, if you want speculative decoding, you would have to implement loading of several models for it anyway.
Then, once two models can sit in memory, you can imagine something like "model offloading" or "switching on demand", where a model may be unloaded and replaced with another model at runtime.
But those are possible future improvements, while speculative decoding is a useful thing by itself!
Related llama.cpp PRs: https://github.com/ggerganov/llama.cpp/pull/2926 https://github.com/ggerganov/llama.cpp/pull/3624 https://github.com/ggerganov/llama.cpp/pull/5625