andriyanthon closed this issue 9 months ago
@andriyanthon good idea, I'll look into this. I think a similar API to Hugging Face's Assisted Generation would work well.
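For reference, Hugging Face's Assisted Generation is driven by the `assistant_model` argument to `generate`; something along these lines (model names here are only illustrative, chosen to mirror the models discussed below):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Large target model and small draft ("assistant") model; names are illustrative.
tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-34b-hf")
target = AutoModelForCausalLM.from_pretrained("codellama/CodeLlama-34b-hf", device_map="auto")
draft = AutoModelForCausalLM.from_pretrained("codellama/CodeLlama-7b-hf", device_map="auto")

inputs = tokenizer("// Quick-sort implementation in C\n#include", return_tensors="pt").to(target.device)
# Passing assistant_model enables assisted (speculative) generation:
# the draft model proposes tokens, the target model verifies them in one forward pass.
outputs = target.generate(**inputs, assistant_model=draft, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```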
+1. Would probably double performance in my setup.
+1. It would be very useful
Any updates on this?
+1
Also added the -ngld parameter, which specifies how many layers of the draft model to offload to VRAM.
On my hardware, the speedup was obtained with phind-codellama-34b-v2.Q4_K_M.gguf as the target model.
An example of the full llama.cpp CLI invocation and my results are below:
./speculative -m ./models/phind-codellama-34b-v2.Q4_K_M.gguf -md ./models/codellama-7b-instruct.Q3_K_M.gguf -p "// Quick-sort implementation in C (4 spaces indentation + detailed comments) and sample usage:\n\n#include" -e -t 16 -n 256 -c 2048 -s 8 --draft 15 -b 512 -ngld 35 -ngl 15
And the results:
encoded 25 tokens in 1.490 seconds, speed: 16.780 t/s
decoded 280 tokens in 29.131 seconds, speed: 9.612 t/s
n_draft = 27
n_predict = 280
n_drafted = 334
n_accept = 245
accept = 73.353%
draft:
llama_print_timings: load time = 538.74 ms
llama_print_timings: sample time = 522.77 ms / 1 runs ( 522.77 ms per token, 1.91 tokens per second)
llama_print_timings: prompt eval time = 171.93 ms / 25 tokens ( 6.88 ms per token, 145.41 tokens per second)
llama_print_timings: eval time = 7165.94 ms / 360 runs ( 19.91 ms per token, 50.24 tokens per second)
llama_print_timings: total time = 30621.07 ms
target:
llama_print_timings: load time = 1015.12 ms
llama_print_timings: sample time = 91.54 ms / 280 runs ( 0.33 ms per token, 3058.81 tokens per second)
llama_print_timings: prompt eval time = 20972.04 ms / 386 tokens ( 54.33 ms per token, 18.41 tokens per second)
llama_print_timings: eval time = 1603.66 ms / 7 runs ( 229.09 ms per token, 4.37 tokens per second)
llama_print_timings: total time = 31163.91 ms
I have made some speculative decoding tests with the following models on my RTX 3090:
wizardlm-70b-v1.0.Q4_K_S.gguf
tinyllama-1.1b-chat-v0.3.Q4_K_M.gguf
With speculative, I get 3.41 tokens/second, while without it I get 2.08 tokens/second. That's a +64% increase.
This is the command that I used:
./speculative \
-m ../models/wizardlm-70b-v1.0.Q4_K_S.gguf \
-md ../models/tinyllama-1.1b-chat-v0.3.Q4_K_M.gguf \
-e \
-t 6 \
-tb 12 \
-n 256 \
-c 4096 \
--draft 15 \
-ngld 128 \
-ngl 42 \
-p "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.\n\nUSER: Give me an example of Python script.\nASSISTANT:"
Having this feature available in llama-cpp-python would be amazing.
This feature looks so cool :) Looking forward to it!
#1120 is almost ready; I need to do some more testing and perf benchmarks, but it works now with prompt lookup decoding.
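If the PR lands with an interface along these lines, usage would look roughly like this (a minimal sketch; the module and parameter names are my assumption from the PR description and may change before merge):

```python
from llama_cpp import Llama
from llama_cpp.llama_speculative import LlamaPromptLookupDecoding

llm = Llama(
    model_path="./models/phind-codellama-34b-v2.Q4_K_M.gguf",
    n_gpu_layers=15,
    # Prompt lookup decoding proposes draft tokens by matching n-grams already
    # present in the context, so no second model needs to be loaded.
    draft_model=LlamaPromptLookupDecoding(num_pred_tokens=10),
)

out = llm("// Quick-sort implementation in C (4 spaces indentation + detailed comments):\n\n#include", max_tokens=256)
print(out["choices"][0]["text"])
```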
This feature looks so cool! How can we make it support more forms of speculative decoding, not just prompt lookup decoding?
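One possible direction, sketched under the assumption that the draft-model hook is simply a callable mapping the current token ids to proposed draft token ids (the same shape prompt lookup decoding plugs into; the class and hook names below are assumptions, not the library's confirmed API), is to wrap a small GGUF model as the draft provider:

```python
import numpy as np
from llama_cpp import Llama

class SmallModelDraft:
    """Propose draft tokens with a small GGUF model (e.g. a 1B/7B quant).

    Assumes the speculative hook is a callable: token ids in -> draft token ids out.
    """

    def __init__(self, model_path: str, num_pred_tokens: int = 10, n_gpu_layers: int = -1):
        self.draft = Llama(model_path=model_path, n_gpu_layers=n_gpu_layers, verbose=False)
        self.num_pred_tokens = num_pred_tokens

    def __call__(self, input_ids: np.ndarray, /, **kwargs) -> np.ndarray:
        # Naive version: re-evaluate the whole context each call, then greedily
        # sample num_pred_tokens continuations from the draft model.
        self.draft.reset()
        self.draft.eval(input_ids.tolist())
        drafted = []
        for _ in range(self.num_pred_tokens):
            tok = self.draft.sample(temp=0.0)  # greedy drafting
            drafted.append(tok)
            self.draft.eval([tok])
        return np.asarray(drafted, dtype=np.intc)
```

A real implementation would keep the draft model's KV cache between calls instead of re-evaluating the whole context, and the draft and target models must share a tokenizer/vocabulary for the proposed token ids to be meaningful.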
llama.cpp added a feature for speculative inference: https://github.com/ggerganov/llama.cpp/pull/2926, but when running llama_cpp.server, it says it does not recognize the new parameters.
There are two new parameters:
Can this new feature please be supported?