abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

Speculative sampling #675

Closed andriyanthon closed 9 months ago

andriyanthon commented 1 year ago

llama.cpp added a feature for speculative inference (https://github.com/ggerganov/llama.cpp/pull/2926), but when running llama_cpp.server with the new parameters it reports that it does not recognize them.

There are two new parameters:

  1. -md (model_draft) - the path to the draft model.
  2. --draft (n_draft) - how many tokens to draft each time.

Can this new feature please be supported?
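For reference, here is a hypothetical sketch of how these two options could be exposed as llama_cpp.server settings; the field names model_draft and n_draft just mirror the llama.cpp flags and are not existing server options:

```python
# Hypothetical sketch only: these fields do NOT exist in llama_cpp.server yet.
# The names simply mirror the llama.cpp flags (-md / --draft); the n_draft
# default below is arbitrary.
from typing import Optional

from pydantic import BaseModel, Field


class SpeculativeSettings(BaseModel):
    model_draft: Optional[str] = Field(
        default=None,
        description="Path to the draft model (llama.cpp: -md)",
    )
    n_draft: int = Field(
        default=16,
        description="Number of tokens to draft each step (llama.cpp: --draft)",
    )
```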

abetlen commented 1 year ago

@andriyanthon good idea, I'll take a look into this. I think an API similar to Hugging Face's Assisted Generation would work well.
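For reference, that API looks roughly like this in transformers (the model names below are placeholders):

```python
# Hugging Face assisted generation: the draft ("assistant") model is passed
# directly to generate(). Model names are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("target-model")
target = AutoModelForCausalLM.from_pretrained("target-model")
assistant = AutoModelForCausalLM.from_pretrained("small-draft-model")

inputs = tokenizer("The quick brown fox", return_tensors="pt")
output_ids = target.generate(**inputs, assistant_model=assistant, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```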

Chainfire commented 1 year ago

+1. Would probably double performance in my setup.

galatolofederico commented 1 year ago

+1. It would be very useful.

gssci commented 1 year ago

Any updates on this?

LynxPDA commented 1 year ago

+1

The -ngld parameter was also added, which sets how many layers of the draft model to offload to VRAM.

On my hardware, speculative decoding sped up phind-codellama-34b-v2.Q4_K_M.gguf (with codellama-7b-instruct.Q3_K_M.gguf as the draft model).

An example of the full CLI invocation in llama.cpp and my results are below (note the -ngld 35 -ngl 15 offload split):

./speculative -m ./models/phind-codellama-34b-v2.Q4_K_M.gguf -md ./models/codellama-7b-instruct.Q3_K_M.gguf -p "// Quick-sort implementation in C (4 spaces indentation + detailed comments) and sample usage:\n\n#include" -e -t 16 -n 256 -c 2048 -s 8 --draft 15 -b 512 -ngld 35 -ngl 15

And the results:

encoded   25 tokens in    1.490 seconds, speed:   16.780 t/s
decoded  280 tokens in   29.131 seconds, speed:    9.612 t/s

n_draft   = 27
n_predict = 280
n_drafted = 334
n_accept  = 245
accept    = 73.353%
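The accept figure is simply n_accept / n_drafted, i.e. the share of drafted tokens that the target model kept:

```python
# Quick check of the acceptance counters reported above.
n_drafted = 334  # tokens proposed by the draft model
n_accept = 245   # drafted tokens accepted by the target model
print(f"accept = {n_accept / n_drafted:.3%}")  # accept = 73.353%
```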

draft:

llama_print_timings: load time =  538.74 ms
llama_print_timings: sample time =   522.77 ms / 1 runs (  522.77 ms per token, 1.91 tokens per second)
llama_print_timings: prompt eval time =   171.93 ms / 25 tokens  ( 6.88 ms per token, 145.41 tokens per second)
llama_print_timings: eval time =  7165.94 ms / 360 runs ( 19.91 ms per token, 50.24 tokens per second)
llama_print_timings:  total time = 30621.07 ms

target:

llama_print_timings: load time =  1015.12 ms
llama_print_timings: sample time =    91.54 ms / 280 runs ( 0.33 ms per token, 3058.81 tokens per second)
llama_print_timings: prompt eval time = 20972.04 ms / 386 tokens  ( 54.33 ms per token, 18.41 tokens per second)
llama_print_timings: eval time =  1603.66 ms / 7 runs  ( 229.09 ms per token, 4.37 tokens per second)
llama_print_timings: total time = 31163.91 ms

oobabooga commented 11 months ago

I have made some speculative decoding tests on my RTX 3090, using wizardlm-70b-v1.0.Q4_K_S.gguf as the target model and tinyllama-1.1b-chat-v0.3.Q4_K_M.gguf as the draft.

With speculative decoding I get 3.41 tokens/second, while without it I get 2.08 tokens/second. That's a +64% increase.

This is the command that I used:

./speculative \
  -m ../models/wizardlm-70b-v1.0.Q4_K_S.gguf \
  -md ../models/tinyllama-1.1b-chat-v0.3.Q4_K_M.gguf \
  -e \
  -t 6 \
  -tb 12 \
  -n 256 \
  -c 4096 \
  --draft 15 \
  -ngld 128 \
  -ngl 42 \
  -p "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.\n\nUSER: Give me an example of Python script.\nASSISTANT:"

Having this feature available in llama-cpp-python would be amazing.
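For clarity, the +64% figure is just the ratio of the two measured throughputs:

```python
# The quoted +64% is the ratio of the measured decode throughputs above.
tps_speculative = 3.41  # tokens/second with speculative decoding
tps_baseline = 2.08     # tokens/second without it
print(f"+{tps_speculative / tps_baseline - 1:.0%}")  # +64%
```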

rangehow commented 10 months ago

This feature looks so cool :) Looking forward to it!

abetlen commented 9 months ago

#1120 is almost ready, need to do some more testing and perf benchmarks but it works now with prompt lookup decoding.
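Rough usage on the PR branch looks like this; treat the class and argument names as provisional until it is merged:

```python
# Sketch of speculative decoding via prompt lookup on the #1120 branch.
# Names may still change before merge.
from llama_cpp import Llama
from llama_cpp.llama_speculative import LlamaPromptLookupDecoding

llm = Llama(
    model_path="models/phind-codellama-34b-v2.Q4_K_M.gguf",  # any GGUF model
    draft_model=LlamaPromptLookupDecoding(num_pred_tokens=10),
    n_gpu_layers=-1,
)

out = llm("// Quick-sort implementation in C\n#include", max_tokens=128)
print(out["choices"][0]["text"])
```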

Andy1314Chen commented 8 months ago

> #1120 is almost ready, need to do some more testing and perf benchmarks but it works now with prompt lookup decoding.

This feature looks so cool! How can we make it support more speculative decoding methods, not just prompt lookup decoding?
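For example, if the draft-model hook from #1120 is a plain callable interface, other strategies (a small GGUF draft model, n-gram lookup, etc.) could be plugged in like the sketch below; the base class name and call signature are guesses based on the prompt lookup implementation and may not match the final API:

```python
# Purely illustrative: a custom draft strategy reusing the draft-model hook.
# LlamaDraftModel and the __call__ signature are assumptions based on the
# prompt lookup decoding implementation in #1120.
import numpy as np
import numpy.typing as npt

from llama_cpp.llama_speculative import LlamaDraftModel


class RepeatLastTokensDraft(LlamaDraftModel):
    """Trivial stand-in strategy: re-propose the most recent tokens.

    A real alternative would run a small GGUF draft model here instead.
    """

    def __init__(self, num_pred_tokens: int = 10):
        self.num_pred_tokens = num_pred_tokens

    def __call__(self, input_ids: npt.NDArray[np.intc], /, **kwargs) -> npt.NDArray[np.intc]:
        # Return up to num_pred_tokens candidate ids for the target model to verify.
        return input_ids[-self.num_pred_tokens:].astype(np.intc)
```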