ggerganov / llama.cpp

LLM inference in C/C++

`llama_decode` is significantly slower if `n_tokens > 1` #4624

Closed: apoorvumang closed this issue 5 months ago

apoorvumang commented 8 months ago

Issue

It is expected that `llama_decode` takes more time when more tokens are present in the batch, but on my system (Apple M1 Max, 32 GB) with the mistral-7b-instruct-v0.2.Q4_0.gguf model the increase is quite significant. I plotted some average latencies on my system for different `n_tokens`, using a modified version of `speculative` and putting timing around the `llama_decode(ctx_tgt, batch_tgt);` call:

[Plot: average `llama_decode` latency vs. `n_tokens`]
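
For reference, this is roughly the kind of timing wrapper used around the decode call (a simplified sketch, not the exact code; `ctx_tgt` and `batch_tgt` are set up as in the `speculative` example, and the print format is illustrative):

```cpp
// Simplified sketch: measure the wall-clock latency of a single llama_decode call.
#include "llama.h"

#include <chrono>
#include <cstdio>

static int timed_decode(llama_context * ctx_tgt, llama_batch & batch_tgt) {
    const auto t_start = std::chrono::high_resolution_clock::now();

    const int ret = llama_decode(ctx_tgt, batch_tgt);

    const auto t_end = std::chrono::high_resolution_clock::now();
    const double t_ms = std::chrono::duration<double, std::milli>(t_end - t_start).count();

    fprintf(stderr, "llama_decode: n_tokens = %d, ret = %d, %.2f ms\n", batch_tgt.n_tokens, ret, t_ms);

    return ret;
}
```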

There is a more than 5x jump in the latency of `llama_decode` when `n_tokens` goes from 1 to 2 (which I feel is too high), but only a very gradual increase after that. This means that techniques like speculative and lookup decoding cannot give speed benefits for small draft sizes (`n_draft` < 5) even if the drafts are 100% correct, since autoregressively decoding 5 tokens one at a time is just as fast as decoding 5 tokens at once, so the advantage of speculation is lost.

I'm not sure whether this counts as a bug or expected behaviour, but the stark difference in latency between 1-token and 2-token decoding seems weird to me. Decoding 2 tokens should take at most 2x the time, not 5x.
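
A rough back-of-the-envelope with the timings reported below (~16.5 ms for a single-token decode, ~85 ms for a 2-token batch on this machine):

$$
2 \times 16.5\,\text{ms} \approx 33\,\text{ms} \;\;\text{(two sequential 1-token decodes)} \qquad \text{vs.} \qquad {\approx}85\,\text{ms} \;\;\text{(one 2-token batch)}
$$

So with 100% acceptance a draft of 1 token actually loses time, and since the per-batch latency grows only slowly beyond 2 tokens, batched decoding only breaks even with sequential single-token decoding at around 5 tokens.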

To reproduce:

The easiest way to see this is to run `main` with a one-word prompt. The prompt eval time will be the time taken for the few prompt tokens, and the eval time will show the throughput for the rest of the tokens. For example, `./main -m models/7B/mistral-7b-instruct-v0.2.Q4_0.gguf -p "A" -n 100 -e` gives me:

llama_print_timings:        load time =     385.80 ms
llama_print_timings:      sample time =       8.03 ms /   100 runs   (    0.08 ms per token, 12451.75 tokens per second)
llama_print_timings: prompt eval time =      85.81 ms /     2 tokens (   42.90 ms per token,    23.31 tokens per second)
llama_print_timings:        eval time =    1637.12 ms /    99 runs   (   16.54 ms per token,    60.47 tokens per second)
llama_print_timings:       total time =    1744.09 ms

which shows ~85 ms for the initial forward pass with just 2 tokens, and ~16.5 ms per token for all other tokens.

To see this effect in `speculative`, one can compare `--draft 0` with `--draft 1`. Use the same model as both the draft model and the main model to ensure 100% acceptance. On my system, `--draft 0` gave better target-model timings than `--draft 1`, which shouldn't really happen IMO.

draft = 0 command:

./speculative \
    -m models/7B/mistral-7b-instruct-v0.2.Q4_0.gguf -md models/7B/mistral-7b-instruct-v0.2.Q4_0.gguf \
    -p "A" \
    -e -ngl 1 -t 4 -n 100 -c 4096 -b 4096 -s 20 --draft 0 -np 1 --temp 0.0 --verbose-prompt --color

Timings:

n_draft   = 0
n_predict = 101
n_drafted = 0
n_accept  = 0
accept    = nan%

draft:

llama_print_timings:        load time =     982.45 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =      85.60 ms /     2 tokens (   42.80 ms per token,    23.36 tokens per second)
llama_print_timings:        eval time =    1653.63 ms /   101 runs   (   16.37 ms per token,    61.08 tokens per second)
llama_print_timings:       total time =    3453.52 ms

target:

llama_print_timings:        load time =     479.45 ms
llama_print_timings:      sample time =      17.57 ms /   101 runs   (    0.17 ms per token,  5750.07 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_print_timings:        eval time =    1676.51 ms /   102 runs   (   16.44 ms per token,    60.84 tokens per second)
llama_print_timings:       total time =    4460.08 ms

draft = 1 command:

./speculative \
    -m models/7B/mistral-7b-instruct-v0.2.Q4_0.gguf -md models/7B/mistral-7b-instruct-v0.2.Q4_0.gguf \
    -p "A" \
    -e -ngl 1 -t 4 -n 100 -c 4096 -b 4096 -s 20 --draft 1 -np 1 --temp 0.0 --verbose-prompt --color

Timings:

n_draft   = 1
n_predict = 102
n_drafted = 36
n_accept  = 36
accept    = 100.000%

draft:

llama_print_timings:        load time =     960.89 ms
llama_print_timings:      sample time =     124.45 ms /     1 runs   (  124.45 ms per token,     8.04 tokens per second)
llama_print_timings: prompt eval time =      85.81 ms /     2 tokens (   42.91 ms per token,    23.31 tokens per second)
llama_print_timings:        eval time =    1701.90 ms /   102 runs   (   16.69 ms per token,    59.93 tokens per second)
llama_print_timings:       total time =    5584.70 ms

target:

llama_print_timings:        load time =     431.73 ms
llama_print_timings:      sample time =      19.67 ms /   102 runs   (    0.19 ms per token,  5184.77 tokens per second)
llama_print_timings: prompt eval time =    3076.34 ms /    72 tokens (   42.73 ms per token,    23.40 tokens per second)
llama_print_timings:        eval time =     520.40 ms /    31 runs   (   16.79 ms per token,    59.57 tokens per second)
llama_print_timings:       total time =    6569.38 ms

So with `--draft 1` the target model is much slower, taking 6.5 s compared to 4.4 s with `--draft 0`, which is weird.

slaren commented 8 months ago

M3 Max:

./llama-bench -m models/mistral-q4_0.gguf -p 1,2,3,4,5,6,7,8,10,11,12,13,14,15,16,32 -n 0
| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal | 99 | pp 1 | 63.62 ± 0.54 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal | 99 | pp 2 | 25.92 ± 0.09 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal | 99 | pp 3 | 38.87 ± 0.15 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal | 99 | pp 4 | 51.93 ± 0.06 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal | 99 | pp 5 | 64.29 ± 0.11 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal | 99 | pp 6 | 77.40 ± 0.31 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal | 99 | pp 7 | 89.91 ± 0.14 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal | 99 | pp 8 | 102.94 ± 0.26 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal | 99 | pp 10 | 127.68 ± 0.30 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal | 99 | pp 11 | 141.16 ± 0.22 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal | 99 | pp 12 | 155.10 ± 0.26 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal | 99 | pp 13 | 166.40 ± 1.12 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal | 99 | pp 14 | 179.20 ± 0.48 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal | 99 | pp 15 | 191.79 ± 0.71 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal | 99 | pp 16 | 204.70 ± 0.76 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal | 99 | pp 32 | 398.31 ± 0.52 |

build: b9f4795 (1699)

apoorvumang commented 8 months ago

Thanks for testing @slaren. I converted your t/s numbers into per-call latency, and the pattern looks very similar (the conversion is the number of tokens divided by t/s):

[Plot: per-call latency computed from the t/s numbers above]
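
For example, for the first two rows of the table above (latency = number of tokens / t/s):

$$
\text{pp 1: } \frac{1}{63.62} \approx 15.7\,\text{ms/call} \qquad \text{pp 2: } \frac{2}{25.92} \approx 77.2\,\text{ms/call}
$$

which is the same ~5x jump from 1 to 2 tokens, showing up on the M3 Max as well.
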
kalomaze commented 8 months ago

@ggerganov Is the speculation happening on the CPU / not parallelized? I hear reports that speculation works great for exllama2, but it seems to provide no tangible gains on 70B for llama.cpp. I see significant value in speculation as an inference-time optimization, but unfortunately it seems to provide no practical benefit for most users because of the massive latency overhead.

ggerganov commented 8 months ago

@apoorvumang Small-batch decoding with Metal needs some manual adjustments to get the best performance:

https://github.com/ggerganov/llama.cpp/pull/3524#issuecomment-1751749302

I've provided an example for M2 Ultra, but I still don't know how to make this more generic:

https://github.com/ggerganov/llama.cpp/blob/120a1a55154332921c2cab509b282689f940bb93/ggml-metal.m#L1493-L1518
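
Conceptually, the linked code decides, per matrix multiplication, whether to dispatch a matrix-vector or a matrix-matrix Metal kernel based on a per-quantization-type batch-size threshold. A simplified sketch of the idea (not the actual ggml-metal.m source; the threshold value below is a placeholder, the tuned ones are in the link above):

```cpp
// Simplified sketch of the dispatch idea (not the actual ggml-metal.m source):
// small batches go through the matrix-vector (mul_mv) kernels, larger batches
// through the matrix-matrix (mul_mm) kernels, with the cutoff depending on the
// quantization type of the weight matrix.
#include "ggml.h"

static bool use_mat_mat_kernel(int ne11 /* batch size */, ggml_type src0_type) {
    int ne11_mm_min = 1; // minimum batch size for the matrix-matrix path

    switch (src0_type) {
        case GGML_TYPE_Q4_0: ne11_mm_min = 7; break; // placeholder threshold; see link for tuned values
        // ... per-type thresholds for the other quantizations ...
        default:             ne11_mm_min = 1; break;
    }

    return ne11 >= ne11_mm_min;
}
```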

@kalomaze No, the speculative decoding examples utilize the GPU when available and work as expected once you take into account the batched decoding speed.

apoorvumang commented 8 months ago

Thanks @ggerganov! I will try to read and understand the Metal implementation, and the speed tradeoffs on my system (M1 Max, 32 GB).

github-actions[bot] commented 6 months ago

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot] commented 5 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.