ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Bug: Using llama_batch_init+add+free instead of llama_batch_get_one() permanently slows down llama_decode significantly #10322

Open Nekotekina opened 5 hours ago

Nekotekina commented 5 hours ago

What happened?

I have the following code (roughly) executed at some point for prompt processing:

[image: screenshot of the prompt-processing code]

Afterwards, llama_decode for token generation becomes significantly slower (roughly 14 t/s versus 36 t/s). However, if this code is replaced by the llama_batch_get_one equivalent, performance remains high. I'm not sure why this happens; maybe I am using llama_batch incorrectly.
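For readers without the screenshot, here is a minimal sketch of what such a loop typically looks like, assuming the common_batch_add helper from common/common.h and placeholder names (ctx, prompt_tokens); this is a reconstruction, not the reporter's exact code:

```cpp
// Build a batch manually instead of using llama_batch_get_one().
llama_batch batch = llama_batch_init((int32_t) prompt_tokens.size(), 0, 1);

for (size_t i = 0; i < prompt_tokens.size(); ++i) {
    // logits = true for every token -- this is the detail that later
    // turns out to matter (see the follow-up comments below).
    common_batch_add(batch, prompt_tokens[i], (llama_pos) i, { 0 }, true);
}

if (llama_decode(ctx, batch) != 0) {
    // handle the error
}

llama_batch_free(batch);
```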

Name and Version

~ 4083 (09ecbcb596ed8fa97d503d7440f0b3eff872e8f1) built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output

No response

Nekotekina commented 5 hours ago

UPD: I actually "fixed" it by setting the logits argument to false in common_batch_add, but it still seems strange that this has a slowdown effect on an unrelated llama_decode call.
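For reference, a minimal sketch of that workaround, assuming the usual pattern of requesting logits only for the final prompt token so sampling still works (placeholder names as above; not the exact change):

```cpp
for (size_t i = 0; i < prompt_tokens.size(); ++i) {
    // Request logits only where they are actually consumed: the last token.
    const bool need_logits = (i + 1 == prompt_tokens.size());
    common_batch_add(batch, prompt_tokens[i], (llama_pos) i, { 0 }, need_logits);
}
```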

slaren commented 51 minutes ago

That's weird. Enabling logits for all the tokens will cause a reallocation of the output buffer, which uses pinned memory if possible. I wonder if that's the reason.
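A back-of-the-envelope comparison of the output-buffer sizes involved (a rough sketch with assumed numbers, not llama.cpp's actual allocation code):

```cpp
// With logits enabled for every token, the output buffer must hold
// n_tokens * n_vocab floats; with logits only on the last token it holds
// just n_vocab floats. Example: a 512-token prompt and a 32000-entry vocab.
const size_t n_tokens = 512, n_vocab = 32000;
const size_t all_logits  = n_tokens * n_vocab * sizeof(float); // ~62.5 MiB
const size_t last_logits =            n_vocab * sizeof(float); // ~125 KiB
```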