Nekotekina opened this issue 5 hours ago
UPD: Actually I "fixed" it by setting the logits argument to false in common_batch_add, but it still seems strange that this has a slowdown effect on the unrelated llama_decode calls that follow.
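For reference, a minimal sketch of that workaround, assuming the batch is filled with llama.cpp's common_batch_add helper; the function and variable names (add_prompt_to_batch, prompt_tokens, n_past) are illustrative, not taken from the original code:

```cpp
// Hedged sketch of the workaround: only the final prompt token requests
// logits, so the output buffer does not need to hold logits for every token.
#include <vector>

#include "common.h"
#include "llama.h"

static void add_prompt_to_batch(llama_batch & batch,
                                const std::vector<llama_token> & prompt_tokens,
                                llama_pos n_past) {
    for (size_t i = 0; i < prompt_tokens.size(); ++i) {
        const bool need_logits = (i + 1 == prompt_tokens.size()); // last token only
        common_batch_add(batch, prompt_tokens[i], n_past + (llama_pos) i, { 0 }, need_logits);
    }
}
```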
That's weird. Enabling logits for all the tokens will cause a reallocation of the output buffer, which uses pinned memory if possible. I wonder if that's the reason.
What happened?
I have the following code (roughly) executed at some point for prompt processing (see the sketch below). Afterwards, llama_decode for token generation becomes significantly slower (roughly 14 t/s versus 36 t/s). However, if this code is replaced by a llama_batch_get_one equivalent, performance remains high. I'm not sure why this happens; maybe I'm using llama_batch incorrectly.
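Since the original snippet is not reproduced here, the following is a hedged reconstruction of the kind of batch-building loop being described, assuming llama.cpp's common helpers; process_prompt, prompt_tokens, and n_past are illustrative names, not from the actual code:

```cpp
// Hypothetical reconstruction of the prompt-processing path described in the
// issue; names are illustrative. The key point is that every prompt token
// requests logits (last argument of common_batch_add is true).
#include <cstdio>
#include <vector>

#include "common.h"
#include "llama.h"

static void process_prompt(llama_context * ctx,
                           const std::vector<llama_token> & prompt_tokens,
                           llama_pos n_past) {
    llama_batch batch = llama_batch_init((int32_t) prompt_tokens.size(), 0, 1);

    common_batch_clear(batch);
    for (size_t i = 0; i < prompt_tokens.size(); ++i) {
        // logits = true for every prompt token
        common_batch_add(batch, prompt_tokens[i], n_past + (llama_pos) i, { 0 }, true);
    }

    if (llama_decode(ctx, batch) != 0) {
        fprintf(stderr, "llama_decode failed during prompt processing\n");
    }

    llama_batch_free(batch);
}
```

By contrast, the llama_batch_get_one equivalent builds the batch from a raw token pointer and count and leaves the per-token logits flags unset, so (as far as I understand) only the last token's logits are produced.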
Name and Version
version: 4083 (09ecbcb596ed8fa97d503d7440f0b3eff872e8f1) built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
What operating system are you seeing the problem on?
Linux
Relevant log output
No response