MaggotHATE opened this issue 3 weeks ago
Author of #8526 here.
> Why can this happen?
Basically it should not happen if it worked before.
It's possible the internal changes in batch splits caused some external changes of behavior, but I tried to keep the external behavior identical. I'm curious about what exactly is causing the problem you've noticed.
> What can I do to fix it?
My guess is that you need to change the default to `-1` here: https://github.com/MaggotHATE/Llama_chat/blob/b9a9c2775d61824a8225284f700029dd81390a63/base/sampling.h#L145 (otherwise this diverges from upstream llama.cpp).
If that solves it, good! Not sure why it worked before, though.
If the problem is still there, try to find other divergences from upstream llama.cpp (apart from the extra samplers) and correct them where relevant.
If that still doesn't fix it, the easiest way to debug this is to get a backtrace (with line numbers) from where the problem is detected. If you compile llama.cpp without `NDEBUG`, there should be a `GGML_ASSERT` triggered in `llama_get_logits_ith`, which should create a backtrace if you have `gdb` or `lldb` installed.
Then you should be able to guess which batch (from which part of the code) might have caused this, and then I'd like to know:

- which index is passed to `llama_get_logits_ith`?
- was `llama_batch_get_one` used to create the problematic batch? (in which case the default of `-1` should fix the problem)
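To illustrate the check involved, here is a hypothetical, self-contained mock of the index resolution that `llama_get_logits_ith` performs (names and layout are simplified for illustration; this is not the actual llama.cpp code). It shows why an explicit default of `0` trips the `batch.logits[0] != true` assertion for a typical generation batch, while `-1` resolves to the last token with logits enabled:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Simplified mock: logits_flags[i] mirrors batch.logits[i], i.e. whether
// logits were requested for token i of the last decoded batch.
// Returns the resolved output index, mimicking the upstream check.
int32_t resolve_logits_index(const std::vector<int8_t> & logits_flags, int32_t idx) {
    if (idx < 0) {
        // -1 means "the last token that has logits enabled"
        for (int32_t i = (int32_t) logits_flags.size() - 1; i >= 0; --i) {
            if (logits_flags[i]) {
                return i;
            }
        }
        assert(false && "no token in the batch has logits enabled");
        return -1;
    }
    // an explicit index is only valid if logits were requested for it
    assert(logits_flags[idx] && "invalid logits id, reason: batch.logits[idx] != true");
    return idx;
}
```

With a generation batch whose flags are `{0, 0, 0, 1}` (only the final token requests logits), passing `-1` resolves to index `3`, while passing `0` would abort with exactly the kind of message reported in this issue.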
Hi, @compilade! Thank you for the answer! Changing the default idx to `-1` helps; the error no longer occurs. I guess I will have to update my `sampling.h/cpp` soon anyway, especially since the sampling refactoring PR is ready.
I looked at the change, and it was introduced in https://github.com/ggerganov/llama.cpp/commit/e3c337d87ca650972105a51c6ce302dd236c07ad (5 months ago!), which I probably missed. It looks like the feature was unused before your PR.
Was it supposed to affect all model architectures, though? From your PR it reads like a big change, but the name of that commit overly simplifies the changes.
I also see a significant decrease in prompt processing speed for relatively long prompts (400+ characters) after merging `ggml.h/.cpp` and `llama.h/.cpp`. I had to use OpenMP again to fix it. Just to clarify, can the new batch splits affect prompt processing? I didn't test long prompts back then because of the runtime error, so I'm not sure if it was there too.
> Thank you for the answer! Changing the default idx to `-1` helps; the error no longer occurs.
That is very good to know!
> It looks like the feature was unused before your PR.
That's weird, because I don't really see how the batch splits relate to negative indices in `llama_get_logits_ith`. Like I said, I don't know why it worked for you for a while with the default `idx` at `0` even when the upstream code assumes `-1`. It should have failed months ago.
> Just to clarify, can new batch splits affect prompt processing?
It should not make it slower. For Transformer-based models it avoids copies and uses views into the logical batch to split it into physical batches.
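As a rough, hypothetical sketch of that "views instead of copies" idea (simplified, not the actual llama.cpp structures): a physical batch can simply point into the logical batch's contiguous storage, so splitting costs a few pointers rather than a copy of the token data.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch: a "physical" sub-batch that is just a view into the
// contiguous storage of the logical batch -- no token data is copied.
struct TokenView {
    const int32_t * tokens; // points into the logical batch's buffer
    size_t          n;      // number of tokens in this physical batch
};

// Split a logical batch into physical batches of at most n_ubatch tokens.
std::vector<TokenView> split_batch(const std::vector<int32_t> & logical, size_t n_ubatch) {
    std::vector<TokenView> views;
    for (size_t i = 0; i < logical.size(); i += n_ubatch) {
        const size_t n = std::min(n_ubatch, logical.size() - i);
        views.push_back({logical.data() + i, n});
    }
    return views;
}
```

For a 5-token logical batch split with `n_ubatch = 2`, this produces three views of sizes 2, 2, and 1, all aliasing the original buffer.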
From some tests with extremely small models (with 50k parameters), which make most overhead very measurable, prompt processing did not slow down for heterogeneous prompts like in the HellaSwag benchmark. At least not after I made it avoid copies for Transformer-based models (it was already avoiding copies before).
Basically that change was mostly internal refactoring to allow more elaborate batch splits for recurrent models. It also improves how the legacy batch API is handled (e.g. `llama_batch_get_one`) by re-using the same buffers instead of re-allocating new ones at each `llama_decode`. But the behavior should be the same.
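The buffer-reuse idea can be sketched as follows (again a hypothetical simplification, not llama.cpp's actual implementation): keep one persistent scratch buffer that only grows, and refill it on each decode call instead of allocating a fresh batch every time.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch: a scratch batch whose capacity persists across
// decode calls, so repeated calls with similar sizes allocate only once.
struct BatchScratch {
    std::vector<int32_t> tokens;
    size_t               n_used = 0;

    // Re-fill the buffer for the next decode; grows it only when needed.
    void prepare(const int32_t * src, size_t n) {
        if (tokens.size() < n) {
            tokens.resize(n); // reallocation happens only on growth
        }
        std::copy(src, src + n, tokens.begin());
        n_used = n;
    }
};
```

After the first call, subsequent calls with the same or smaller token counts reuse the existing allocation, which is the kind of churn the refactoring avoids at each `llama_decode`.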
Interesting. To clarify, I currently test it on CPU only, compiled with OpenBLAS. No such issues happened previously, but I disabled OpenMP after the Threadpool 2 commit due to slightly slower prompt processing and inference. Now it seems to be required to get adequate processing speeds.
I will keep testing it, maybe the prompt processing issue will fix itself after I update sampling completely. In any case, thanks again for pointing out the root cause.
What happened?
Today I merged latest changes from llama.cpp to my app and suddenly got this error instead of inference:
`llama_get_logits_ith: invalid logits id 0, reason: batch.logits[0] != true`
After checking multiple past commits I figured out that it was https://github.com/ggerganov/llama.cpp/commit/a1631e53f6763e17da522ba219b030d8932900bd - however, I don't use Mamba models, which is also why I didn't merge this commit earlier.
This issue happens with Nemo and Llama3 (and probably all other models). Previous commits work fine. Why can this happen? What can I do to fix it?
Name and Version
b3614
What operating system are you seeing the problem on?
Windows
Relevant log output
No response