ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Bug: runtime error in `llama_get_logits_ith` after `simplify Mamba with advanced batch splits` commit. #9224

Open MaggotHATE opened 3 weeks ago

MaggotHATE commented 3 weeks ago

What happened?

Today I merged latest changes from llama.cpp to my app and suddenly got this error instead of inference:

llama_get_logits_ith: invalid logits id 0, reason: batch.logits[0] != true

After checking multiple past commits I figured out that it was https://github.com/ggerganov/llama.cpp/commit/a1631e53f6763e17da522ba219b030d8932900bd - however, I don't use Mamba models, which is also why I hadn't merged this commit earlier.

This issue happens with Nemo and Llama3 (and probably all other models). Previous commits work fine. Why can this happen? What can I do to fix it?

Name and Version

b3614

What operating system are you seeing the problem on?

Windows

Relevant log output

No response

compilade commented 2 weeks ago

Author of #8526 here.

Why can this happen?

Basically it should not happen if it worked before.

It's possible the internal changes to batch splits caused some external change in behavior, but I tried to keep the external behavior identical. I'm curious about what exactly is causing the problem you've noticed.

What can I do to fix it?

My guess is you need to change the default to -1 here: https://github.com/MaggotHATE/Llama_chat/blob/b9a9c2775d61824a8225284f700029dd81390a63/base/sampling.h#L145 (otherwise this diverges from upstream llama.cpp)
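For reference, a minimal sketch of how that index is interpreted, assuming `ctx` is a valid llama_context and the previous llama_decode succeeded (the -1 convention is what the upstream sampling code relies on):

// A non-negative idx refers to a token position in the last decoded batch and is
// only valid if that token was flagged to output logits (batch.logits[idx] == true);
// otherwise the call fails with exactly the "invalid logits id" error above.
// A negative idx counts from the end of the outputs, so -1 is the last output.
float * logits = llama_get_logits_ith(ctx, -1);  // logits of the last processed token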

If that solves it, good! Not sure why it worked before, though.

If the problem is still there, try to find other divergences from upstream llama.cpp which shouldn't be different (apart from the extra samplers), and then correct them if relevant.

If that still doesn't fix it, the easiest way to debug this is to get a backtrace (with line numbers) from where the problem is detected. If you compile llama.cpp without NDEBUG, then there should be a GGML_ASSERT triggered in llama_get_logits_ith which should create a backtrace if you have gdb or lldb installed.

Then you should be able to guess which batch (from which part of the code) might have caused this and then I'd like to know:

  1. What is calling llama_get_logits_ith?
  2. Is llama_batch_get_one used to create the problematic batch?
  3. Was there more than one token in the batch? (if so, the suggested change of using -1 should fix the problem; see the sketch after this list)
    • why did it work before? No idea.
  4. Was there only a single token in the batch?
    • I don't see why this would cause the problem you've seen.
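
To illustrate points 2 and 3, here is a sketch using the legacy batch API as it looks around b3614 (the 4-argument llama_batch_get_one; `ctx` and the tokenized prompt are assumed to already exist):

// Decode a multi-token prompt with the legacy helper and read the logits of the
// last token. llama_batch_get_one leaves batch.logits == NULL, which llama_decode
// treats as "only the last token of the batch outputs logits".
static float * decode_prompt(struct llama_context * ctx,
                             llama_token * prompt_tokens, int32_t n_tokens) {
    struct llama_batch batch = llama_batch_get_one(prompt_tokens, n_tokens,
                                                   /*pos_0 =*/ 0, /*seq_id =*/ 0);

    if (llama_decode(ctx, batch) != 0) {
        return NULL; // decode failed
    }

    // With n_tokens > 1, position 0 produced no logits, so idx 0 triggers
    // "batch.logits[0] != true"; idx -1 (the last output) is always valid here.
    return llama_get_logits_ith(ctx, -1);
}

With that setup, the old default of idx 0 only happens to work when the batch holds a single token, which may explain why single-token decodes never hit the error.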
MaggotHATE commented 2 weeks ago

Hi, @compilade! Thank you for the answer! Changing the default idx to -1 helps; the error no longer occurs. I guess I will have to update my sampling.h/.cpp soon anyway, especially since the sampling refactoring PR is ready.

I looked at the change, and it was introduced in https://github.com/ggerganov/llama.cpp/commit/e3c337d87ca650972105a51c6ce302dd236c07ad (5 months ago!), which I probably missed. It looks like the feature was unused before your PR.

Was it supposed to affect all model architectures, though? From your PR it reads like a big change, but the title of that commit understates its scope.

I also see a significant decrease in prompt processing speed for relatively long prompts (400+ characters) after merging ggml.h/.cpp and llama.h/.cpp. I had to use OpenMP again to fix it. Just to clarify, can the new batch splits affect prompt processing? I didn't test long prompts back then because of the runtime error, so I'm not sure whether the slowdown was there too.

compilade commented 2 weeks ago

Thank you for the answer! Changing default idx to -1 helps, the error no longer occurs.

That is very good to know!

It looks like the feature was unused before your PR.

That's weird, because I don't really see how the batch splits relate to negative indices in llama_get_logits_ith. Like I said, I don't know why it worked for you for a while with the default idx at 0 even when the upstream code assumes -1. It should have failed months ago.

Just to clarify, can new batch splits affect prompt processing?

It should not make it slower. For Transformer-based models it avoids copies and uses views into the logical batch to split it into physical batches.

From some tests with extremely small models (with 50k parameters), which make most overhead very measurable, prompt processing did not slow down for heterogeneous prompts like in the HellaSwag benchmark. At least not after I made it avoid copies for Transformer-based models (it was already avoiding copies before).

Basically that change was mostly internal refactoring to allow more elaborate batch splits for recurrent models. It also improves how the legacy batch API is handled (e.g. llama_batch_get_one) by re-using the same buffers instead of re-allocating new ones at each llama_decode. But the behavior should be the same.

MaggotHATE commented 2 weeks ago

Interesting. To clarify, I'm currently testing on CPU only, compiled with OpenBLAS. No such issues happened previously, but I disabled OpenMP after the Threadpool 2 commit due to slightly slower prompt processing and inference. Now it seems to be required to get adequate processing speeds.

I will keep testing it, maybe the prompt processing issue will fix itself after I update sampling completely. In any case, thanks again for pointing out the root cause.