New note from local testing: it's not the size that makes the difference, it's the order. Loading any state that isn't the most-recently saved one causes the assert to fail. What are the odds that the load_state() internals don't set "logits capacity" properly on the underlying llama.cpp context instance?
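For concreteness, here's a minimal sketch of that failure mode using llama-cpp-python's save_state()/load_state() API; the model path and prompts are placeholders:

```python
# Minimal sketch of the order-dependent failure described above.
# The model path and prompts are placeholders.
from llama_cpp import Llama

llm = Llama(model_path="./model.gguf")

llm.eval(llm.tokenize(b"short prompt"))
state_a = llm.save_state()   # smaller state, saved first

llm.reset()
llm.eval(llm.tokenize(b"a considerably longer prompt, producing a larger state"))
state_b = llm.save_state()   # larger state, saved second (most recent)

llm.load_state(state_b)      # most recently saved state: loads fine
llm.load_state(state_a)      # any older state: the GGML assert fires
```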
Any luck working around this issue?
@stduhpf - not as yet. I still don't actually have a 100% repeatable test case. This suggests that the assert failure is affected somehow by something that changes from run to run. The only thing I can think of is the little prayer to RNGsus that gets made at the beginning of a new context.
My next step is to start saving states to disk and seeing if the condition persists across executions.
Thanks for doing the testing! Ping me if you find a solution; I'm wasting quite a bit of energy and time re-evaluating the same prompts over and over because of this issue...
It looks like this issue: https://github.com/ggerganov/llama.cpp/issues/3606, but that one was closed as completed two months ago, so I'm not sure what's going on here...
I ran into this because I updated for the Mixtral support. But since my stuff relies heavily on saving and loading states, this is essentially unusable for me, and I needed to go back to an old version. Turns out:
0.2.20: Same problem
0.2.10: Here it's not a GGML assert but an access violation. Fun fact: because of that I can at least handle the error from Python, while GGML asserts seem to go uncaught (see the sketch after this list).
0.2.5: Newest version I checked that worked. Went there after checking 0.2.0, which also worked. I don't feel like checking exactly which version between 0.2.5 and 0.2.10 breaks it. Hope it still helps with the fix.
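An aside on that 0.2.10 behaviour: an access violation inside a ctypes call surfaces in Python as an OSError (at least on Windows), so it can be caught, whereas a GGML assert aborts the process before Python ever sees it. A rough sketch, with a hypothetical recovery path:

```python
# Sketch only: the 0.2.10-era access violation is catchable as an OSError,
# while a GGML assert simply aborts the process and never raises.
try:
    llm.load_state(state)
except OSError as err:
    print(f"load_state failed: {err}")
    llm.reset()  # hypothetical fallback: discard the state and start over
```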
Edit: Did some more testing now that I'm on a faster system:
0.2.7 is the last one that works for me.
0.2.8 is yanked
0.2.9, interestingly, gives no error and loads successfully but then inference just never starts.
I should also note that in my code the state is always saved to disk using pickle before it is loaded again. The issues I'm experiencing are 100% reproducible.
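The disk round-trip looks roughly like this (a sketch, assuming the LlamaState object returned by save_state() pickles cleanly; llm is a Llama instance as in the earlier sketch, and the file name is a placeholder):

```python
# Sketch of the pickle round-trip described above.
import pickle

state = llm.save_state()
with open("state.pkl", "wb") as f:
    pickle.dump(state, f)

with open("state.pkl", "rb") as f:
    restored = pickle.load(f)

llm.load_state(restored)  # fails identically to the in-memory case
```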
Did a bit more testing and skimmed through the code of this project and llama.cpp. A few findings:
Testing with llama-cpp-python 0.2.26, I can reload the state of the model if the model instance that just saved it was not destroyed. This holds true even if I run a tiny inference in between, intended to mess with the state. These tests include saving the state to disk in between, using pickle.
While keeping the same process alive, just reloading the same model (obviously with identical settings) causes the following load_state call to fail with GGML_ASSERT: vendor\llama.cpp\llama.cpp:10102: ctx->logits.capacity() == logits_cap
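In code, the failing sequence looks like this (a sketch; the model path is a placeholder):

```python
# Sketch of the reload scenario described above; model path is a placeholder.
from llama_cpp import Llama

llm = Llama(model_path="./model.gguf")
state = llm.save_state()
llm.load_state(state)   # works: same instance that produced the state

del llm
llm = Llama(model_path="./model.gguf")  # fresh instance, identical settings
llm.load_state(state)   # GGML_ASSERT: ctx->logits.capacity() == logits_cap
```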
I was not able to identify exactly what changed between 0.2.7 and 0.2.9 that might have broken this. My theory is that changes in llama.cpp triggered a problem with how llama-cpp-python does this; I say this because llama.cpp's own session loading does work, to my knowledge. I also do not know if this function set was even intended to work the way I am using it, since actual session saves/loads seem to be on the todo list (going by # TODO: llama_load_session_file).
When that GGML assert fails, it's a check against something that does not seem to be set by load_state. To me it seems that ctx->logits.capacity() is just whatever it happens to be for other reasons. So after a model reload the two values are not identical (should they be?), hence the assert fails, while it works when loading the state into the model that just saved it. But again, I don't see how 0.2.7, where this works, would have set this up correctly. Maybe the asserts just weren't in llama.cpp yet, and it coincidentally worked.
Hope this helps. Maybe supplying the actual session load/save functions would be the fastest way to solve this.
Some learnings from a quick debug session and from going through the working example in the original llama.cpp examples.
The examples/simple/simple.cpp example uses the new-style batch API:
```cpp
llama_batch batch = llama_batch_init(512, 0, 1);

// evaluate the initial prompt
for (size_t i = 0; i < tokens_list.size(); i++) {
    llama_batch_add(batch, tokens_list[i], i, { 0 }, false);
}
```
llama-cpp-python also uses the same style of batch API:
```python
# From: llama_cpp/llama.py
with suppress_stdout_stderr(disable=self.verbose):
    self.batch = llama_cpp.llama_batch_init(
        self.n_tokens, self.embd, self.n_seq_max
    )
```
whereas examples/save-load-state/save-load-state.cpp uses the following old-style batch:
```cpp
llama_decode(ctx, llama_batch_get_one(tokens.data(), tokens.size(), n_past, 0));
```
The old-style batch usage is consistent in terms of the logits capacity, whereas with the new-style batch the capacity changes as the decode progresses. For the same tokens, the logits capacity in the llama_context therefore differs between the two batch styles when the state is saved immediately after decode, causing GGML_ASSERT: vendor\llama.cpp\llama.cpp:10102: ctx->logits.capacity() == logits_cap.
During debugging I also found that save-load-state does a warmup decode of BOS+EOS with the model. That too causes the capacity of the logits to change from 0 to the correct value.
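If that warmup is what sizes the logits buffer, a Python-side equivalent might look like the following. This is a hypothetical workaround mirroring the C++ warmup, not a confirmed fix:

```python
# Hypothetical workaround mirroring save-load-state.cpp's warmup:
# decode BOS+EOS once so the context's logits buffer reaches its real
# capacity before any state is saved or loaded.
llm.eval([llm.token_bos(), llm.token_eos()])
llm.reset()  # discard the warmup tokens before real use
```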
I tried to run the example code from the first post above (@handshape); as-is it ran without issue, but changing the order caused the assert error.
Attaching the log with debug info. EXPECTED CAPACITY is the capacity of the passed-in context.logits vector.
```
llama_model_loader: loaded meta data with 26 key-value pairs and 995 tensors from /root/Desktop/models/ins-mixtral-8x7b-v0.1.Q4_K.gguf (version GGUF V3 (latest))
...
Llama.save_state: saving llama state
Llama.save_state: got state size: 271573036
Llama.save_state: allocated state
SIZE: 768000, CAPACITY: 768000
Llama.save_state: copied llama state: 6307988
Llama.save_state: saving 6307988 bytes of llama state
Saved Second State
Llama.save_state: saving llama state
Llama.save_state: got state size: 276053036
Llama.save_state: allocated state
SIZE: 1888000, CAPACITY: 1888000
Llama.save_state: copied llama state: 15375648
Llama.save_state: saving 15375648 bytes of llama state
Saved First State
SIZE: 1888000, CAPACITY: 1888000, EXPECTED CAPACITY : 1888000
Loaded state.
Tokenized.
Superhero Fred, also known as the Invisible Defender, wakes up to another day in
SIZE: 1888000, CAPACITY: 1888000, EXPECTED CAPACITY : 1888000
Loaded state.
Tokenized.
In his secret identity as Fred, he has a dream of being recognized for his unique superhero persona
SIZE: 1888000, CAPACITY: 1888000, EXPECTED CAPACITY : 1888000
Loaded state.
Tokenized.
In the bustling city of Metropolis, Fred, the unsung hero, wakes
SIZE: 768000, CAPACITY: 768000, EXPECTED CAPACITY : 1888000
GGML_ASSERT: llama.cpp:10409: ctx->logits.capacity() == logits_cap
[New LWP 1908137]
[New LWP 1908138]
[New LWP 1908139]
```
https://github.com/ggerganov/llama.cpp/pull/4820 looks like it holds promise...
My original test case up top passes as of v0.2.29! If someone else can confirm, I think we can close this issue.
Works for me too! Just after I spent 10 hours implementing what I wanted by calling the llama.cpp server API instead 🤡
Calling this one resolved!
@handshape What can I do in Python to fix this error?
Prerequisites
Expected Behavior
When serializing and deserializing state from a Llama instance, I expect saving and loading to work regardless of the order of operations, and to emit meaningful exceptions if I break some piece of the usage contract.
Current Behavior
If I create two saved states, and then load the states and start sampling from them, loading the smaller of the two states fails with a GGML_ASSERT on ctx->logits.capacity() == logits_cap (see the failure log below) if the smaller state is saved before the larger state.
Environment and Context
I'm doing all dev work on Ubuntu Linux; I've reproduced the issue in WSL2, on an Azure VM, and on bare metal, running both on CPU and with clBLAS on GPU.
$ lscpu
$ uname -a
Failure Information (for bugs)
Steps to Reproduce
My example uses the MistralLite model because the prompt format is short, but the issue appears on every model I've tried.
As written above, the code works. If the order of creation of the first and second states is reversed, the assert fails when trying to load the second state.
Failure Logs
Success log:
Failure log: