There are currently issues with nondeterminism in the server, especially when using >1 slots; see e.g. https://github.com/ggerganov/llama.cpp/pull/7347.
However, I think that in this case the seed that is being reported back is simply incorrect. When I run
curl --request POST --url http://localhost:8080/completion --data '{"prompt": "", "temperature": 0.7, "top_p": 0.8, "repeat_penalty": 1.1, "seed": 42, "n_predict": 20}' | python3 -m json.tool
multiple times I get the exact same output but I get different outputs when I don't set the seed.
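For reference, a quick way to check this (just a sketch, assuming the server is listening on localhost:8080) is to repeat the same fixed-seed request a few times and compare hashes of the generated text:

```sh
# Repeat the same fixed-seed request and hash the generated text;
# identical hashes across runs mean the output is reproducible.
for i in 1 2 3; do
  curl -s --request POST --url http://localhost:8080/completion \
    --data '{"prompt": "", "temperature": 0.7, "top_p": 0.8, "repeat_penalty": 1.1, "seed": 42, "n_predict": 20}' \
    | python3 -c 'import json, sys; print(json.load(sys.stdin)["content"])' \
    | sha256sum
done
```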
> I get the exact same output but I get different outputs when I don't set the seed.
Presumably never trying my test (prompt), or restarting the server after trying it and before trying yours, right? :) Because if you ran your prompt immediately after my prompt (without restarting the server), emptying the prompt and setting the seed would not help in achieving deterministic responses for empty prompts - you would get a different response every time despite the fixed seed. It must therefore have to do with the prompt history not being reset after each inference (as would be intuitively expected, and as `llama_cpp_python` does) - is there a non-default setting to clear the cache that I missed?
I may also add that, in my experience, even the perfectly deterministic results achievable by fixing the seed in `llama_cpp_python` can become non-deterministic if the conversation history is not empty and the context window contains a few previous iterations of those perfectly deterministic questions and responses. :)
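Regarding the cache question above: one request field that may be worth checking (an assumption to verify against the server README, not a confirmed fix) is `cache_prompt`, which controls whether the KV cache from a previous request is reused. A sketch:

```sh
# Sketch: explicitly disable prompt-cache reuse for a single request
# (assumes the /completion endpoint honours a "cache_prompt" boolean field).
curl --request POST --url http://localhost:8080/completion \
  --data '{"prompt": "I", "cache_prompt": false, "temperature": 0.7, "top_p": 0.8, "seed": 42, "n_predict": 20}' \
  | python3 -m json.tool
```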
As I said, there are issues with nondeterminism that I am already aware of. These especially affect long generations like yours, where small differences in rounding error will cause the sequences to diverge at points where the distribution of possible continuations is flat, such as at the beginnings of sentences.
However, if this were an issue with the seed not being set correctly, the sequences would diverge right from the beginning. I can also confirm, just by looking at the code, that the wrong value is simply being reported back. So there is more than one issue at play here causing nondeterministic results.
Okay, seems like you are right after all. The first few tokens (but not necessarily all 20 - I managed to get several different versions even with that limit :) are always the same if:
a) we have an empty prompt (or a very short one like "I"), and
b) we fix the seed (here: at 42),
while they are always different if we change point b), i.e. set the seed to -1 (i.e. allow it to be chosen randomly).
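A compact way to see both cases side by side (a sketch, with the same assumptions about host and port as above):

```sh
# Two runs with a fixed seed (42) should start with the same tokens;
# two runs with seed -1 (random) should differ from each other.
for seed in 42 42 -1 -1; do
  curl -s --request POST --url http://localhost:8080/completion \
    --data "{\"prompt\": \"I\", \"temperature\": 0.7, \"top_p\": 0.8, \"seed\": $seed, \"n_predict\": 20}" \
    | python3 -c 'import json, sys; print(json.load(sys.stdin)["content"])'
done
```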
I think this issue can be safely closed in favor of yours, but first let's link here the issue(s) where the non-deterministic responses due to the "butterfly effect" are investigated, shall we? I'd like to chip in with my tests there.
I don't have a comprehensive list but https://github.com/ggerganov/llama.cpp/pull/7347 , https://github.com/ggerganov/whisper.cpp/issues/1941#issuecomment-1986923227 , and https://github.com/ggerganov/llama.cpp/pull/6122#discussion_r1531405574 are related.
Thank you for the links! And it turned out that this: "I'm trying to figure out how to enable deterministic results for >1 slots." (from https://github.com/ggerganov/llama.cpp/pull/7347) could fully explain all of the reproducibility problems, even for my tweet-based tests (I was using multi-processing in the server, left over from my previous scaling tests, while multi-processing was not turned on / not available in the Python package, where reproducibility with fixed seeds was never an issue).
So the current workaround that will give reproducible results for any combination of inference parameters (not restricted to the near-zero `temperature` that some use as another - rather limiting - workaround) is to turn off multi-processing in the llama.cpp HTTP server by setting `--parallel` to 1 (or leaving out this argument altogether, as 1 is its current default value; see the docs).
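Concretely, a single-slot setup looks like the sketch below (the binary name and model path are assumptions - older builds ship the server binary as ./server, newer ones as ./llama-server - and the prompt is just an illustrative tweet-style request):

```sh
# Start the HTTP server with a single slot (no parallel decoding).
./llama-server -m models/Meta-Llama-3-8B-Instruct-Q8_0.gguf --parallel 1 --port 8080

# In another shell: with one slot and a fixed seed, repeated requests
# should now return reproducible output.
curl --request POST --url http://localhost:8080/completion \
  --data '{"prompt": "Write a tweet about llamas.", "temperature": 0.7, "top_p": 0.8, "repeat_penalty": 1.1, "seed": 42, "n_predict": 64}' \
  | python3 -m json.tool
```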
Yes, to my knowledge using a single slot should make the results reproducible.
Problem. The custom `seed` value is not passed to the inference engine when using the llama.cpp HTTP server (even though it works as expected in the `llama_cpp_python` package).

How to reproduce: in the latest Linux version of llama.cpp, repeat several times exactly the same cURL request to the completion API endpoint of the llama.cpp HTTP server, with the prompt containing an open question and with a high value of `temperature` and `top_p` (to maximize the variability of the model output), while fixing the `seed`, e.g. like this one to infer from the 8-bit quant of the bartowski/Meta-Llama-3-8B-Instruct-GGUF (Meta-Llama-3-8B-Instruct-Q8_0.gguf) model:
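The exact request from the original report is not preserved here; the following is a representative sketch (port, prompt, and parameter values are placeholders):

```sh
# Representative request: open-ended prompt, high temperature/top_p, fixed seed.
curl --request POST --url http://localhost:8080/completion \
  --data '{"prompt": "What is the meaning of life?", "temperature": 1.0, "top_p": 0.95, "seed": 42, "n_predict": 128}' \
  | python3 -m json.tool
```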
We can see that regardless of the value passed to `seed` in the HTTP request (e.g. 42 in the example above), the `seed` values reported back to the HTTP client are invariably the default ones (4294967295, i.e. -1 cast to unsigned int). The fact that the default -1 (i.e. a random, unobservable and non-repeatable seed) is used as the seed, while the custom client-supplied values are being ignored, is corroborated by the fact that the model-generated output is always different, rather than always the same as expected (and as attainable with the above settings when repeating this test against the non-server llama.cpp backend using its Python package - a local binding, without client-server communication).