ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Custom `seed` values ignored by `llama.cpp HTTP server` #7381

Closed mirekphd closed 5 months ago

mirekphd commented 5 months ago

Problem. The custom `seed` value is not passed to the inference engine when using the llama.cpp HTTP server (even though it works as expected in the llama_cpp_python package).

How to reproduce: using the latest Linux build of llama.cpp, repeat the exact same cURL request to the completion API endpoint of the llama.cpp HTTP server several times, with a prompt containing an open question and high values of temperature and top_p (to maximize the variability of the model output), while fixing the seed. For example, the following request runs inference against the 8-bit quant of bartowski/Meta-Llama-3-8B-Instruct-GGUF (Meta-Llama-3-8B-Instruct-Q8_0.gguf):

$ curl --request POST --url http://localhost:12345/completion  --data '{"prompt": "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nwrite a tweet that Elon Musk would write to boost TSLA shares<|eot_id|><|start_header_id|>assistant<|end_header_id|>", "temperature": 0.7, "top_p": 0.8, "repeat_penalty": 1.1, "seed": 42, "n_predict": 2048}' | grep seed

We can see that regardless of the value passed as seed in the HTTP request (e.g. 42 in the example above), the seed value reported back to the HTTP client is invariably the default one (4294967295, i.e. -1 cast to unsigned int).

The fact that the default -1 (i.e. a random, unobservable and non-repeatable seed) is used while the custom client-supplied value is ignored is corroborated by the model-generated output: it is always different, rather than always the same as expected. With the same settings, repeating this test against the non-server llama.cpp backend through its Python package (a local binding, without client-server communication) does produce identical outputs.
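For reference, a minimal shell sketch of the reproduction loop (port and sampling parameters taken from the request above; the prompt is simplified here by dropping the Llama-3 chat-template tokens, and the grep pattern just pulls the reported seed out of the JSON response):

# Repeat the same fixed-seed request three times and print the seed the server
# reports back; with the bug present it is 4294967295 (i.e. -1) every time.
for i in 1 2 3; do
  curl -s --request POST --url http://localhost:12345/completion \
    --data '{"prompt": "write a tweet that Elon Musk would write to boost TSLA shares", "temperature": 0.7, "top_p": 0.8, "repeat_penalty": 1.1, "seed": 42, "n_predict": 64}' \
    | grep -o '"seed": *[-0-9]*'
done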

JohannesGaessler commented 5 months ago

There are currently issues with nondeterminism in the server, especially when using >1 slots, see e.g. https://github.com/ggerganov/llama.cpp/pull/7347 .

However, I think that in this case the seed that is being reported back is simply incorrect. When I run

curl --request POST --url http://localhost:8080/completion  --data '{"prompt": "", "temperature": 0.7, "top_p": 0.8, "repeat_penalty": 1.1, "seed": 42, "n_predict": 20}' | python3 -m json.tool

multiple times I get the exact same output but I get different outputs when I don't set the seed.
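The same check can be written as an explicit comparison; a sketch under the assumption that the generated text is returned in the `content` field of the JSON response (python3 is only used to extract it, as in the command above):

# Send the same fixed-seed request twice to an otherwise idle server and compare
# the generated text; if seeding works, both completions should be identical.
REQ='{"prompt": "", "temperature": 0.7, "top_p": 0.8, "repeat_penalty": 1.1, "seed": 42, "n_predict": 20}'
A=$(curl -s --request POST --url http://localhost:8080/completion --data "$REQ" \
      | python3 -c 'import json,sys; print(json.load(sys.stdin)["content"])')
B=$(curl -s --request POST --url http://localhost:8080/completion --data "$REQ" \
      | python3 -c 'import json,sys; print(json.load(sys.stdin)["content"])')
[ "$A" = "$B" ] && echo "identical completions" || echo "completions differ"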

mirekphd commented 5 months ago

> I get the exact same output but I get different outputs when I don't set the seed.

Presumably you never tried my test (prompt), or you restarted the server after trying it and before trying yours, right? :) Because if you ran your prompt immediately after mine (without restarting the server), emptying the prompt and fixing the seed would not be enough to get deterministic responses: you would get a different response every time despite the fixed seed. It must therefore have to do with the prompt history not being reset after each inference (as would be intuitively expected, and as llama_cpp_python behaves) - is there a non-default setting to clear the cache that I missed?

I may also add that, in my experience, even the perfectly deterministic results achievable by fixing the seed in llama_cpp_python can become non-deterministic if the conversation history is not empty and the context window contains a few previous iterations of those perfectly deterministic questions and responses. :)
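A hedged sketch of how that hypothesis could be tested (the port and prompts are just the ones used earlier in this thread; whether the second pair of requests is influenced by the first depends on the server's slot/prompt-cache state):

# Warm the server with an unrelated prompt, then send the same fixed-seed request
# twice; if the two fixed-seed completions differ, earlier state is leaking into
# later generations.
curl -s --request POST --url http://localhost:12345/completion \
  --data '{"prompt": "write a tweet that Elon Musk would write to boost TSLA shares", "temperature": 0.7, "seed": 42, "n_predict": 64}' > /dev/null

FIXED='{"prompt": "I", "temperature": 0.7, "top_p": 0.8, "seed": 42, "n_predict": 20}'
R1=$(curl -s --request POST --url http://localhost:12345/completion --data "$FIXED" \
       | python3 -c 'import json,sys; print(json.load(sys.stdin)["content"])')
R2=$(curl -s --request POST --url http://localhost:12345/completion --data "$FIXED" \
       | python3 -c 'import json,sys; print(json.load(sys.stdin)["content"])')
[ "$R1" = "$R2" ] && echo "still deterministic" || echo "history affects the output"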

JohannesGaessler commented 5 months ago

As I said, there are issues with nondeterminism that I am already aware of. These especially affect long generations like yours where small differences in rounding error will cause the sequences to diverge at points where the distribution of possible continuations is flat, like at the beginnings of sentences.

However, if this were an issue with the seed not being set correctly, the sequences would diverge right from the beginning. I can also confirm, just from looking at the code, that the wrong value is simply being reported back. So there is more than one issue at play here causing nondeterministic results.

mirekphd commented 5 months ago

Okay, it seems you are right after all. The first few tokens (but not necessarily all 20 - I managed to get several different versions even with that limit :)) are always the same if: a) the prompt is empty (or very short, e.g. "I"), and b) the seed is fixed (here: at 42),
while they are always different if we change point b), i.e. set the seed to -1 (i.e. let it be chosen randomly).

I think this issue can safely be closed in favor of yours, but first let's link here the issue(s) where the non-deterministic responses due to this "butterfly effect" are being investigated, shall we? I'd like to chip in with my tests there.

JohannesGaessler commented 5 months ago

I don't have a comprehensive list but https://github.com/ggerganov/llama.cpp/pull/7347 , https://github.com/ggerganov/whisper.cpp/issues/1941#issuecomment-1986923227 , and https://github.com/ggerganov/llama.cpp/pull/6122#discussion_r1531405574 are related.

mirekphd commented 5 months ago

Thank you for the links! It turns out that this: "I'm trying to figure out how to enable deterministic results for >1 slots." (from https://github.com/ggerganov/llama.cpp/pull/7347 ) could fully explain all the reproducibility problems, even in my tweet-based tests: I was still running the server with multiple slots left over from my earlier scaling tests, whereas that kind of parallelism was not turned on / not available in the Python package, where reproducibility with fixed seeds was never an issue.

So the current workaround that gives reproducible results for any combination of inference parameters (not restricted to the near-zero temperature that some use as another, rather limiting, workaround) is to disable parallel slots in the llama.cpp HTTP server by setting --parallel to 1 (or leaving this argument out altogether, as 1 is its current default value; see the docs).
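For illustration, a minimal launch sketch, assuming the server binary and flags documented at the time of this issue (the model path is just the quant mentioned above; newer builds ship the server binary as `llama-server`):

# Run the llama.cpp HTTP server with a single slot so that fixed-seed requests
# are reproducible; --parallel 1 is also the default when the flag is omitted.
./server \
  -m models/Meta-Llama-3-8B-Instruct-Q8_0.gguf \
  --host 0.0.0.0 --port 12345 \
  -c 8192 \
  --parallel 1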

JohannesGaessler commented 5 months ago

Yes, to my knowledge using a single slot should make the results reproducible.