You are not using any seed, are you?

> Now I can prove it, here goes

@x4080 `-s SEED, --seed SEED`: Set the random number generator (RNG) seed (default: -1, -1 = random seed).

Please remove the randomness by setting a seed, e.g. `--seed 7`.
Hi, I didn't use any seed, so should I add a seed instead?

Edit: I tried using a seed on both, and the results still don't change:
```
./main -ngl 99 -t 0 -m ./models/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf --color --temp 0.1 -n -1 -f prompts/testprompt.txt --min_p 0.1 --top_p 1 --top_k 50 --repeat_penalty 1.1 -c 4096 -r "<|eot_id|>" --seed 1
```
and
```json
{
  "model": "gpt-3.5-turbo",
  "temperature": 0.1,
  "top_p": 1,
  "top_k": 5,
  "min_p": 0.1,
  "repeat_penalty": 1.1,
  "seed": 1,
  "stream": false,
  "messages": [
    {
      "role": "system",
      "content": "-You are an agents manager with agent call functionality to delegate user's query to appropriate agent. Output only in json format without any additional text :\\n{\"function\":\"<function name you want to call>\", params:<parameters for the function>}\\n-Dont answer user's query yourself\n-You can only call functions provided below :\\n<functions>\\n- ask_expertcoder_agent : {\\n \"description\":\"Always use this ask programming or code related queries\",\\n \"parameters\": {\\n {\\n \"request\":\"user's request\"\\n \"type\":\"string\"\\n }\\n }\\n \"required\":[\"request\"]\\n<functions/>\\n<instruction/>\\n\\n<chatHistory>\\n</chatHistory>"
    },
    {
      "role": "user",
      "content": "how to display current date in dd/mm/yyyy format using python"
    }
  ]
}
```
Why do you expect `server` to call the gpt-3.5 json? Maybe try

```
curl --request POST --url http://localhost:8080/completion --data '{"prompt": "whats 5+5?", "temperature": 0, "seed": 1, "n_predict": 128}'
```

for `server`.
The model name is ignored by the llama.cpp server; I use it because this code used to call the ChatGPT API.
Ok, today I tried the new GGUF fix https://huggingface.co/bartowski/Meta-Llama-3-8B-Instruct-GGUF and updated llama.cpp, and now the server and regular llama.cpp results are the same without a seed. Maybe I can close this for now.
I think this was closed prematurely, @Jeximo. Simple math questions with only one correct answer, and with sampling turned off by zero temperature, were all too easy to yield predictably fixed answers even if the custom `seed` value did not get passed to the inference engine (which I believe it wasn't).

Just kindly retake your test for the purpose of e.g. tweet generation instead of math operations (repeated several times with the same prompt and seed), with the exactly opposite (i.e. maximized) `temperature` (and possibly also `top_p`). Please look into the HTTP client logs as well to see if the `seed` is being set to the expected values.
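For reference, here is a minimal sketch of that repeated-inference check in Python against the server's native `/completion` endpoint (the URL and payload shape follow the curl example above; the prompt, repeat count, and high temperature value are illustrative, and `requests` is the only dependency):

```python
# Send the same /completion request several times with a fixed seed and a
# deliberately high temperature; if the seed is honoured, every response
# should be identical.
import requests

URL = "http://localhost:8080/completion"  # as in the curl example above
payload = {
    "prompt": "Write a short tweet about the weather.",  # illustrative prompt
    "temperature": 2.0,  # maximized to expose sampling randomness
    "top_p": 1.0,
    "seed": 1,           # fixed seed: responses should not vary
    "n_predict": 64,
}

outputs = [requests.post(URL, json=payload).json()["content"] for _ in range(5)]
print("deterministic:", len(set(outputs)) == 1)
```

If the server ignores the `seed` field, the five outputs will almost certainly differ at this temperature.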
@mirekphd, I found out what makes the results differ between `server` and regular llama.cpp:

Edit: and the repeat penalty
@x4080 I think the reason for the model response randomness is even simpler here (in the llama.cpp `server`): the custom seed, when passed to the REST API, is not used by the server, which can even be seen in the client logs. In contrast, in the Python package `llama_cpp_python` (with a locally accessed `llama.cpp` backend, without the API calls), deterministic responses (over multiple repeats of the test inference) work correctly: it's sufficient to fix the `seed` to get the same results, regardless of the `temperature` or `top_p` settings.
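And here is a minimal sketch of the corresponding local check through `llama_cpp_python` (the model path and prompt are placeholders; re-creating the `Llama` object per run is a blunt but unambiguous way to reset the sampler RNG to the same seed):

```python
# With the local llama_cpp_python binding the seed is honoured, so
# identical runs should produce identical text even at high temperature.
from llama_cpp import Llama

def generate(seed: int) -> str:
    # Re-creating the model resets the sampler RNG to `seed` each time.
    llm = Llama(model_path="./models/model.gguf", seed=seed, verbose=False)
    out = llm("Write a short tweet about the weather.",
              max_tokens=64, temperature=2.0, top_p=1.0)
    return out["choices"][0]["text"]

runs = [generate(seed=1) for _ in range(3)]
print("deterministic:", len(set(runs)) == 1)
```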
@mirekphd that's interesting, I didn't know that about the seed, so is it a feature or a bug? What I found out is about the repeat penalty: in the doc, I think, the default is 1.1; in fact it is 1.0.
Yes, I can confirm your observation of `repeat_penalty` being too low (1.0 instead of the expected, documented 1.1).

Could you file a bug report for this? And I will report the issue with `seed`. My case is arguably easier to prove by its unwanted side effects (i.e. randomness of responses), as I have achieved reproducibility (even for high `temperature` and `top_p`) by simply fixing the `seed` through the Python package (local binding, without client-server communication).

On the other hand, while harder to prove, your finding is arguably more serious, because it affects all users of the high-level OpenAI API, where the `repeat_penalty` argument is not even exposed by the API, so there is no easy workaround for it, apart from dropping `openai` altogether and switching to some lower-level / generic HTTP client.
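A sketch of that lower-level workaround, assuming the server exposes the OpenAI-compatible `/v1/chat/completions` route and accepts `repeat_penalty` as an extra body field (as in the JSON payload earlier in this thread):

```python
# Bypass the openai client and put repeat_penalty directly into the body.
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",  # OpenAI-compatible route
    json={
        "model": "gpt-3.5-turbo",  # ignored by the llama.cpp server
        "messages": [{"role": "user", "content": "Say hello."}],
        "repeat_penalty": 1.1,     # set explicitly instead of trusting the default
        "seed": 1,
        "temperature": 0.1,
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```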
> And I will report the issue with `seed`.
I did it here: https://github.com/ggerganov/llama.cpp/issues/7381
I think I filed an issue weeks ago: #7109
Hi, I don't know if this is a bug or not. Previously I noticed that the answer from `server` is different than when using regular llama.cpp. Now I can prove it, here goes:

First, this is using regular llama.cpp (and also the output from Groq):
testprompt.txt
Output:
Here's using `server`:

Here's the JSON to call:

Here's the output:
So basically the function JSON is not generated; instead it directly writes the code.

I understand that it seems impossible that the results can differ between `server` and regular llama.cpp, but it did happen.

PS: I also tried Ollama, and its output is like the `server` one.
Is this a bug?
Thanks