ggerganov / llama.cpp

LLM inference in C/C++
MIT License

I finally found proof that server output can be different (and vs Groq now) - model name: Llama 3 8B Instruct #6955

Closed x4080 closed 4 months ago

x4080 commented 4 months ago

Hi, I don't know if this is a bug or not. Previously I noticed that the answer from the server is different from regular llama.cpp. Now I can prove it; here goes:

First, this is using regular llama.cpp (this also matches the output from Groq):

./main -ngl 99 -t 0 -m ./models/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf --color --temp 0.1 -n -1 -f prompts/testprompt.txt --min_p 0.1 --top_p 1 --top_k 50 --repeat_penalty 1.1 -c 4096 -r "<|eot_id|>"

testprompt.txt

<|begin_of_text|><|start_header_id|>system<|end_header_id|>
<instruction>
-You are an agents manager with agent call functionality to delegate user's query to appropriate agent. Output only in json format without any additional text :
{"function":"<function name you want to call>", params:<parameters for the function>}
-Dont answer user's query yourself, always use agent calls
-You can only call functions provided below :
<functions>
- ask_expertcoder_agent : {
    "description":"Always use this ask programming or code related queries",
    "parameters": {
        {
            "request":"user's request"
            "type":"string"
        }
    }
    "required":["request"]
<functions/>
<instruction/>

<chatHistory>
</chatHistory>
<|eot_id|><|start_header_id|>user<|end_header_id|>
how to display current date in dd/mm/yyyy format using python
<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Output:

{"function":"ask_expertcoder_agent", "params": {"request": "How to display current date in dd/mm/yyyy format using Python?"}}

Here's using the server:

./build/bin/./server -m ./models/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf -c 4096 -ngl 99 -t 1 --chat-template llama3 

Here's the JSON payload to send:

{
  "model": "gpt-3.5-turbo",
  "temperature": 0.1,
  "top_p": 1,
  "top_k": 5,
  "min_p": 0.1,
  "repeat_penalty": 1.1,
  "stream": false,
  "messages": [
    {
      "role": "system",
      "content": "-You are an agents manager with agent call functionality to delegate user's query to appropriate agent. Output only in json format without any additional text :\\n{\"function\":\"<function name you want to call>\", params:<parameters for the function>}\\n-Dont answer user's query yourself, always use agent calls\\n-You can only call functions provided below :\\n<functions>\\n- ask_expertcoder_agent : {\\n    \"description\":\"Always use this ask programming or code related queries\",\\n    \"parameters\": {\\n        {\\n            \"request\":\"user's request\"\\n            \"type\":\"string\"\\n        }\\n    }\\n    \"required\":[\"request\"]\\n<functions/>\\n<instruction/>\\n\\n<chatHistory>\\n</chatHistory>"
    },
    {
      "role": "user",
      "content": "how to display current date in dd/mm/yyyy format using python"
    }
  ]
}
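
For reference, this is roughly how such a payload can be posted to the server's OpenAI-compatible chat endpoint - a minimal sketch assuming the default port 8080 and the Python requests package; the system prompt is truncated here for brevity:

import requests

payload = {
    "model": "gpt-3.5-turbo",  # ignored by the llama.cpp server
    "temperature": 0.1,
    "top_p": 1,
    "top_k": 5,
    "min_p": 0.1,
    "repeat_penalty": 1.1,
    "stream": False,
    "messages": [
        {"role": "system", "content": "-You are an agents manager ..."},  # full system prompt from above
        {"role": "user", "content": "how to display current date in dd/mm/yyyy format using python"},
    ],
}

resp = requests.post("http://localhost:8080/v1/chat/completions", json=payload)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])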

Here's the output:

{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "You can use the `datetime` module in Python to display the current date in dd/mm/yyyy format. Here's an example:\n```\nimport datetime\n\ncurrent_date = datetime.date.today()\nformatted_date = current_date.strftime(\"%d/%m/%Y\")\n\nprint(formatted_date)\n```\nThis will output the current date in the format dd/mm/yyyy.\n\nHere's a breakdown of how it works:\n\n* `datetime.date.today()` returns the current date as a `date` object.\n* `strftime()` is a method that formats the date object into a string. The `%d/%m/%Y` format code specifies the desired output format:\n\t+ `%d`: day of the month (01-31)\n\t+ `%m`: month (01-12)\n\t+ `%Y`: year (in four digits)\n\nBy combining these format codes, you get the desired output format: dd/mm/yyyy.\n\nYou can also use the `datetime.datetime.now()` function to get the current date and time, if needed:\n```\nimport datetime\n\ncurrent_datetime = datetime.datetime.now()\nformatted_date = current_datetime.strftime(\"%d/%m/%Y %H:%M:%S\")\n\nprint(formatted_date)\n```\nThis will output the current date and time in the format dd/mm/yyyy HH:MM:SS.",
        "role": "assistant"
      }
    }
  ],
  "created": 1714266899,
  "model": "gpt-3.5-turbo",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 263,
    "prompt_tokens": 186,
    "total_tokens": 449
  },
  "id": "chatcmpl-pphq2J7AlzKpo5ynIdbsQ74EPm9VadRi"
}

So basically the function-call JSON is not generated; instead it directly writes out the code.

I understand that it seems impossible for the results to differ between the server and regular llama.cpp, but it did happen.

PS: I also tried Ollama and its output is like the server's.

Is this a bug ?

Thanks

phymbert commented 4 months ago

You are not using any seed, are you?

Jeximo commented 4 months ago

Now I can prove it, here goes

@x4080 -s SEED, --seed SEED: Set the random number generator (RNG) seed (default: -1, -1 = random seed).

Please remove the randomness by setting a seed, e.g. --seed 7.

x4080 commented 4 months ago

Hi, I didn't use any seed, so I should add a seed instead?

Edit: I tried using a seed on both, and the results still don't change:

./main -ngl 99 -t 0 -m ./models/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf --color --temp 0.1 -n -1 -f prompts/testprompt.txt --min_p 0.1 --top_p 1 --top_k 50 --repeat_penalty 1.1 -c 4096 -r "<|eot_id|>" --seed 1

and

{
  "model": "gpt-3.5-turbo",
  "temperature": 0.1,
  "top_p": 1,
  "top_k": 5,
  "min_p": 0.1,
  "repeat_penalty": 1.1,
  "seed":1,
  "stream": false,
  "messages": [
    {
      "role": "system",
      "content": "-You are an agents manager with agent call functionality to delegate user's query to appropriate agent. Output only in json format without any additional text :\\n{\"function\":\"<function name you want to call>\", params:<parameters for the function>}\\n-Dont answer user's query yourself\n-You can only call functions provided below :\\n<functions>\\n- ask_expertcoder_agent : {\\n    \"description\":\"Always use this ask programming or code related queries\",\\n    \"parameters\": {\\n        {\\n            \"request\":\"user's request\"\\n            \"type\":\"string\"\\n        }\\n    }\\n    \"required\":[\"request\"]\\n<functions/>\\n<instruction/>\\n\\n<chatHistory>\\n</chatHistory>"
    },
    {
      "role": "user",
      "content": "how to display current date in dd/mm/yyyy format using python"
    }
  ]
}

Jeximo commented 4 months ago

Why do you expect the server to handle the gpt-3.5 JSON?

Maybe try curl --request POST --url http://localhost:8080/completion --data '{"prompt": "whats 5+5?", "temperature": 0, "seed": 1, "n_predict": 128}' for server.

x4080 commented 4 months ago

The model name is ignored by the llama.cpp server; I use it because this code used to call the ChatGPT API.

x4080 commented 4 months ago

OK, today I tried the new GGUF fix https://huggingface.co/bartowski/Meta-Llama-3-8B-Instruct-GGUF and updated llama.cpp, and now the server and regular llama.cpp results are the same without a seed. Maybe I can close this for now.

mirekphd commented 3 months ago

I think this was closed prematurely, @Jeximo. Simple math questions with only one correct answer, and with sampling effectively turned off by zero temperature, are all too likely to yield predictably fixed answers even if the custom seed value never got passed to the inference engine (which I believe it wasn't).

Kindly redo your test with something open-ended, e.g. tweet generation instead of math operations (repeated several times with the same prompt and seed), and with the temperature maximized instead of minimized (and possibly top_p as well). Please also look into the HTTP client logs to see whether the seed is being set to the expected values. A sketch of such a repeatability check follows below.
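
A minimal sketch of that check, assuming a local llama.cpp server on the default port 8080 and the Python requests package; the prompt and sampling values are just illustrative:

import requests

URL = "http://localhost:8080/completion"
payload = {
    "prompt": "Write a short tweet about open-source LLM inference.",
    "temperature": 1.5,  # deliberately high, so an ignored seed shows up as randomness
    "top_p": 1.0,
    "seed": 42,
    "n_predict": 64,
}

outputs = set()
for _ in range(5):
    resp = requests.post(URL, json=payload)
    resp.raise_for_status()
    outputs.add(resp.json()["content"])

# If the seed is honored, all five completions should be identical.
print(len(outputs), "distinct completion(s) out of 5")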

x4080 commented 3 months ago

@mirekphd, I found out what makes the results differ between the server and regular llama.cpp:

Edit: and the repeat penalty

mirekphd commented 3 months ago

@x4080 I think the reason for the model response randomness is even simpler here (in the llama.cpp server): the custom seed passed to the REST API is not used by the server, which can even be seen in the client logs. In contrast, in the Python package llama_cpp_python (with a locally accessed llama.cpp backend, without the API calls), deterministic responses (over multiple repeats of the test inference) work correctly - it's sufficient to fix the seed to get the same results, regardless of temperature or top_p settings.
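
For reference, a minimal sketch of that determinism check through llama_cpp_python (assuming pip install llama-cpp-python, a local GGUF file, and a recent enough version that accepts a per-call seed; the model path and sampling values are illustrative):

from llama_cpp import Llama

llm = Llama(
    model_path="./models/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf",
    n_ctx=4096,
    seed=1,        # fixed RNG seed
    verbose=False,
)

outputs = set()
for _ in range(3):
    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Write a short tweet about Python."}],
        temperature=1.5,  # high on purpose
        top_p=1.0,
        seed=1,           # passed per call as well
        max_tokens=64,
    )
    outputs.add(out["choices"][0]["message"]["content"])

# With the seed fixed, all repeats should match.
print(len(outputs), "distinct completion(s) out of 3")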

x4080 commented 3 months ago

@mirekphd that's interesting, I didn't know that about the seed - so is it a feature or a bug? What I found out concerns the repeat penalty: the docs say the default is 1.1, but in fact it is 1.0.

mirekphd commented 3 months ago

@mirekphd that's interesting, I didn't know that about the seed - so is it a feature or a bug? What I found out concerns the repeat penalty: the docs say the default is 1.1, but in fact it is 1.0.

Yes, I can confirm your observation of repeat_penalty being too low (1.0 instead of the expected documented 1.1).

Could you file a bug report for this? And I will report the issue with seed. My case is arguably easier to prove by its unwanted side effects (i.e. randomness of responses), as I have achieved reproducibility (even for high temperature and top_p) by simply fixing the seed through the Python package (local binding, without client-server communication).

On the other hand, while harder to prove, your finding is arguably more serious, because it affects all users of the high-level OpenAI API, where the repeat_penalty argument is not even exposed, so there is no easy workaround apart from dropping the openai client altogether and switching to some lower-level / generic HTTP client.
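
For example, here is a minimal sketch of that workaround using a generic HTTP client against the server's native /completion endpoint, which accepts repeat_penalty directly (assuming the default port 8080 and the Python requests package; values are illustrative):

import requests

payload = {
    "prompt": "how to display current date in dd/mm/yyyy format using python",
    "temperature": 0.1,
    "top_p": 1.0,
    "min_p": 0.1,
    "top_k": 50,
    "repeat_penalty": 1.1,  # set explicitly instead of relying on the default
    "seed": 1,
    "n_predict": 256,
}

resp = requests.post("http://localhost:8080/completion", json=payload)
resp.raise_for_status()
print(resp.json()["content"])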

mirekphd commented 3 months ago

And I will report the issue with seed.

I did it here: https://github.com/ggerganov/llama.cpp/issues/7381

x4080 commented 3 months ago

I think I filed an issue weeks ago: #7109