ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Bug: Release version less accurate than Debug version consistently #9564

Closed SwamiKannan closed 1 month ago

SwamiKannan commented 1 month ago

What happened?

I created a function-calling multi-agent framework. I am using llama-server.exe as an inference server and Nous Research's Theta Q4, Q5 and Q6 models for the LLM. With all of these models, the function calling works perfectly when I build and run the llama.cpp server in Debug mode, but in Release mode it falters a lot: it hallucinates function names and parameters, which leads to many parsing errors. I understand that the Release build is far more efficient than the Debug build, so is there a way to get the Release build to match the accuracy of the Debug build? I have attached a sample log output below, but the difference is consistent across many function calls.
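For context, here is a minimal sketch of the kind of client call involved, assuming the framework talks to llama-server through its OpenAI-compatible endpoint. The host/port, model name, and the `stocks_analyst` tool description are illustrative assumptions, not the actual framework code:

```python
from openai import OpenAI

# llama-server exposes an OpenAI-compatible API; the host/port here are assumptions.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

# Hypothetical tool description matching the "stocks_analyst" function seen in the logs.
system_prompt = """You can call the following tool:
{"name": "stocks_analyst", "arguments": {"name": "<company name>"}}
Reply with a <tool_call> ... </tool_call> block when a tool is needed."""

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # llama-server serves whatever GGUF it was started with
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "What is the stock price of Reliance Industries ?"},
    ],
    temperature=0.0,  # deterministic sampling makes Debug/Release comparisons cleaner
)
print(response.choices[0].message.content)
```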

Name and Version

Windows, Debug and Release server builds, version: 3673 (8ebe8dde) built with MSVC 19.29.30154.0 for x64

What operating system are you seeing the problem on?

Windows

Relevant log output

Agent ready.....

Hello I am Nikki !
What can I do for you today?
What is the stock price of Reliance Industries ? <This is the user query>

<output on Debug -> correct solution>
ChatCompletion(id='chatcmpl-4AVQI57WxzZYuOg0NdUgNcWjENPX47sO', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='<tool_call>  {{"name": "stocks_analyst", "arguments": {"name": "Reliance Industries"}}} ', refusal=None, role='assistant', function_call=None, tool_calls=None))], created=1726836899, model='gpt-3.5-turbo', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=26, prompt_tokens=1721, total_tokens=1747, completion_tokens_details=None))

<incorrect output on Release version - There is no function called get_stock_details>

ChatCompletion(id='chatcmpl-HDMQsRH00GT2vDhl9VU8e8Bn1NSAfLEy', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='{"arguments": {"name": "Reliance Industries"}, "name": "get_stock_details"}', refusal=None, role='assistant', function_call=None, tool_calls=None))], created=1726837186, model='gpt-3.5-turbo', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=20, prompt_tokens=4006, total_tokens=4026, completion_tokens_details=None))
ggerganov commented 1 month ago

How big is your sample size? I.e., in how many examples does Debug beat Release?

Also, have you tried running a perplexity calculation with the two builds to see if there is any significant difference?
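A rough sketch of that comparison, assuming both builds include the `llama-perplexity` tool; the binary, model, and test-text paths below are placeholders that would need adjusting:

```python
import re
import subprocess

# Placeholder paths: point these at the Debug/Release llama-perplexity binaries,
# the GGUF model under test, and a standard evaluation text such as wiki.test.raw.
BUILDS = {
    "debug":   r"build-debug\bin\Debug\llama-perplexity.exe",
    "release": r"build-release\bin\Release\llama-perplexity.exe",
}
MODEL = r"models\hermes-theta-q4.gguf"
TEXT  = r"wikitext-2-raw\wiki.test.raw"

for name, exe in BUILDS.items():
    proc = subprocess.run([exe, "-m", MODEL, "-f", TEXT],
                          capture_output=True, text=True)
    # llama-perplexity prints a final "PPL = <value>" estimate; search both streams.
    combined = proc.stdout + proc.stderr
    match = re.search(r"PPL = ([0-9.]+)", combined)
    print(f"{name}: PPL = {match.group(1) if match else 'not found'}")
```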

SwamiKannan commented 1 month ago

Hi Georgi. Will check the perplexity calc and let you know. My observations are empirical: I must have tried about 15 prompts. Release got them right about 2-3 times, while Debug got them right in all but maybe two cases.
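To put numbers on that, a small harness along these lines could run the same prompts against both builds and count correct function calls. Everything here is an assumption for illustration: two server instances (Debug on :8080, Release on :8081), the prompt list, and the expected function names; the system prompt with the tool definitions is omitted for brevity:

```python
from openai import OpenAI

# Assumption: one llama-server per build, Debug on :8080 and Release on :8081.
SERVERS = {"debug": "http://localhost:8080/v1", "release": "http://localhost:8081/v1"}

# Hypothetical test set: (user prompt, function name the agent is expected to call).
TESTS = [
    ("What is the stock price of Reliance Industries ?", "stocks_analyst"),
    # ... add the remaining prompts from the ~15-prompt sample here
]

for build, url in SERVERS.items():
    client = OpenAI(base_url=url, api_key="not-needed")
    hits = 0
    for prompt, expected_fn in TESTS:
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.0,
        )
        # Count the call correct if the expected function name appears in the reply.
        if expected_fn in (resp.choices[0].message.content or ""):
            hits += 1
    print(f"{build}: {hits}/{len(TESTS)} correct function calls")
```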