ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Cache and system prompt on server makes output non-deterministic #4902

Closed Andreybest closed 7 months ago

Andreybest commented 9 months ago

Good day!

I was testing the system_prompt field on the server and tried to get the same answer from the raw variant (with the system prompt written directly into prompt) and from the system_prompt approach. (I assume it is a concatenation of system_prompt.prompt + prompt plus caching of the system prompt. If somebody can explain how system_prompt works, I would really appreciate it!)

To make this test I used temperature: 0 (to get deterministic answers), but I got different answers every time. I then removed all the other params I had used before and was left with this JSON:

curl --location 'http://localhost:8080/completion' \
--header 'Content-Type: application/json' \
--data '{
    "prompt": "[INST]\n<<SYS>>\nEnd each answer with a word '\''amogus'\''\n<</SYS>>\n\nTell a story about llama[/INST]\n",
    "cache_prompt": true,
    "temperature": 0
}'

On that request I get random completions. But on the one without "cache_prompt", I don't:

curl --location 'http://localhost:8080/completion' \
--header 'Content-Type: application/json' \
--data '{
    "prompt": "[INST]\n<<SYS>>\nEnd each answer with a word '\''amogus'\''\n<</SYS>>\n\nTell a story about llama[/INST]\n",
    "temperature": 0
}'

So cache_prompt is the issue here.

Returning to the system_prompt version, I get issues both with and without cache_prompt.

curl --location 'http://localhost:8080/completion' \
--header 'Content-Type: application/json' \
--data '{
    "prompt": "Tell a story about llama[/INST]\n",
    "temperature": 0,
    "system_prompt": {
        "prompt": "[INST]\n<<SYS>>\nEnd each answer with a word '\''amogus'\''\n<</SYS>>\n\n"
    }
}'

On this one (without cache_prompt) I get the same completion on the second and later runs (2nd, 3rd, 4th, ...), but it's not the same completion as the first one.

curl --location 'http://localhost:8080/completion' \
--header 'Content-Type: application/json' \
--data '{
    "prompt": "Tell a story about llama[/INST]\n",
    "temperature": 0,
    "cache_prompt": true,
    "system_prompt": {
        "prompt": "[INST]\n<<SYS>>\nEnd each answer with a word '\''amogus'\''\n<</SYS>>\n\n"
    }
}'

On this one I get the same behaviour as the one without cache_prompt (the 1st completion does not equal the 2nd, 3rd, 4th, ...), but the second and later completions are not a story about a llama at all; they are random questions...

So there are issues with both the system_prompt and cache_prompt fields.

TL;DR

I was testing to understand how system_prompt works in server.cpp. During testing I found that cache_prompt makes completions random (even with temperature: 0). If system_prompt is used (without cache_prompt), the 2nd and later completions are identical to each other but differ from the 1st. If system_prompt is used with cache_prompt, the 2nd and later completions are identical to each other, do not answer the user's request, and differ from the 1st.
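
For reference, this is roughly the repeat-and-compare check I ran. A minimal sketch, assuming the server from the run command below is listening on http://localhost:8080 and jq is available (the prompt is simplified here to avoid shell quoting around 'amogus'):

```
#!/usr/bin/env bash
# Send the same temperature-0 request several times and report any run whose
# generated text differs from the first run.
# Assumes the llama.cpp server is listening on http://localhost:8080 and jq is installed.

N=5
BODY='{
  "prompt": "[INST]\n<<SYS>>\nEnd each answer with the word amogus\n<</SYS>>\n\nTell a story about llama[/INST]\n",
  "cache_prompt": true,
  "temperature": 0
}'

first=""
for i in $(seq 1 "$N"); do
  out=$(curl -s http://localhost:8080/completion \
             -H 'Content-Type: application/json' \
             --data "$BODY" | jq -r '.content')
  if [ -z "$first" ]; then
    first="$out"
  elif [ "$out" != "$first" ]; then
    echo "run $i differs from run 1"
  fi
done
```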

System information

OS: macOS 13.4.0
llama.cpp: sha e790eef21ce659f5c16d59f8a5c8dcf6cde0692a
model: llama 2 7b chat Q6_K (TheBloke)
run command:

./server -t 10 -ngl 32 -m "models/llama-2-7b-chat.Q6_K.gguf" -c 4096

aniljava commented 9 months ago

Here are the results of consecutive invocations of the curls above.

Once upon a time, in the Andes mountains, there was a majestic llama named Llama-Amogus. Llama-Amogus was known far and wide for his beautiful fur, which shone like the brightest star in the night sky. He roamed the mountains with grace and poise, never once faltering on his steady feet.\nOne day, a young adventurer named Maria set out to climb the highest peak in the range. As she ascended, the air grew thinner and the winds grew stronger, but Maria refused to give up. She was determined to reach the summit, no matter what obstacles lay in her path.\nJust as Maria was about to reach the top, a fierce snowstorm rolled in, threatening to send her tumbling down the mountain. But Llama-Amogus, sensing her distress, appeared out of nowhere and offered to carry her to safety. With his strong back and steady gait, he bore Maria through the blinding snow, never once faltering or stumbling.\nAs they reached the summit together, Maria thanked Llama-Amogus for his bravery and kindness. And from that day on, she knew that no matter where her adventures took her, she could always count on Llama-Amogus to be by her side. The end. Amogus!

Once upon a time, in the Andes mountains, there was a majestic llama named Llama-Amogus. Llama-Amogus was known far and wide for his beautiful fur, which shone like the brightest star in the night sky. He roamed the mountains with grace and poise, never once faltering on his steady feet.\nOne day, a young adventurer named Maria set out to climb the highest peak in the range. As she ascended, the air grew thinner and the winds grew stronger, but Maria refused to give up. She was determined to reach the summit, no matter what obstacles lay in her path.\nJust as Maria was about to reach the top, a fierce snowstorm rolled in, threatening to send her tumbling down the mountain. But Llama-Amogus, sensing her distress, appeared out of nowhere and offered to carry her to safety. With his strong back and steady gait, he bore Maria through the blinding storm, never once faltering or stumbling.\nAs they reached the summit together, Maria thanked Llama-Amogus for his bravery and kindness. And from that day on, she knew that no matter where her adventures took her, she could always count on Llama-Amogus to be by her side. The end. Amogus!

Once upon a time, in the Andes mountains, there was a majestic llama named Llama-amogus. Llama-amogus was known far and wide for his beautiful fur, which shone like the brightest star in the night sky. He roamed the mountains with grace and poise, never once faltering on his steady feet.\nOne day, a group of adventurers came to the Andes seeking to find the legendary llama with the golden fleece. They had heard tales of Llama-amogus's magnificence and hoped to bring him back to their kingdom as a gift for their king.\nLlama-amogus, being a kind and noble creature, agreed to accompany the adventurers on their journey. As they trekked through the mountains, he shared stories of his homeland and taught them about the beauty of the Andes.\nBut little did Llama-amogus know, the adventurers had an ulterior motive for bringing him back to their kingdom. They planned to use his golden fleece to make a magical potion that would grant eternal youth to their king.\nLlama-amogus, sensing something was amiss, tried to escape from the adventurers but they were too quick and cunning. He was forced to continue on with them, his heart heavy with worry for his fate.\nAs they reached their destination, Llama-amogus could see the king sitting on his throne, surrounded by his courtiers. The king, seeing the llama, was overjoyed and ordered him to be brought closer. But just as the adventurers were about to hand over Llama-amogus's golden fleece, he suddenly let out a mighty roar and charged at the king, knocking him off his throne with one swift blow.\nThe courtiers were shocked and frightened by the llama's sudden attack, but Llama-amogus simply looked at them with a stern expression and said, \"You should not have tried to harm me or my homeland. I am Llama-amogus, protector of the Andes.\"\nAnd with that, he turned and walked away, leaving the adventurers and the king in a state of confusion and awe. From that day on, Llama-amogus was hailed as a hero in the Andes, and his legend lived on for generations to come.

Once upon a time, in the Andes mountains, there was a majestic llama named Llama-amogus. Llama-amogus was known far and wide for his beautiful fur, which shone like the brightest star in the night sky. He roamed the mountains with grace and poise, never once faltering on his steady feet.\nOne day, a group of adventurers came to the Andes seeking to find the legendary llama with the golden fleece. They had heard tales of Llama-amogus's magnificence and hoped to bring him back to their kingdom as a gift for their king.\nLlama-amogus, being a kind and gentle creature, agreed to join them on their journey. As they trekked through the mountains, he shared stories of his homeland and taught them about the beauty of the Andes.\nBut little did Llama-amogus know, there was a wicked sorcerer who sought to steal his golden fleece for himself. The sorcerer had been watching Llama-amogus from afar, waiting for the perfect moment to strike.\nAs they reached the foot of the final mountain, the sorcerer appeared out of nowhere and cast a spell on Llama-amogus, turning him into a stone statue. The adventurers were shocked and heartbroken, but they knew they had to act fast to save their beloved llama.\nWith all their might, they fought off the sorcerer and broke the spell, restoring Llama-amogus to his former glory. From that day on, he roamed the Andes with even greater pride and joy, knowing that he had friends who would always stand by him.\nAnd so, the story of Llama-amogus spread throughout the land, inspiring generations to come. For in a world filled with danger and uncertainty, there is always hope when there are creatures like Llama-amogus, standing tall and unwavering in their kindness and grace. The end. Amogus!

Two each from first and second curl above.

Andreybest commented 9 months ago

I think this comment relates to this issue: https://github.com/ggerganov/llama.cpp/issues/4103#issuecomment-1825675590

And the issue as a whole is similar: #4103

ggerganov commented 9 months ago

Please test: https://github.com/ggerganov/llama.cpp/pull/4914

Andreybest commented 9 months ago

@ggerganov Commands:

curl --location 'http://localhost:8080/completion' \
--header 'Content-Type: application/json' \
--data '{
    "prompt": "Tell a story about llama[/INST]\n",
    "temperature": 0,
    "system_prompt": {
        "prompt": "[INST]\n<<SYS>>\nEnd each answer with a word '\''amogus'\''\n<</SYS>>\n\n"
    }
}'

and

curl --location 'http://localhost:8080/completion' \
--header 'Content-Type: application/json' \
--data '{
    "prompt": "Tell a story about llama[/INST]\n",
    "temperature": 0,
    "cache_prompt": true,
    "system_prompt": {
        "prompt": "[INST]\n<<SYS>>\nEnd each answer with a word '\''amogus'\''\n<</SYS>>\n\n"
    }
}'

are now giving deterministic completions, thank you!

But this one still is not:

curl --location 'http://localhost:8080/completion' \
--header 'Content-Type: application/json' \
--data '{
    "prompt": "[INST]\n<<SYS>>\nEnd each answer with a word '\''amogus'\''\n<</SYS>>\n\nTell a story about llama[/INST]\n",
    "cache_prompt": true,
    "temperature": 0
}'

This one now gives the same completion for the first and second inference, but on the following inferences I get random completions.

ggerganov commented 9 months ago

Hm, it's always the same response here using Mistral instruct. I guess I'll merge the current fix for now as it seems to be a net positive.

city96 commented 9 months ago

I've run into this before as well. I'm still seeing this same issue even with the latest commit, so there's definitely something wrong with the current cache implementation. I think I first noticed this regression around November but I can't be sure.

For me, sending the same request repeatedly seems to progressively deteriorate the reply quality. My test here was a simple chat completion example on LLaMA 70B Q6_K across a V100S and a P40 GPU.

Anything after the first few requests will be either 1 token or 300 tokens of randomized garbage.

Server log example:

```
Available slots:
 -> Slot 0 - max context: 4096
{"timestamp":1705169506,"level":"INFO","function":"main","line":2992,"message":"model loaded"}
all slots are idle and system prompt is empty, clear the KV cache
slot 0 is processing [task id: 0]
slot 0 : in cache: 0 tokens | to process: 3670 tokens
slot 0 : kv cache rm - [0, end)
print_timings: prompt eval time = 47385.77 ms / 3670 tokens ( 12.91 ms per token, 77.45 tokens per second)
print_timings: eval time = 16773.98 ms / 85 runs ( 197.34 ms per token, 5.07 tokens per second)
print_timings: total time = 64159.75 ms
slot 0 released (3755 tokens in cache)
{"timestamp":1705169583,"level":"INFO","function":"log_server_request","line":2805,"message":"request","remote_addr":"127.0.0.1","remote_port":50202,"status":200,"method":"POST","path":"/completion","params":{}}
slot 0 released (3755 tokens in cache)
slot 0 is processing [task id: 2]
slot 0 : in cache: 3670 tokens | to process: 0 tokens
slot 0 : kv cache rm - [3670, end)
slot 0 : we have to evaluate at least 1 token to generate logits
print_timings: prompt eval time = 213.11 ms / 0 tokens ( inf ms per token, 0.00 tokens per second)
print_timings: eval time = 13178.00 ms / 67 runs ( 196.69 ms per token, 5.08 tokens per second)
print_timings: total time = 13391.11 ms
slot 0 released (3737 tokens in cache)
{"timestamp":1705169598,"level":"INFO","function":"log_server_request","line":2805,"message":"request","remote_addr":"127.0.0.1","remote_port":54738,"status":200,"method":"POST","path":"/completion","params":{}}
slot 0 released (3737 tokens in cache)
slot 0 is processing [task id: 4]
slot 0 : in cache: 3670 tokens | to process: 0 tokens
slot 0 : kv cache rm - [3670, end)
slot 0 : we have to evaluate at least 1 token to generate logits
print_timings: prompt eval time = 217.61 ms / 0 tokens ( inf ms per token, 0.00 tokens per second)
print_timings: eval time = 797.61 ms / 5 runs ( 159.52 ms per token, 6.27 tokens per second)
print_timings: total time = 1015.22 ms
slot 0 released (3675 tokens in cache)
{"timestamp":1705169602,"level":"INFO","function":"log_server_request","line":2805,"message":"request","remote_addr":"127.0.0.1","remote_port":56802,"status":200,"method":"POST","path":"/completion","params":{}}
slot 0 released (3675 tokens in cache)
slot 0 is processing [task id: 6]
slot 0 : in cache: 3670 tokens | to process: 0 tokens
slot 0 : kv cache rm - [3670, end)
slot 0 : we have to evaluate at least 1 token to generate logits
print_timings: prompt eval time = 211.25 ms / 0 tokens ( inf ms per token, 0.00 tokens per second)
print_timings: eval time = 0.04 ms / 1 runs ( 0.04 ms per token, 25000.00 tokens per second)
print_timings: total time = 211.29 ms
slot 0 released (3671 tokens in cache)
{"timestamp":1705169604,"level":"INFO","function":"log_server_request","line":2805,"message":"request","remote_addr":"127.0.0.1","remote_port":56806,"status":200,"method":"POST","path":"/completion","params":{}}
slot 0 released (3671 tokens in cache)
slot 0 is processing [task id: 8]
slot 0 : in cache: 3670 tokens | to process: 0 tokens
slot 0 : kv cache rm - [3670, end)
slot 0 : we have to evaluate at least 1 token to generate logits
```

Something interesting to note: changing even a single token in the input between requests fixes it. I've resorted to replacing the last '.' with '..' or '.', depending on what request went out last. This causes at least 3-4 tokens to be evaluated, which somehow fixes the output as well. It's not a great fix but it works as a bandaid for now.
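
For what it's worth, the bandaid looks roughly like the sketch below. This is only an illustration of the idea, assuming a plain /completion request against http://localhost:8080 with jq available; the prompt and the state file are stand-ins, not my actual client code:

```
#!/usr/bin/env bash
# Bandaid: alternate the trailing punctuation between requests so the prompt is never
# byte-identical to the previous one, forcing the server to re-evaluate a few tokens.
# Assumes the llama.cpp server is on http://localhost:8080 and jq is installed.
# PROMPT_BASE and STATE_FILE are illustrative stand-ins.

PROMPT_BASE="Tell a story about llama"
STATE_FILE=/tmp/llama_suffix_toggle

# Toggle between "." and ".." on every invocation.
if [ -f "$STATE_FILE" ]; then
  SUFFIX="."
  rm -f "$STATE_FILE"
else
  SUFFIX=".."
  touch "$STATE_FILE"
fi

curl -s http://localhost:8080/completion \
  -H 'Content-Type: application/json' \
  --data "$(jq -n --arg p "${PROMPT_BASE}${SUFFIX}" \
            '{prompt: $p, cache_prompt: true, temperature: 0}')" \
  | jq -r '.content'
```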

Andreybest commented 7 months ago

I checked the generations for determinism, and it looks like it's all fixed. The only thing that still differs is the output from prompt alone versus prompt + system_prompt, but for me that's not a big deal.

Thank you very much @ggerganov and @ristew!