ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Bug: The output of llama-cli is not the same as the output of llama-server #7973

Closed: ztrong-forever closed this issue 1 month ago

ztrong-forever commented 2 months ago

What happened?

run llama-cli:

./bin/llama-cli -m ./models/Meta-Llama-3-8B-Instruct.Q2_K.gguf -n 512 --repeat_penalty 1.0 --color -i -r "User:" -f prompts/chat-with-bob.txt

run llama-server:

./bin/llama-server -m ./models/Meta-Llama-3-8B-Instruct.Q2_K.gguf -c 2048
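To compare the two front ends directly, the server can be queried with the same sampling parameters that were passed to llama-cli. A minimal sketch, assuming the server is listening on its default address 127.0.0.1:8080 and that the prompt string is the contents of prompts/chat-with-bob.txt (placeholder below):

curl http://127.0.0.1:8080/completion -H "Content-Type: application/json" -d '{"prompt": "<contents of prompts/chat-with-bob.txt>", "n_predict": 512, "repeat_penalty": 1.0}'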

Name and Version

llama-cli: version: 3164 (df68d4fa) built with cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 for x86_64-linux-gnu

llama-server: version: 3164 (df68d4fa) built with cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 for x86_64-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output

No response

dspasyuk commented 2 months ago

I have been told that I need to use a specific prompt format for instruct models, which I use in my config, but it still does not work with Llama 3 Instruct. I am still waiting for a reply, see here: https://github.com/ggerganov/llama.cpp/issues/7929#issue-2352272658

ztrong-forever commented 2 months ago

I have been told that I need to use a specific prompt format for instruct models, which I use in my config, but it still does not work with Llama 3 Instruct. I am still waiting for a reply, see here: #7929 (comment)

Have you tried comparing the results of llama-cli and llama-server?

dspasyuk commented 2 months ago

@ztrong-forever llama-server seems to work fine if you select the right "prompt style" (llama3 in this case). If llama-cli is run with a small context like 512, it stops outputting anything once the context window is filled; the server, once its context window is filled, just prints empty lines, slashes, or other strange things:

Here is how I run the server: ./llama-server -m ../../models/meta-llama-3-8b-instruct_q5_k_s.gguf --gpu-layers 35 -c 512, then in the new UI select Llama 3.

Screencast from 2024-06-18 04:17:48 PM.webm

Here is one for cli:

llama.cpp/llama-cli --model ../../models/meta-llama-3-8b-instruct_q5_k_s.gguf --n-gpu-layers 35 -cnv --interactive-first --simple-io --interactive -b 512 --ctx_size 512 --temp 0.3 --top_k 10 --multiline-input --repeat_penalty 1.12 -t 6 --chat-template llama3
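As an aside, one way to reduce the degradation once the window fills is to give the model a larger context and retain the initial prompt when the context shifts. A rough sketch of the same invocation with those adjustments (the flag values here are illustrative, not a recommendation):

llama.cpp/llama-cli --model ../../models/meta-llama-3-8b-instruct_q5_k_s.gguf --n-gpu-layers 35 -cnv --interactive-first --simple-io --multiline-input --chat-template llama3 --ctx_size 2048 --keep -1 --temp 0.3 --top_k 10 --repeat_penalty 1.12 -t 6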

ztrong-forever commented 2 months ago


Thanks! It works on my side as well!

github-actions[bot] commented 1 month ago

This issue was closed because it has been inactive for 14 days since being marked as stale.