ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Bug: Model ignores system prompt when using the `/completion` endpoint #8393

Open andreys42 opened 2 weeks ago

andreys42 commented 2 weeks ago

What happened?

I'm testing the Meta-Llama-3-8B-Instruct-Q8_0 model with the llama.cpp HTTP server, both through the chatui interface and through direct requests via Python's requests library.

When I use chatui with the chatPromptTemplate option, everything works fine, and the model's output is predictable and desirable.

However, when I make direct requests to the same server with the same model, the output is messy (lots of newline characters, repetition of the question, and so on) and most of the system instructions are ignored, though the general logic of the output is fine. For example, when I ask it to answer only with 0 or 1, the model still tries to justify its decision in the output.
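For context, a direct request of this kind looks roughly like the sketch below (the server address and payload are assumptions, not taken from this issue). The /completion endpoint forwards the prompt to the model verbatim, so none of the Llama 3 special tokens are added unless they are already in the string:

```python
# Minimal sketch of a direct /completion request (assumed server address and
# payload; the exact request is not shown in this issue). Unlike the
# OpenAI-compatible chat endpoint, /completion forwards the prompt verbatim,
# with no chat template applied server-side.
import requests

resp = requests.post(
    "http://localhost:8080/completion",  # assumed server address
    json={
        "prompt": "Answer only with 0 or 1. Is the sky green?",  # raw text, no Llama 3 tokens
        "n_predict": 64,
    },
)
print(resp.json()["content"])
```

If the prompt is sent like this, the model never sees the special tokens that mark the system turn, which would match the behaviour described above.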

My attempts so far have been:

  1. Using the same template (chatPromptTemplate from chatui) as the prompt key, with the user requests and assistant answers filled in.

  2. Using {"chat-template": "llama3"}.

  3. Passing the current user's prompt as a raw string in the prompt key, with the "system_prompt" key carrying the system instructions.

I've spent a lot of time trying to figure out the issue, but all of these approaches work much worse than the chatui route.

I believe the problem lies in my understanding of how to format the input prompts, and I'm not familiar enough with the syntax documentation.

Name and Version

latest libs, Meta-Llama-3-8B-Instruct-Q8_0

What operating system are you seeing the problem on?

No response

Relevant log output

No response

dspasyuk commented 2 weeks ago

@andreys42 Unless you are running llama-cli in conversation mode (-cnv), you will need to use --in-prefix/--in-suffix or wrap your input in the Llama 3 prompt template.
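For reference, a minimal sketch of what wrapping the input in the Llama 3 prompt template looks like against the /completion endpoint (the server address, sampling parameters, and example messages are assumptions):

```python
# Minimal sketch of wrapping a system + user message in the Llama 3 instruct
# template before posting it to /completion. Server address, sampling
# parameters, and the example messages are assumptions.
import requests

LLAMA3_TEMPLATE = (
    "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
    "{system}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n"
    "{user}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
)

prompt = LLAMA3_TEMPLATE.format(
    system="Answer only with 0 or 1. Do not explain your answer.",
    user="Is 7 a prime number?",
)

resp = requests.post(
    "http://localhost:8080/completion",  # assumed server address
    json={
        "prompt": prompt,
        "n_predict": 16,
        "temperature": 0,
        "stop": ["<|eot_id|>"],  # stop at the end-of-turn token
    },
)
print(resp.json()["content"])
```

With temperature 0 and a stop string on the end-of-turn token, the reply should come back as a bare 0 or 1 if the missing template is indeed the issue.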

andreys42 commented 2 weeks ago

> @andreys42 Unless you are running llama-cli in conversation mode (-cnv), you will need to use --in-prefix/--in-suffix or wrap your input in the Llama 3 prompt template.

@dspasyuk thanks for the suggestion, --in-prefix/--in-suffix does indeed make sense, I will try that, thank you. As for using the Llama 3 prompt template for my input, I did that and mentioned it before; it made no difference for me...

matteoserva commented 2 weeks ago

You are probably using the wrong template.

Send your request to the /completion endpoint, then open the /slots endpoint to see what was actually sent to the model.

You can compare the good and bad prompts to see what went wrong.
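A quick sketch of that check (the server address is an assumption, and the exact fields in the /slots response may vary by version):

```python
# Quick sketch: after sending a request, fetch /slots and print what each slot
# actually received. The field names ("id", "prompt") are assumptions and may
# differ between server versions; depending on the build, the endpoint may
# need to be enabled when starting the server.
import requests

slots = requests.get("http://localhost:8080/slots").json()  # assumed server address
for slot in slots:
    print(f"slot {slot.get('id')}:")
    print(slot.get("prompt", "<no prompt field in this version>"))
    print("-" * 60)
```

Diffing the printed prompt against the one chatui produces should show whether the Llama 3 special tokens are present.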

dspasyuk commented 2 weeks ago

@andreys42 here are the settings I use in llama.cui, which work well across major models:

../llama.cpp/llama-cli --model ../../models/meta-llama-3-8b-instruct-q5_k_s.gguf --n-gpu-layers 25 -cnv --simple-io -b 2048 --ctx_size 0 --temp 0 --top_k 10 --multiline-input --chat-template llama3 --log-disable

Here is the result:

(Video attachment: Screencast from 2024-07-10 10:20:44 AM.webm)

You can test it for yourself here: https://github.com/dspasyuk/llama.cui