ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Bug: Got meaningless output when set -j {}. #9934

Open morgen52 opened 1 week ago

morgen52 commented 1 week ago

What happened?

Hi! Thanks for your efforts in building such a great framework! I am working on deploying a custom service on my PC and learning to make llama.cpp produce structured output via the -j option. However, when I use -j {} as a starting point, I get meaningless output. I am not sure whether I am doing something wrong or there is a bug in the code. I would appreciate it if you could help me with this issue.
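For reference, here is a minimal reproduction of the setup I am describing (the model path and -ngl value are the same ones I use in the command shown later in this thread):

./llama.cpp-b3938/build_gpu/bin/llama-server \
    -m ../models/Meta-Llama-3-8B-Instruct-Q4_0.gguf \
    -ngl 30 \
    -j "{}"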

Name and Version

./llama.cpp-b3938/build_gpu/bin/llama-server --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
version: 7 (d9a33c5)
built with cc (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0 for x86_64-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output

{"content":"{ \"){  console.log(\" :{} } ","id_slot":0,"stop":true,"model":"../models/Meta-Llama-3-8B-Instruct-Q4_0.gguf","tokens_predicted":13,"tokens_evaluated":7,"generation_settings":{"n_ctx":8192,"n_predict":-1,"model":"../models/Meta-Llama-3-8B-Instruct-Q4_0.gguf","seed":4294967295,"seed_cur":287989699,"temperature":0.800000011920929,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":40,"top_p":0.949999988079071,"min_p":0.05000000074505806,"xtc_probability":0.0,"xtc_threshold":0.10000000149011612,"tfs_z":1.0,"typical_p":1.0,"repeat_last_n":64,"repeat_penalty":1.0,"presence_penalty":0.0,"frequency_penalty":0.0,"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.10000000149011612,"penalize_nl":false,"stop":[],"max_tokens":128,"n_keep":0,"n_discard":0,"ignore_eos":false,"stream":false,"n_probs":0,"min_keep":0,"grammar":"array ::= \"[\" space ( value (\",\" space value)* )? \"]\" space\nboolean ::= (\"true\" | \"false\") space\nchar ::= [^\"\\\\\\x7F\\x00-\\x1F] | [\\\\] ([\"\\\\bfnrt] | \"u\" [0-9a-fA-F]{4})\ndecimal-part ::= [0-9]{1,16}\nintegral-part ::= [0] | [1-9] [0-9]{0,15}\nnull ::= \"null\" space\nnumber ::= (\"-\"? integral-part) (\".\" decimal-part)? ([eE] [-+]? integral-part)? space\nobject ::= \"{\" space ( string \":\" space value (\",\" space string \":\" space value)* )? \"}\" space\nroot ::= object\nspace ::= | \" \" | \"\\n\" [ \\t]{0,20}\nstring ::= \"\\\"\" char* \"\\\"\" space\nvalue ::= object | array | string | number | boolean | null\n","samplers":["top_k","tfs_z","typ_p","top_p","min_p","xtc","temperature"]},"prompt":"What is the meaning of life?","has_new_line":false,"truncated":false,"stopped_eos":true,"stopped_word":false,"stopped_limit":false,"stopping_word":"","tokens_cached":19,"timings":{"prompt_n":7,"prompt_ms":195.664,"prompt_per_token_ms":27.951999999999998,"prompt_per_second":35.77561534058386,"predicted_n":13,"predicted_ms":1195.813,"predicted_per_token_ms":91.98561538461539,"predicted_per_second":10.871264988756602},"index":0}
ggerganov commented 1 week ago

Well, it doesn't have much context so the output is kind of expected.

You can be more specific in order to help the LLM understand what you want, for example:

curl \
    --request POST --url http://localhost:8080/completion \
    --header "Content-Type: application/json" \
    --data '{"prompt": "A pokemon character: name, skill, strength.\n\n","n_predict": 128}' | jq -r .content
{
  "name": "Pikachu",
  "skill": "Electric Attack",
  "strength": 10
} 

Also, for this kind of application you should not use instruction-tuned models, as they are trained for a specific chat template. Instead, use a base model.

morgen52 commented 1 week ago

Thanks for your great help. After trying your suggestion, I got the same output as you showed. However, when I try to generate multiple Pokémon characters (e.g., 10), it only shows the first one. How can I receive a JSON response with all 10 characters at once? Thanks again for your assistance!

I'm confident this isn't a model issue because when I remove the -j option, the model consistently provides 10 outputs (as shown below), though they aren't in JSON format.

curl \
    --request POST --url http://localhost:8080/completion \
    --header "Content-Type: application/json" \
    --data '{"prompt": "Please generate 10 pokemon characters: name, skill, strength, height, weight.\n\n"}' | jq -r .content
Here are the 10 Pokémon characters:

1. **Name:** Embermoth
**Skill:** Fire-type
**Strength:** 85
**Height:** 3.5 feet
**Weight:** 22 pounds

2. **Name:** Aquaflame
**Skill:** Water-type
**Strength:** 90
**Height:** 4.2 feet
**Weight:** 30 pounds

3. **Name:** Thunderbolt
**Skill:** Electric-type
**Strength:** 95
**Height:** 4.8 feet
**Weight:** 40 pounds

(...)
ggerganov commented 1 week ago

You can improve your JSON schema to support an array and lower the sampling temperature. For example:

-j "{\"type\":\"array\",\"items\":{}}"`
curl \
    --request POST --url http://localhost:8080/completion \
    --header "Content-Type: application/json" \
    --data '{"prompt": "Please generate 3 pokemon characters: name, skill, strength, height, weight.\n\n", "temperature": 0.1}' | jq -r .content
[
{"name": "Pikachu", "skill": "Thunderbolt", "strength": 80, "height": 0.4, "weight": 6.0},
{"name": "Charizard", "skill": "Flamethrower", "strength": 130, "height": 1.7, "weight": 90.5},
{"name": "Squirtle", "skill": "Tackle", "strength": 48, "height": 0.5, "weight": 9.0}
]
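If you want the items themselves to be constrained as well, the schema can spell out the fields. The following is only a sketch: the property names simply mirror the prompt, and whether keywords such as minItems/maxItems are honored depends on the JSON-schema-to-grammar converter in your build:

./llama.cpp-b3938/build_gpu/bin/llama-server -m ../models/Meta-Llama-3-8B-Instruct-Q4_0.gguf -ngl 30 \
    -j '{"type":"array","minItems":3,"maxItems":3,"items":{"type":"object","properties":{"name":{"type":"string"},"skill":{"type":"string"},"strength":{"type":"number"},"height":{"type":"number"},"weight":{"type":"number"}},"required":["name","skill","strength","height","weight"]}}'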
morgen52 commented 1 week ago

Thank you again! This time I encountered another issue while following your steps. I used the following command to start my server:

./llama.cpp-b3938/build_gpu/bin/llama-server -m ../models/Meta-Llama-3-8B-Instruct-Q4_0.gguf -ngl 30 -j "{\"type\":\"array\",\"items\":{}}"

And I sent a request using the command below and got an empty output [].

curl \
    --request POST --url http://localhost:8080/completion \
    --header "Content-Type: application/json" \
    --data '{"prompt": "Please generate 3 pokemon characters: name, skill, strength, height, weight.", "temperature": 0.1}' | jq -r .content

The only difference is that I removed the \n\n at the end of the prompt. I'm confused about why these two newlines are so important. What is the reason behind it?

ggerganov commented 1 week ago

To understand better what is going on, you have to think as if you are the LLM. You are asking it to complete the text "Please generate 3 pokemon characters: name, skill, strength, height, weight.", so it is very likely that the next text to be generated will start with some whitespace (either a space " " or a newline \n), because that is normally what text looks like after the end of a sentence. On the other hand, you are asking it to obey a grammar that requires the next character to be the opening bracket [. These are two conflicting requirements that cannot be satisfied at the same time, so obviously the result will not be good.

By adding the newlines to the prompt, you help satisfy the first requirement, and now it can continue generating according to the JSON schema without conflict. You can alternatively add a space instead of a newline:

"Please generate 3 pokemon characters: name, skill, strength, height, weight: "

Notice the space at the end. The logic is the same.
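To see where the conflict comes from, here is the relevant part of the grammar that the schema is converted to (these rules are copied from the grammar field in the log at the top of this issue; with the array schema the root rule would presumably point at array instead of object). Whitespace is only allowed after the opening bracket, never before it:

root   ::= object
object ::= "{" space ( string ":" space value ("," space string ":" space value)* )? "}" space
array  ::= "[" space ( value ("," space value)* )? "]" space
space  ::= | " " | "\n" [ \t]{0,20}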

morgen52 commented 1 week ago

Interesting explanation. I understand now. Unfortunately, adding just one space doesn't work for me—I have to add at least two spaces to get a proper response, lol.

But I'm curious, can't this issue be fixed on the software side? From a user's perspective, we don't typically add multiple spaces or line breaks at the end of each prompt. And the fact that not doing so results in ineffective outputs is quite confusing.

ggerganov commented 1 week ago

But I'm curious, can't this issue be fixed on the software side? From a user's perspective, we don't typically add multiple spaces or line breaks at the end of each prompt.

It's not an issue - it works exactly as it is supposed to. The user's perspective should be fixed 😄

morgen52 commented 1 week ago

Haha, thank you for your reply.