darxkies closed this issue 11 months ago
I'm also experiencing this issue on the latest commit (fe680e3d1080a765e5d3150ffd7bab189742898d). Grammar sampling was working fine in the Oct 28 commit (ff3bad83e29e3009010cbc923bebd769055eaa7f), but after I switched to the latest commit the generated output became gibberish. My grammar also contains whitespace, which I suspect might be the issue.
I tracked the source of the issue down to commit 5f6e0c0dff1e7a89331e6b25eca9a9fd71324069. It originates from an optimization that replaced llama_token_to_piece with model.vocab.id_to_token[id].text. This causes problems with models like CodeLlama that use the SentencePiece tokenizer, because their vocab entries need llama_unescape_whitespace to unescape whitespace.
Reverting the changes in llama_sample_grammar and llama_grammar_accept_token fixed the issue for me:
// const std::string & piece = ctx->model.vocab.id_to_token[id].text;
const std::string piece = llama_token_to_piece(ctx, id);
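For context, here is a minimal standalone sketch of why the raw vocab text breaks grammar matching. This is not the llama.cpp implementation; the unescape_whitespace helper below only mimics what llama_unescape_whitespace does. SentencePiece vocabularies store a leading space as the metasymbol U+2581 ("\xe2\x96\x81"), which llama_token_to_piece converts back to a real space before the grammar sees it, whereas id_to_token[id].text keeps the raw, escaped form:

```cpp
// Minimal sketch, not llama.cpp code: show the difference between the raw
// SentencePiece vocab text and the unescaped piece the grammar expects.
#include <iostream>
#include <string>

// Assumption: this mimics what llama_unescape_whitespace does -- replace every
// occurrence of the SentencePiece whitespace metasymbol U+2581 with ' '.
static std::string unescape_whitespace(std::string text) {
    const std::string meta = "\xe2\x96\x81"; // UTF-8 encoding of U+2581
    size_t pos = 0;
    while ((pos = text.find(meta, pos)) != std::string::npos) {
        text.replace(pos, meta.size(), " ");
        pos += 1;
    }
    return text;
}

int main() {
    // Raw vocab text as stored for a SentencePiece token such as "▁the"
    const std::string raw = "\xe2\x96\x81the";
    std::cout << "raw piece:       '" << raw << "'\n";                      // prints '▁the'
    std::cout << "unescaped piece: '" << unescape_whitespace(raw) << "'\n"; // prints ' the'
    return 0;
}
```

A grammar rule expecting a literal ' ' never matches the raw piece, which is why the grammar stacks end up empty and the assert fires.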
@darxkies Can you check whether reverting back to llama_token_to_piece fixes the issue you experienced?
@AlienKevin, I can confirm that this fixes the issue. I experienced the same problem with Emerhyst 20B and Llama 2 Chat 13B.
@@ -7554,11 +7546,13 @@ void llama_sample_grammar(struct llama_context * ctx, llama_token_data_array * c
const llama_token eos = llama_token_eos(&ctx->model);
std::vector<std::pair<std::vector<uint32_t>, llama_partial_utf8>> candidates_decoded;
+ candidates_decoded.reserve(candidates->size);
std::vector<llama_grammar_candidate> candidates_grammar;
+ candidates_grammar.reserve(candidates->size);
for (size_t i = 0; i < candidates->size; ++i) {
const llama_token id = candidates->data[i].id;
- const std::string piece = llama_token_to_piece(ctx, id);
+ const std::string & piece = ctx->model.vocab.id_to_token[id].text;
if (id == eos) {
if (!allow_eos) {
candidates->data[i].logit = -INFINITY;
@@ -7770,7 +7764,7 @@ void llama_grammar_accept_token(struct llama_context * ctx, struct llama_grammar
GGML_ASSERT(false);
}
- const std::string piece = llama_token_to_piece(ctx, token);
+ const std::string & piece = ctx->model.vocab.id_to_token[token].text;
// Note terminating 0 in decoded string
const auto decoded = decode_utf8(piece, grammar->partial_utf8);
@AlienKevin By reverting the last two changes from the patch above, which were introduced in b1614, it works again.
Is anyone interested in a pull request to revert the two lines?
I think that besides the PR, we might benefit from several regression tests for grammar sampling. The current test for grammar sampling seems to be manually written and is quite hard to read. Maybe we can add some simpler tests based on larger grammars that just ensure the sampling process never fails? I'd be happy to contribute my grammar as a test case.
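A rough sketch of what such a regression test could look like. Note that generate_with_grammar is a placeholder, not an existing llama.cpp helper, and the grammar paths are just examples of the .gbnf files shipped in the repository; a real test would implement the helper on top of the sampling API (llama_sample_grammar / llama_grammar_accept_token) with a small model:

```cpp
// Rough regression-test skeleton, not working llama.cpp test code.
#include <cassert>
#include <string>
#include <vector>

struct generation_result {
    bool        sampling_failed; // e.g. the grammar stacks became empty
    std::string text;            // generated output (not checked by this test)
};

// Placeholder stub so the sketch compiles; a real test would load a small
// model and run grammar-constrained sampling for n_tokens steps.
static generation_result generate_with_grammar(const std::string & /*grammar_path*/, int /*n_tokens*/) {
    return { false, "" };
}

int main() {
    // Example grammar files containing literal spaces/newlines.
    const std::vector<std::string> grammar_files = {
        "grammars/json.gbnf",
        "grammars/arithmetic.gbnf",
        "grammars/chess.gbnf",
    };
    for (const auto & path : grammar_files) {
        const generation_result res = generate_with_grammar(path, /*n_tokens=*/128);
        // The only invariant: grammar-constrained sampling never fails outright.
        assert(!res.sampling_failed);
    }
    return 0;
}
```

The idea is that the test does not inspect the generated text at all; it only asserts that constrained sampling completes without tripping the grammar asserts.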
Submitted a PR with the two line fix: https://github.com/ggerganov/llama.cpp/pull/4396
It would suffice if it were tested against the included grammar files, I think. That is what I used to reproduce the bug.
There is another bug I found recently that I still have to report; it causes the OS to kill the process.
With the latest commit it works again as intended.
Expected Behavior
Being able to use grammar files that contain whitespace.
Current Behavior
When using a grammar file whose rules contain ' ' or \n, both main and server crash with the following error:
GGML_ASSERT: llama.cpp:7776: !grammar->stacks.empty()
ptrace: Operation not permitted.
No stack.
The program is not being run.
Environment and Context
CPU: i9-12900H
OS: Arch
Steps to Reproduce
b1613 was the last tag that worked properly. With later versions it keeps crashing.
It is not related to the model. It failed with several 7b/13b models.