darxkies closed this issue 11 months ago
I'm also experiencing this issue on the latest commit (fe680e3d1080a765e5d3150ffd7bab189742898d). Grammar sampling was working fine in the Oct 28 commit (ff3bad83e29e3009010cbc923bebd769055eaa7f), but after I switched to the latest commit the generated output became gibberish. My grammar also contains whitespace, which I suspect might be the issue.
I tracked the source of the issue down to commit 5f6e0c0dff1e7a89331e6b25eca9a9fd71324069. It originates from an optimization that replaced llama_token_to_piece with model.vocab.id_to_token[id].text. This causes problems with models like CodeLlama that use the SentencePiece tokenizer, because their vocab entries need llama_unescape_whitespace to unescape whitespace.
Reverting the changes in llama_sample_grammar and llama_grammar_accept_token fixed the issue for me:
// const std::string & piece = ctx->model.vocab.id_to_token[id].text;
const std::string piece = llama_token_to_piece(ctx, id);
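For context, here is a minimal standalone sketch of why the raw vocab text breaks grammar matching. This is not the llama.cpp implementation; the unescape_whitespace helper below only mimics what llama_unescape_whitespace does. SentencePiece vocabularies store a leading space as the metasymbol U+2581 ("\xe2\x96\x81"), which llama_token_to_piece converts back to a real space before the grammar sees it, whereas id_to_token[id].text keeps the raw, escaped form:

```cpp
// Minimal sketch, not llama.cpp code: show the difference between the raw
// SentencePiece vocab text and the unescaped piece the grammar expects.
#include <iostream>
#include <string>

// Assumption: this mimics what llama_unescape_whitespace does -- replace every
// occurrence of the SentencePiece whitespace metasymbol U+2581 with ' '.
static std::string unescape_whitespace(std::string text) {
    const std::string meta = "\xe2\x96\x81"; // UTF-8 encoding of U+2581
    size_t pos = 0;
    while ((pos = text.find(meta, pos)) != std::string::npos) {
        text.replace(pos, meta.size(), " ");
        pos += 1;
    }
    return text;
}

int main() {
    // Raw vocab text as stored for a SentencePiece token such as "▁the"
    const std::string raw = "\xe2\x96\x81the";
    std::cout << "raw piece:       '" << raw << "'\n";                      // prints '▁the'
    std::cout << "unescaped piece: '" << unescape_whitespace(raw) << "'\n"; // prints ' the'
    return 0;
}
```

A grammar rule expecting a literal ' ' never matches the raw piece, which is why the grammar stacks end up empty and the assert fires.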
@darxkies Can you check whether reverting back to llama_token_to_piece fixes the issue you experienced?
@AlienKevin, I can confirm that this fixes the issue. I experienced the same problem with Emerhyst 20B and Llama 2 Chat 13B.
@@ -7554,11 +7546,13 @@ void llama_sample_grammar(struct llama_context * ctx, llama_token_data_array * c
const llama_token eos = llama_token_eos(&ctx->model);
std::vector<std::pair<std::vector<uint32_t>, llama_partial_utf8>> candidates_decoded;
+ candidates_decoded.reserve(candidates->size);
std::vector<llama_grammar_candidate> candidates_grammar;
+ candidates_grammar.reserve(candidates->size);
for (size_t i = 0; i < candidates->size; ++i) {
const llama_token id = candidates->data[i].id;
- const std::string piece = llama_token_to_piece(ctx, id);
+ const std::string & piece = ctx->model.vocab.id_to_token[id].text;
if (id == eos) {
if (!allow_eos) {
candidates->data[i].logit = -INFINITY;
@@ -7770,7 +7764,7 @@ void llama_grammar_accept_token(struct llama_context * ctx, struct llama_grammar
GGML_ASSERT(false);
}
- const std::string piece = llama_token_to_piece(ctx, token);
+ const std::string & piece = ctx->model.vocab.id_to_token[token].text;
// Note terminating 0 in decoded string
const auto decoded = decode_utf8(piece, grammar->partial_utf8);
@AlienKevin By reverting the last two changes from the patch above, which were introduced in b1614, it works again.
Is anyone interested in a pull request to revert the two lines?
I think that besides the PR, we might benefit from several regression tests for grammar sampling. The current test for grammar sampling seems to be manually written and is quite hard to read. Maybe we can add some simpler tests based on larger grammars that just ensure the sampling process never fails? I'd be happy to contribute my grammar as a test case.
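A rough sketch of what such a regression test could look like. Note that generate_with_grammar is a placeholder, not an existing llama.cpp helper, and the grammar paths are just examples of the .gbnf files shipped in the repository; a real test would implement the helper on top of the sampling API (llama_sample_grammar / llama_grammar_accept_token) with a small model:

```cpp
// Rough regression-test skeleton, not working llama.cpp test code.
#include <cassert>
#include <string>
#include <vector>

struct generation_result {
    bool        sampling_failed; // e.g. the grammar stacks became empty
    std::string text;            // generated output (not checked by this test)
};

// Placeholder stub so the sketch compiles; a real test would load a small
// model and run grammar-constrained sampling for n_tokens steps.
static generation_result generate_with_grammar(const std::string & /*grammar_path*/, int /*n_tokens*/) {
    return { false, "" };
}

int main() {
    // Example grammar files containing literal spaces/newlines.
    const std::vector<std::string> grammar_files = {
        "grammars/json.gbnf",
        "grammars/arithmetic.gbnf",
        "grammars/chess.gbnf",
    };
    for (const auto & path : grammar_files) {
        const generation_result res = generate_with_grammar(path, /*n_tokens=*/128);
        // The only invariant: grammar-constrained sampling never fails outright.
        assert(!res.sampling_failed);
    }
    return 0;
}
```

The idea is that the test does not inspect the generated text at all; it only asserts that constrained sampling completes without tripping the grammar asserts.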
Submitted a PR with the two line fix: https://github.com/ggerganov/llama.cpp/pull/4396
It would suffice if it were tested against the included grammar files, I think. That is what I used to reproduce the bug.
There is another bug I found recently that I still have to report; it causes the OS to kill the process.
With the latest commit it works again as intended.
Expected Behavior
Being able to use grammar files that contain whitespace.
Current Behavior
When using a grammar file whose rules contain ' ' or \n, both main and server crash with the following error:
GGML_ASSERT: llama.cpp:7776: !grammar->stacks.empty()
ptrace: Operation not permitted.
No stack.
The program is not being run.
Environment and Context
CPU: i9-12900H
OS: Arch
Steps to Reproduce
b1613 was the last tag that worked properly. With later versions it keeps crashing.
It is not related to the model. It failed with several 7b/13b models.