abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License
Grammars bracket repetition symbol not working #1547

Open Viagounet opened 1 week ago

Viagounet commented 1 week ago

Hello, I checked for similar issues about this problem but couldn't find any. I'm unable to use the bracket repetition syntax (for example {1,}) when working with grammars.

I'm using Ubuntu 20.04, Python 3.12 and llama_cpp_python==0.2.79.

The following works fine:

from llama_cpp import LlamaGrammar

grammar_string = r"""root ::= "repeating" [a-z]+"""
my_grammar = LlamaGrammar.from_string(grammar_string, verbose=True)

But this doesn't:

from llama_cpp import LlamaGrammar

grammar_string = r"""root ::= "repeating" [a-z]{1,}"""
my_grammar = LlamaGrammar.from_string(grammar_string, verbose=True)

It returns this error:

parse: error parsing grammar: expecting newline or end at {1,}
Traceback (most recent call last):
    my_grammar = LlamaGrammar.from_string(grammar_string, verbose=True)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/anaconda3/envs/DocLLM/lib/python3.12/site-packages/llama_cpp/llama_grammar.py", line 71, in from_string
    raise ValueError(
ValueError: from_string: error parsing grammar file: parsed_grammar.rules is empty

The llama.cpp GBNF Guide seems to say it should be possible to use this pattern?

Repetition and Optional Symbols

* after a symbol or sequence means that it can be repeated zero or more times (equivalent to {0,}).
+ denotes that the symbol or sequence should appear one or more times (equivalent to {1,}).
? makes the preceding symbol or sequence optional (equivalent to {0,1}).
{m} repeats the preceding symbol or sequence exactly m times
{m,} repeats the preceding symbol or sequence at least m times
{m,n} repeats the preceding symbol or sequence between m and n times (inclusive)
{0,n} repeats the preceding symbol or sequence at most n times (inclusive)

Not sure if my understanding of GBNF is lacking or if it's a real bug. Thank you!

yamikumo-DSD commented 6 days ago

Same here. Even an example that llama.cpp itself provides doesn't work with it, so I don't think it's your fault. See llama.cpp's JSON GBNF example.

C0deMunk33 commented 5 days ago

Running into this as well with a slightly more complicated GBNF, on the line value ::= object_type_1 | object_type_2

"parse: error parsing grammar: expecting newline or end at object_type_1 | object_type_2"

yamikumo-DSD commented 5 days ago

This is a snippet of the test GBNF that llama-cpp-python ships in llama_grammar.py.

string ::=
  "\"" (
    [^"\\\x7F\x00-\x1F] |
    "\\" (["\\/bfnrt] | "u" [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F]) # escapes
  )* "\"" ws

In this code, repetition is done by literally writing the token out as many times as needed, which differs from what the original JSON GBNF sample does. So my current understanding is that the missing {m} support is intended behavior in llama-cpp-python's GBNF, unlike the original GBNF, and not a bug. I'm still not sure why llama-cpp-python chose this behavior, though.

Viagounet commented 4 days ago

Right, thanks for your answers. I ended up writing a function to automatically convert the bracket syntax into a set of repeating tokens. Not very elegant, but works well enough.
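
The function itself was not posted in the thread. As a rough sketch of what such a conversion could look like (not the author's actual code), the hypothetical helper expand_braces below rewrites X{m}, X{m,} and X{m,n} using only *, ? and plain repetition. It assumes X is a quoted string, a character class, a parenthesized group, or a rule name, and it rejects {0} instead of guessing how an empty sequence should be written.

```python
import re

_BRACE = re.compile(r"\{(\d+)(,(\d*))?\}")   # {m}, {m,} or {m,n}
_NAME = re.compile(r"[A-Za-z0-9_-]+")        # rule names (kept deliberately permissive)


def _skip_delimited(text, i, close):
    """Return the index just past the closing delimiter, honoring backslash escapes."""
    i += 1
    while i < len(text) and text[i] != close:
        i += 2 if text[i] == "\\" else 1
    return i + 1


def _repeat(target, match):
    """Build the expansion of `target` for one parsed brace quantifier."""
    lo = int(match.group(1))
    if match.group(2) is None:            # {m}: exactly m copies
        parts = [target] * lo
    elif match.group(3) == "":            # {m,}: m copies, then zero or more
        parts = [target] * lo + [target + "*"]
    else:                                 # {m,n}: m copies, then n - m optional ones
        hi = int(match.group(3))
        if hi < lo:
            raise ValueError(f"invalid range {match.group()}")
        parts = [target] * lo + [target + "?"] * (hi - lo)
    if not parts:
        raise ValueError(f"{match.group()} would produce an empty sequence")
    return "(" + " ".join(parts) + ")"


def expand_braces(grammar):
    """Rewrite brace quantifiers in a GBNF string into *, ? and repetition."""
    out = []       # pieces of the rewritten grammar
    item = None    # index in `out` where the most recent repeatable item starts
    stack = []     # start indexes of currently open parentheses
    i = 0
    while i < len(grammar):
        ch = grammar[i]
        if ch in '"[':                    # quoted literal or character class
            j = _skip_delimited(grammar, i, '"' if ch == '"' else "]")
            item = len(out)
            out.append(grammar[i:j])
            i = j
        elif ch == "#":                   # comment: copy through end of line
            j = grammar.find("\n", i)
            j = len(grammar) if j == -1 else j
            out.append(grammar[i:j])
            item = None
            i = j
        elif ch == "(":
            stack.append(len(out))
            out.append(ch)
            item = None
            i += 1
        elif ch == ")":                   # the whole group becomes the current item
            item = stack.pop() if stack else None
            out.append(ch)
            i += 1
        elif ch == "{":
            match = _BRACE.match(grammar, i)
            if match is None or item is None:
                raise ValueError(f"cannot expand brace at offset {i}")
            target = "".join(out[item:]).strip()
            del out[item:]
            out.append(_repeat(target, match))
            i = match.end()
        elif (name := _NAME.match(grammar, i)):
            item = len(out)
            out.append(name.group())
            i = name.end()
        else:                             # whitespace or operators such as ::= and |
            if not ch.isspace():
                item = None
            out.append(ch)
            i += 1
    return "".join(out)
```

With this sketch, expand_braces('root ::= "repeating" [a-z]{1,}') returns 'root ::= "repeating" ([a-z] [a-z]*)', which avoids the brace syntax that the parser rejected above. Braces inside quoted strings and comments are left untouched.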

HanClinto commented 2 days ago

Bracket support for grammars was added about 3 weeks ago in https://github.com/ggerganov/llama.cpp/pull/6640 -- is this Python library referencing a version that includes this newest change?
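
One way to answer that question locally is to look at the installed package version and probe the parser directly. The sketch below is only an illustration: installed_version and braces_supported are hypothetical helper names, and the probe relies on LlamaGrammar.from_string raising ValueError for a grammar it cannot parse, as in the traceback earlier in this thread. The parse parameter is an assumption added so the probe can be exercised without the library installed.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package="llama_cpp_python"):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def braces_supported(parse=None):
    """Return True if the grammar parser accepts the {m,} repetition syntax.

    `parse` defaults to LlamaGrammar.from_string and is imported lazily,
    so this module can be loaded even when llama_cpp is not installed.
    """
    if parse is None:
        from llama_cpp import LlamaGrammar
        parse = LlamaGrammar.from_string
    try:
        parse('root ::= "repeating" [a-z]{1,}')
    except ValueError:
        # Same failure mode as reported above: the grammar did not parse.
        return False
    return True
```

Calling braces_supported() with no argument runs the probe against the installed llama_cpp; printing installed_version() alongside it shows which release produced the result.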