amazon-science / incremental-parsing

Incremental Python parser for constrained generation of code by LLMs.
Apache License 2.0

[Question] IncrementalLexer but with whitespaces and return in line between the words of my grammar #3

Closed AzizCode92 closed 4 months ago

AzizCode92 commented 4 months ago

Hi, first of all, thank you very much for this great work. I have defined a very simple EBNF grammar and adapted lark_to_context to it, the same way it is done for the calculator. While playing around with the interactive recognition script, I realised that if I put a space between the words of my grammar, the state becomes Invalid. The state is Completable only if I enter the input text in continuous form, with no separators. My grammar is very simple and looks like this:

start: class_def 
class_def: "CLASS" "MyClass" "DEFINITION"

CLASSMyClassDEF -> Completable
CLASS MyClass DEF -> Invalid

Any help on how to do that without explicitly adding WS between the words of my grammar? Thank you.

Edit: I found this example in the Lark documentation and adapted the grammar defined there, but it failed in the same way as the example above. https://lark-parser.readthedocs.io/en/latest/examples/indented_tree.html

dmelcer9 commented 1 month ago

Sorry, I didn't see this earlier. Currently, whitespace and comments are handled in python_lex_wrapper.py, in particular in calc_modified_hint (https://github.com/amazon-science/incremental-parsing/blob/e2b9eabfe4274b916e6cf2cf5081b76370d20a61/incremental_parsing/lex_earley/python_lex_wrapper.py#L148). This is because the exact rules for when various whitespace tokens are allowed end up being surprisingly language-dependent.

If you want to add whitespace to a custom language, you'd probably want to create a new implementation of AbstractLexer that wraps an IncrementalLexer in a similar manner to PythonLexWrapper. lexer_hint and initialize would need to add the appropriate whitespace tokens to the list of allowed tokens, and the other methods would need to turn the LexResultSuccess into LexResultPartial when a whitespace token is actually matched.

(For reference, the parser calls lexer_hint with the set of scannable terminals given by the grammar; the lexer will then fail fast if the in-progress symbol is not in that set. Because whitespace and comments are never part of the grammar itself, the wrapper needs to add these lexemes to the allowed set, and then handle the case where the lexer actually reads them.)
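
To make the idea concrete, here is a minimal, self-contained sketch of the "widen the hint, then drop the whitespace" pattern. It does not use the incremental_parsing classes at all: `TERMINALS`, `lex_one`, `lex_one_skipping_ignored`, and `recognize` are hypothetical names invented for this illustration, the lexer is not incremental, and the "parser" is just a fixed sequence of terminals standing in for the grammar in the question. A real implementation would still need to wrap IncrementalLexer behind AbstractLexer as described above.

```python
import re
from typing import Callable, FrozenSet, Optional, Tuple

# Hypothetical terminal table for the grammar in the question, plus a WS terminal
# that the grammar itself never mentions.
TERMINALS = {
    "CLASS": re.compile(r"CLASS"),
    "MYCLASS": re.compile(r"MyClass"),
    "DEFINITION": re.compile(r"DEFINITION"),
    "WS": re.compile(r"[ \t\r\n]+"),
}
IGNORED: FrozenSet[str] = frozenset({"WS"})

LexResult = Optional[Tuple[str, int]]
LexFn = Callable[[str, int, FrozenSet[str]], LexResult]


def lex_one(text: str, pos: int, allowed: FrozenSet[str]) -> LexResult:
    """Toy fail-fast lexer: longest match among the allowed terminals, else None."""
    best: LexResult = None
    for name in allowed:
        m = TERMINALS[name].match(text, pos)
        if m and m.end() > pos and (best is None or m.end() > best[1]):
            best = (name, m.end())
    return best


def lex_one_skipping_ignored(text: str, pos: int, allowed: FrozenSet[str]) -> LexResult:
    """Wrapper: add ignored terminals to the hint, and swallow them when matched."""
    widened = allowed | IGNORED
    while True:
        result = lex_one(text, pos, widened)
        if result is None or result[0] not in IGNORED:
            return result  # either a failure or a token the parser asked for
        pos = result[1]  # whitespace was consumed; the parser never sees it


def recognize(text: str, lex: LexFn) -> bool:
    """Stand-in for the parser: class_def is a fixed sequence of three terminals."""
    pos = 0
    for expected in ("CLASS", "MYCLASS", "DEFINITION"):
        result = lex(text, pos, frozenset({expected}))
        if result is None:
            return False
        pos = result[1]
    return pos == len(text)


if __name__ == "__main__":
    print(recognize("CLASS MyClass DEFINITION", lex_one))
    print(recognize("CLASS MyClass DEFINITION", lex_one_skipping_ignored))
```

With the plain `lex_one`, the spaced input is rejected because WS is never in the set of terminals the parser asks for. With the wrapper, the same input is recognized, since the whitespace is allowed by the widened hint but filtered out before the result is handed back.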