arithy / packcc

A parser generator for C
Other
347 stars 28 forks source link

Lookahead woes and more #72

Closed joagre closed 7 months ago

joagre commented 11 months ago

I have been trying to write a small PEG to parse the following in a file named simple.sa:

42,
a = 42,
a = b + c * d,
a.b,
a.b.c,
a.b(1, 2),
42.b,
4711.b(),
a[666],
a[777].foo

It works except for the last a[777].foo.

At times I think there must be a bug in packcc's lookahead pattern support and then I realize it must be me misunderstanding something central. I wish there was more documentation on how to successfully use positive and negative lookahead. I'm stumped. Can anyone have mercy on me and point out what I need to do to the PEG below to work as expected?

Cheers /J

# File: simple.peg
# Test: packcc simple.peg && gcc -o simple simple.c && ./simple < simple.sa

%prefix "satie"

%earlysource {
    static const char *dbg_str[] = { "Evaluating rule", "Matched rule", "Abandoning rule" };
    #define PCC_DEBUG(auxil, event, rule, level, pos, buffer, length) \
        fprintf(stderr, "%*s%s %s @%zu [%.*s]\n", (int)((level) * 2), "", dbg_str[event], rule, pos, (int)(length), buffer)
}

Program            <- _ TopLevelExpr (_ "," _ TopLevelExpr)* EOF
TopLevelExpr       <- Binding / Expr

Binding            <- MatchPattern _ "=" _ Expr
MatchPattern       <- Literal / FieldAccess / Symbol

Expr               <- Add
Add                <- Multiplicate (_ "+" _  Multiplicate)*
Multiplicate       <- Indexing (_ "*" _  Indexing)*
Indexing           <- Symbol _ "[" _ Expr _ "]"  / FunctionCall
FunctionCall       <- Symbol _ "(" _ ExprSequence? _ ")" / FieldAccess
FieldAccess        <- HasField (_ "." _ HasField)* / Primary
HasField           <- Symbol !(_ ("(" / "[")) / Literal / Indexing / FunctionCall
Primary            <- Literal / Symbol
ExprSequence       <- Expr (_ "," _ Expr)*
Literal            <- NumberLiteral
NumberLiteral      <- [0-9]+
Symbol             <- [a-zA-Z_][a-zA-Z_0-9_]*

_                  <- WS*
WS                 <- [ \t\r\n]
EOF                <- _ !.

%%
int main() {
    satie_context_t *context = satie_create(NULL);
    satie_parse(context, NULL);
    satie_destroy(context);
    return 0;
}
dolik-rce commented 11 months ago

I don't think it is a problem related to lookahead...

The a[777].foo correctly matches FunctionCall rule, but it doesn't match Indexing. That is partially because Indexing tries to first match only index access, and it never gets to FunctionCall. Switching the order will help: Indexing <- FunctionCall / Symbol _ "[" _ Expr _ "]". But this is not the only problem, which makes it much harder to debug, because even if you fix one problem, it still doesn't work :slightly_smiling_face:

I'd suggest to rewrite the grammar slightly to separate the index access from field access. Currently they're kind of intertwined (HasField contains Indexing). I can't tell you how exactly, I'd have to give it much more time than I have available, sorry.

joagre commented 11 months ago

Thanks for the input. I can see now that my grammar snippet is ambiguous/faulty and need to be rewritten. From a completelly naively point of view I would like to have an even more ambiguous grammar though:

Program            <- _ TopLevelExpr (_ "," _ TopLevelExpr)* EOF
TopLevelExpr       <- Expr / Binding

# Precedence ordered
Expr               <- Add
Add                <- Multiplicate (_ "+" _  Multiplicate)*
Multiplicate       <- Indexing (_ "*" _  Indexing)*
Indexing           <- Expr _ "[" _ Expr _ "]"  / FunctionCall
FunctionCall       <- FunctionName _ "(" _ ExprSequence? _ ")" / FieldAccess
FieldAccess        <- Expr (_ "." _ Expr)* / Primary

FunctionName       <- Symbol
Symbol             <- [a-zA-Z_][a-zA-Z_0-9_]*
ExprSequence       <- Expr (_ "," _ Expr)*

Primary            <- Literal / Symbol
Literal            <- NumberLiteral
NumberLiteral      <- [0-9]+

Binding            <- MatchPattern _ "=" _ Expr
MatchPattern       <- Literal / FieldAccess / Symbol

_                  <- WS*
WS                 <- [ \t\r\n]
EOF                <- _ !.

:-)

It would require a parser with indefinite backtracking. Maybe not even that would suffice. Life is hard.

Thanks