antlr / grammars-v4

Grammars written for ANTLR v4; expectation that the grammars are free of actions.
MIT License

Design issue with C grammar #3554

Open nickion opened 1 year ago

nickion commented 1 year ago

I've come across a design issue with the C grammar in rules such as:

| '(' argumentExpressionList? ')'

This is within postfixExpression.

A C function call always has a number of arguments, possibly zero, whereas the grammar says that there may be either a set of one or more arguments or no set at all. It's a subtle but crucial distinction, and a consequence is that in a scenario such as:

foo()("bar")

within a visitor for postfix expressions there will be only one argumentExpressionList available, which would be assumed to be applied to the primary expression Identifier foo, rather than to the result of calling foo with no arguments.

This can be resolved by changing the grammar to:

| '(' argumentExpressionList ')'

which expresses that there is always an argument list, and by changing the argumentExpressionList rule to:

argumentExpressionList
    : /* empty */
    | assignmentExpression (',' assignmentExpression)*
    ;

which describes that an argument expression list may be empty or may contain one or more expressions. With this revision a postfix expression visitor can correctly determine the number of function calls in a chain from the length of the argumentExpressionList() result, and the number of arguments in each function call is the length of the result of calling assignmentExpression() on that list.

I did try labelling and parenthesising the arglist, e.g. '(' arglist+=(argumentExpressionList?) ')', thinking that ANTLR 4 might then always generate a value even though it may be empty, but this did not work. I've only been using ANTLR for a couple of days, so there may be an approach I've not discovered yet that resolves this without a grammar revision. The technique described above is one I've always used when designing languages and writing Yacc/Bison parsers for them, and so far it appears to work fine for ANTLR too.
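To make the difference concrete, here is a rough toy model in Python. It is not ANTLR-generated code and does not use the ANTLR runtime; `parse_postfix` and its `always_emit_arg_list` flag are hypothetical names made up for this sketch, and arguments are limited to single tokens. It only mimics the flat list of argument-list children that a visitor would see on one postfix expression node under each grammar design.

```python
import re

def tokenize(src):
    # Identifiers, string literals, integers, and the punctuation ( ) ,
    return re.findall(r'[A-Za-z_]\w*|"[^"]*"|\d+|[(),]', src)

def parse_postfix(src, always_emit_arg_list):
    """Return (primary, arg_lists) for a call chain such as foo()("bar").

    arg_lists stands in for the argument-list children of one postfix node.
    With always_emit_arg_list=False an empty '()' contributes no child
    (the optional design); with True it contributes an empty list
    (the revised design where the list itself may be empty).
    """
    tokens = tokenize(src)
    primary, pos = tokens[0], 1
    arg_lists = []
    while pos < len(tokens) and tokens[pos] == "(":
        pos += 1
        args = []
        while tokens[pos] != ")":
            if tokens[pos] != ",":
                args.append(tokens[pos])  # toy: each argument is one token
            pos += 1
        pos += 1  # skip the closing ")"
        if args or always_emit_arg_list:
            arg_lists.append(args)
    return primary, arg_lists

optional_design = parse_postfix('foo()("bar")', always_emit_arg_list=False)
revised_design = parse_postfix('foo()("bar")', always_emit_arg_list=True)
print(len(optional_design[1]), len(revised_design[1]))
```

Under the optional design the list has one entry for two calls, so its length no longer says how many calls were chained or which call the arguments belong to. Under the revised design there is one entry per call, and the first entry is empty.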

kaby76 commented 1 year ago

The postfixExpression rule comes almost verbatim from the C Language Specification. I think the Spec committee favors the EBNF argumentExpressionList? over allowing argumentExpressionList to derive the empty string in order to avoid issues like Kleene operators on the empty string and recursion. I don't have an issue with refactoring rules to derive empty, but it is another step to record when I automate the process of scraping this grammar when a new version of the spec comes out. The bigger issue is that the expression rules are not in optimized ANTLR syntax. The grammar is slow because of the chained-rule implementation for operator precedence.
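As a rough illustration of the chained-rule cost, here is a toy sketch in Python. It is hand-written, not ANTLR output, and it says nothing about ANTLR's actual prediction machinery. It assumes a small subset of C's binary operators; `parse_chained`, `parse_climbing`, and the one-element `counter` list are hypothetical names for this sketch. The chained parser enters one function per precedence level, as a grammar with one rule per level does, while the precedence-climbing parser stands in for the style where a single expression rule handles all levels.

```python
# Binary operator levels, lowest precedence first (a subset of C's).
LEVELS = [["||"], ["&&"], ["|"], ["^"], ["&"], ["==", "!="],
          ["<", ">", "<=", ">="], ["<<", ">>"], ["+", "-"], ["*", "/", "%"]]
PREC = {op: i for i, ops in enumerate(LEVELS) for op in ops}

def parse_chained(tokens, pos, level, counter):
    """One call per precedence level, mirroring a chained-rule grammar."""
    counter[0] += 1
    if level == len(LEVELS):
        return tokens[pos], pos + 1  # primary: a single identifier or constant
    left, pos = parse_chained(tokens, pos, level + 1, counter)
    while pos < len(tokens) and tokens[pos] in LEVELS[level]:
        op = tokens[pos]
        right, pos = parse_chained(tokens, pos + 1, level + 1, counter)
        left = (op, left, right)  # left-associative
    return left, pos

def parse_climbing(tokens, pos, min_prec, counter):
    """Precedence climbing: one function covers every level."""
    counter[0] += 1
    left, pos = tokens[pos], pos + 1  # primary
    while (pos < len(tokens) and tokens[pos] in PREC
           and PREC[tokens[pos]] >= min_prec):
        op = tokens[pos]
        right, pos = parse_climbing(tokens, pos + 1, PREC[op] + 1, counter)
        left = (op, left, right)  # left-associative
    return left, pos

tokens = "a + b * c".split()
chained_calls, climbing_calls = [0], [0]
tree_chained, _ = parse_chained(tokens, 0, 0, chained_calls)
tree_climbing, _ = parse_climbing(tokens, 0, 0, climbing_calls)
print(tree_chained == tree_climbing, chained_calls[0], climbing_calls[0])
```

Both parsers build the same tree, but the chained one has to descend through every level even for a lone identifier, which is the overhead the chained-rule layout carries in this toy model.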