antlr / antlr4

ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files.
http://antlr.org
BSD 3-Clause "New" or "Revised" License

EOF fails to match when starting rule used elsewhere. #4242

Open alchitry opened 1 year ago

alchitry commented 1 year ago

I ran into a weird bug where EOF fails to match, with the error: no viable alternative at input '<EOF>'.

Here is a minimal grammar to reproduce it.

grammar Test;

startingRule: 'test' newLine;
unusedBadRule: startingRule '}';
newLine: '\n'+ | EOF;

If you feed in simply "test" and start at startingRule it'll fail with the EOF error.

If you remove unusedBadRule, or remove startingRule from it, it works as expected. I'm not sure why this happens, as that rule isn't even being used.

If you replace newLine with the EOF token directly (or remove the '\n' part from the rule), it works. Removing the '}' from the end of unusedBadRule also makes it work.

I tested this with 4.12.0 targeting Java and using the IntelliJ plugin.

I'd say this is a fairly low-priority bug, as I don't think it happens if you start at the highest rule in the grammar. I ran into it while writing tests for my code.

kaby76 commented 1 year ago

See https://github.com/antlr/antlr4/issues/3851.

jimidle commented 1 year ago

While ANTLR is quite accepting of what you type in as a grammar, that doesn't mean everything it accepts is a 'good' grammar that should just work. In general, though, ANTLR4 will make some sense of whatever you send in.

This isn't a bug - it's just the way it is. If we step back and think "what does EOF mean?", we can see that this grammar just doesn't make sense. EOF literally means "and it all ends here", so placing it in an alt is nonsensical. I realize that the submitted grammar is meant to be a contrived example to show something. But, in fact, it is just an example of how not to use EOF. EOF is really a marker to make sure that the parser will keep going until it hits the end of the input stream. It is really only for this:

translationUnit: stuff EOF;

The reason it is not automatically assumed - well, one reason - is that sometimes you just want to parse a statement or an expression etc., and the parser should stop when it completes that, whether there is anything else to look at or not. EOF says "and keep going until end of file".
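As a hypothetical sketch of that split (rule names are mine, not from the thread): keep EOF on the whole-file entry point only, and let sub-rules used for partial parsing stop on their own:

grammar Sketch;

// Whole-file entry point: EOF forces the parser to consume all input.
file: statement* EOF;

// Sub-rule entry point for tests or tools: completes when the
// statement does, regardless of whatever follows in the stream.
statement: ID '=' ID ';';

ID: [a-z]+;
WS: [ \t\r\n]+ -> skip;

Calling parser.statement() then stops cleanly after one statement, while parser.file() insists on consuming everything.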

I sometimes wish ANTLR4 wouldn't let us write more or less anything we want - though, to be fair, that was largely the point of ANTLR4 and ALL(*). But this flexibility is why half the contributed grammars have just been copied from the normative description in the documentation without any thought to ambiguity, and therefore perform terribly.

You can't have "some newlines OR end of file". You can only have "some stuff, maybe followed by some trailing newlines, and then the end":

startingRule: 'test' newLine* EOF;

I think one of us needs to write an article on this. And by one of us, I guess I mean me, as I am the one mentioning it. But basically, the fact that EOF is a 'predefined' lexer rule should be a clue that it is a state, not an actual lexeme.

@parrt I think we just close this one and I will write something to explain the idea. We have already decided that modifying the tool just to give some warning about this isn't worth the effort.

@alchitry Thanks for the submission - this comes up fairly regularly and it does need some documentation. I also intend to write a long (hopefully not too boring) article about how to approach writing grammars in general - I will make sure to include this topic.

kaby76 commented 1 year ago

Grammars really should have an EOF-terminated start rule. Not only does it force the parse to consume all the input, but it makes it easier to detect that the start rule is not used on the right-hand side of another rule as in this grammar:

grammar test;
start: 'foo' EOF;
bar: start 'bad';

What in the world does parser.bar() recognize?
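One common repair (my sketch, not taken from the comment) is to factor the shared body into a helper rule, so the EOF-terminated start rule never appears on a right-hand side:

grammar test;

start: content EOF;    // entry point; consumes all input
bar: content 'bad';    // references the helper, not the start rule
content: 'foo';

With this shape, parser.bar() has a sensible meaning, and start keeps its EOF anchor.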

Start-rule problems have happened a lot in the grammar-v4 repo. Even now, we have a PR (https://github.com/antlr/grammars-v4/pull/2394) that doesn't have an EOF-terminated start rule. It parses only the first few tokens of the input fine, but stops on a ':'. (Additionally, the grammar has a lexer catch-all at the end, which is also probably not a good idea: COMMENT : ';' NON_NL* NL -> channel(HIDDEN); ...; NON_NL : ~[\r\n];. NON_NL should have been a fragment.) The Maven test plugin reports "all fine!", but it's anything but fine: the parser gave up at the point where it couldn't continue.

This is why I recommend checking your grammar using automated means. Below is an update to my script to detect bad start symbol grammars.

# Verify that we have an EOF-terminated start rule. E.g.
# foobar : ('foo' | 'bar')* EOF;
#
lines=`trparse $@ | \
    trxgrep " //parserRuleSpec[ruleBlock//TOKEN_REF/text()='EOF']/RULE_REF" | \
    trtext`
if [ "$lines" == "" ]
then
  echo $@ does not have an EOF-start rule.
  exit 1
fi

# Verify that we don't have a grammar where the EOF symbol is followed
# by a grammar symbol. E.g.,
# foobar : ('foo' | 'bar')* EOF 'wonderful';
#
lines=`trparse $@ | \
    trxgrep ' //parserRuleSpec[.//alternative/element[.//TOKEN_REF/text()="EOF"]/following-sibling::element]' | \
    trtext`
if [ "$lines" != "" ]
then
  echo $lines
  echo $@ has an EOF usage followed by another element.
  exit 1
fi

# Verify that we don't have a grammar with an EOF in one alt, and not
# in all the other alts. E.g.,
# newLine: '\n'+ | EOF;
#
lines=`trparse $@ | \
    trxgrep ' //labeledAlt[.//TOKEN_REF/text()="EOF" and count(../labeledAlt) > 1]/ancestor::parserRuleSpec' | \
    trtext`
if [ "$lines" != "" ]
then
  echo $lines
  echo $@ has an EOF in one alt, but not in another.
  exit 1
fi

# Verify that the start symbol is not used on the right-hand side of
# any rule. E.g.,
# startingRule: 'test' newLine EOF;
# unusedBadRule: startingRule '}';
#
lines=`trparse $@ | \
    trxgrep 'for $i in (//parserRuleSpec[ruleBlock//TOKEN_REF/text()="EOF"]/RULE_REF/text() ) return //parserRuleSpec[./ruleBlock//RULE_REF = $i]' | \
    trtext`
if [ "$lines" != "" ]
then
  echo $lines
  echo $@ has start symbol that is used on the RHS of a rule.
  exit 1
fi

jimidle commented 1 year ago

A final lexer catch-all rule is a good idea, as you don't want the lexer throwing errors. But it should be just:

ERR: . ;

as the last rule.
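As an illustrative placement (the other lexer rules here are made up):

lexer grammar SketchLexer;

ID : [a-zA-Z_]+ ;
INT : [0-9]+ ;
WS : [ \t\r\n]+ -> skip ;

// Catch-all: last rule, so it only matches when nothing else does.
// It emits a one-character token for the parser to report on,
// instead of the lexer raising its own error.
ERR : . ;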

kaby76 commented 1 year ago

A final lexer catch all rule is a good idea, as you don’t want the lexer throwing errors but it should be just: ERR: . ; As the last rule.

It's a personal dislike: I do not want error rules of any kind in either the parser or lexer grammar, because they are implementation details. I don't know of any language spec that includes error rules (e.g., the Java spec https://docs.oracle.com/javase/specs/jls/se20/html/jls-3.html, any of the Annex A's of the ISO C++ spec or working drafts https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n4296.pdf, or even the Python language "spec" https://docs.python.org/3/reference/grammar.html). Also, analyzing the grammar for missing symbol references becomes harder, because it may not be obvious, except by an unstandardized naming convention, whether an applied occurrence of a symbol is accidentally or purposefully missing from the grammar. I would prefer to change the error recovery code and keep a clear spec/implementation boundary.

That said, ANTLR lexer errors aren't propagated to the parser. For example, for grammar test; start: A* EOF; A: 'a';, with input aaabbaaa, parser.start(); System.Console.WriteLine(parser.NumberOfSyntaxErrors); will output lexer errors, but NumberOfSyntaxErrors will be 0. Either a lexer catch-all must be added to generate a token that the parser can choke on, or one needs to add code to keep track of the lexer error count. Since I can't assume such rules in any of the grammars in grammars-v4, I have to do the latter.
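For the "keep track of the lexer error count" option, here is a sketch using the standard ANTLR 4 Java runtime listener API (the generated class names testLexer/testParser are assumptions based on the grammar above):

// Java: count lexer errors with a listener instead of a catch-all rule.
import org.antlr.v4.runtime.*;

class CountingErrorListener extends BaseErrorListener {
    int count = 0;

    @Override
    public void syntaxError(Recognizer<?, ?> recognizer, Object offendingSymbol,
                            int line, int charPositionInLine,
                            String msg, RecognitionException e) {
        count++;   // tally every error this recognizer reports
    }
}

// Usage sketch, assuming ANTLR-generated testLexer/testParser classes:
//   testLexer lexer = new testLexer(CharStreams.fromString("aaabbaaa"));
//   CountingErrorListener lexErrors = new CountingErrorListener();
//   lexer.removeErrorListeners();
//   lexer.addErrorListener(lexErrors);
//   testParser parser = new testParser(new CommonTokenStream(lexer));
//   parser.start();
//   // parser.getNumberOfSyntaxErrors() stays 0, but lexErrors.count does not.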

jimidle commented 1 year ago

Normative specs don’t include error specification. But every time I have implemented a language, I have done this, because you want any error to travel as far up the analysis chain as possible. So the lexer always emits a token, and the parser can often give more context than the lexer. My parsers will accept anything that could be valid, because a semantic error is usually better than a syntax error. This is the better approach for commercial use. Lexer errors tend not to be very useful.

Our problem with the contributed grammars is that often someone has just copied the normative grammar without understanding that it is there to tell you what is valid syntax, not something to write a parser from blindly. Hence they are massively ambiguous. I want to spend time improving them.

jimidle commented 1 year ago

Rather, I should say error-handling specification. They will say "all other characters are illegal".

alchitry commented 1 year ago

This all makes sense, and as I said before, I don't think it is a big deal. It doesn't make a lot of sense for EOF to appear anywhere other than at the end of the grammar.

I ran into this while writing tests for my code that parse smaller sections of the grammar rather than starting at the main starting rule. I was trying to write the grammar so that semicolons would be optional (a line break would count as one).

I added EOF to the rule as a valid "semicolon" replacement so that, when I ran the tests on one line, it would parse correctly. In a normal full parse, this case would never happen.

In my tests, I now just terminate the single lines with a semicolon, and I removed the EOF from that rule. The only EOF left is in the main start rule, which terminates with it.
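The resulting shape (my reconstruction of the pattern described, with illustrative rule names rather than the actual grammar) is something like:

grammar Sketch;

source: statement* EOF;   // the only EOF in the grammar
statement: 'test' eos;
eos: ';' | NL+;           // a semicolon or newlines end a statement

NL: '\n';
WS: [ \t]+ -> skip;

// Tests call parser.statement() on input such as "test;" and no
// longer rely on EOF appearing inside eos.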