Open alchitry opened 1 year ago
While ANTLR is quite accepting of what you type in as a grammar, that doesn't mean everything it accepts is a 'good' grammar that should just work. In general, though, ANTLR4 will make some sense of whatever you send in.
This isn't a bug - it's just the way it is. If we step back and ask "what does EOF mean?", we can see that this grammar actually just doesn't make any sense. EOF literally means "and it all ends here", so placing it in an alt is nonsensical. I realize that the submitted grammar is meant to be a contrived example to show something. But in fact it is just an example of how not to use EOF. EOF is really a marker to make sure that the parser will keep going until it hits the end of the input stream. It is really only for this:
translationUnit: stuff EOF;
The reason it is not automatically assumed - well, one reason - is that sometimes you just want to parse a statement or an expression etc., and the parser should stop when it completes that, whether there is anything else to look at or not. EOF says "and keep going until end of file".
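For example, a grammar can expose both an EOF-anchored entry point for whole files and a bare rule for parsing a single fragment. The rule and token names here are illustrative, not from the submitted grammar:

```antlr
grammar Example;
translationUnit : statement* EOF ;    // whole-input entry point: must consume everything
statement       : ID '=' ID ';' ;     // usable on its own to parse just one statement
ID              : [a-zA-Z]+ ;
WS              : [ \t\r\n]+ -> skip ;
```

Calling parser.translationUnit() forces the whole input to be matched; calling parser.statement() stops after one statement, whether or not more input follows.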
I sometimes wish ANTLR4 wouldn't just let us write more or less anything we want - though to be fair, that was mostly the point of ANTLR4 and ALL(*). But this flexibility is why half the contributed grammars have just been copied from the normative description in documentation, without any thought to ambiguity, and therefore perform terribly.
You can't have "some newlines OR end of file". You can only have "some stuff, maybe followed by some trailing newlines, and then the end":
startingRule: 'test' newLine* EOF;
I think one of us needs to write an article on this. And by one of us, I guess I mean me, as I am the one mentioning it. But basically, the fact that EOF is a 'predefined' lexer rule should be a clue that it represents a state, not an actual lexeme.
@parrt I think we should just close this one, and I will write something to explain the idea. We have already decided that modifying the tool just to give a warning about this isn't worth the effort.
@alchitry Thanks for the submission - this comes up fairly regularly and it does need some documentation. I also intend to write a long (hopefully not too boring) article about how to approach writing grammars in general - I will make sure to include this topic.
Grammars really should have an EOF-terminated start rule. Not only does it force the parser to consume all the input, it also makes it easier to detect when the start rule is mistakenly used on the right-hand side of another rule, as in this grammar:
grammar test;
start: 'foo' EOF;
bar: start 'bad';
What in the world does parser.bar() recognize?
Start-rule problems have happened a lot in the grammars-v4 repo. Even now, we have a PR (https://github.com/antlr/grammars-v4/pull/2394) that doesn't have an EOF-terminated start rule. It parses only the first few tokens of the input fine, then stops on a ':'. (Additionally, the grammar has a lexer catch-all at the end, which is also probably not a good idea: COMMENT : ';' NON_NL* NL -> channel(HIDDEN); ...; NON_NL : ~[\r\n];. NON_NL should have been a fragment.) The Maven test plugin returns "all fine!" but it's anything but fine. The parser gave up at the point where it couldn't continue.
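The fix for that catch-all is to make NON_NL a fragment, so it can be used inside COMMENT but can never be emitted as a token on its own. A sketch (NL's definition is assumed here, since the PR doesn't show it):

```antlr
COMMENT : ';' NON_NL* NL -> channel(HIDDEN) ;
NL      : '\r'? '\n' ;
// As a fragment, NON_NL is only a helper for other lexer rules and
// never produces a NON_NL token by itself.
fragment NON_NL : ~[\r\n] ;
```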
This is why I recommend checking your grammar using automated means. Below is an update to my script to detect bad start symbol grammars.
# Verify that we have an EOF-terminated start rule. E.g.
# foobar : ('foo' | 'bar')* EOF;
#
lines=`trparse $@ | \
trxgrep " //parserRuleSpec[ruleBlock//TOKEN_REF/text()='EOF']/RULE_REF" | \
trtext`
if [ "$lines" == "" ]
then
echo $@ does not have an EOF-start rule.
exit 1
fi
# Verify that we don't have a grammar where the EOF symbol is followed
# by a grammar symbol. E.g.,
# foobar : ('foo' | 'bar')* EOF 'wonderful';
#
lines=`trparse $@ | \
trxgrep ' //parserRuleSpec[.//alternative/element[.//TOKEN_REF/text()="EOF"]/following-sibling::element]' | \
trtext`
if [ "$lines" != "" ]
then
echo $lines
echo $@ has an EOF usage followed by another element.
exit 1
fi
# Verify that we don't have a grammar with an EOF in one alt, and not
# in all the other alts. E.g.,
# newLine: '\n'+ | EOF;
#
lines=`trparse $@ | \
trxgrep ' //labeledAlt[.//TOKEN_REF/text()="EOF" and count(../labeledAlt) > 1]/ancestor::parserRuleSpec' | \
trtext`
if [ "$lines" != "" ]
then
echo $lines
echo $@ has an EOF in one alt, but not in another.
exit 1
fi
# Verify that the start symbol is not used on the right-hand side of
# any rule. E.g.,
# startingRule: 'test' newLine EOF;
# unusedBadRule: startingRule '}';
#
lines=`trparse $@ | \
trxgrep 'for $i in (//parserRuleSpec[ruleBlock//TOKEN_REF/text()="EOF"]/RULE_REF/text() ) return //parserRuleSpec[./ruleBlock//RULE_REF = $i]' | \
trtext`
if [ "$lines" != "" ]
then
echo $lines
echo $@ has start symbol that is used on the RHS of a rule.
exit 1
fi
A final lexer catch-all rule is a good idea, since you don't want the lexer throwing errors, but it should be just:
ERR: . ;
as the last rule.
A final lexer catch-all rule is a good idea, since you don't want the lexer throwing errors, but it should be just ERR: . ; as the last rule.
It's a personal dislike: I do not want error rules of any kind in either parser or lexer grammar because they are implementation details. I don't know of any language spec that includes error rules (e.g., the Java Spec, any of the Annex A's of the ISO C++ Spec or working drafts, or even the Python language "spec"). Also, analyzing the grammar for missing symbol references becomes harder because it may not be obvious except by an unstandardized naming convention whether an applied occurrence of the symbol is accidentally or purposefully missing from the grammar. I would prefer to change the error recovery code and keep a clear spec/implementation boundary.
That said, ANTLR lexer errors aren't percolated up to the parser. For example, for the grammar test; start: A* EOF; A: 'a'; with input aaabbaaa, running parser.start(); System.Console.WriteLine(parser.NumberOfSyntaxErrors); will output lexer errors, but NumberOfSyntaxErrors will be 0. Either a lexer catch-all must be added to generate a token that the parser can choke on, or one needs to add code to keep track of the lexer error count. Since I can't assume such rules exist in any of the grammars in grammars-v4, I have to do the latter.
Normative specs don't include error specification. But every time I have implemented a language, I have done this, because you want any error to travel as far up the analysis chain as possible. So the lexer always emits a token, and the parser can often give more context than the lexer. My parsers will accept anything that could be valid, because a semantic error is usually better than a syntax error. This is the better approach for commercial use. Lexer errors tend not to be very useful.
Our problem with the contributed grammars is that often someone has just copied the normative grammar without understanding that it is there to tell you what is valid syntax, not something to write a parser from blindly. Hence they are massively ambiguous. I want to spend time improving them.
Rather, I should say error-handling specification. They will say "all other characters are illegal".
This all makes sense, and as I said before, I don't think it is a big deal. It doesn't make a lot of sense for EOF to be anywhere but at the end of a rule.
I ran into this while writing tests for my code that parse smaller sections of the grammar instead of starting at the main start rule. I was trying to write the grammar so that semicolons would be optional (a line break would count as one).
I added EOF as a valid "semicolon" replacement so that when I ran the tests on a single line, it would parse correctly. In the normal use case of a full parse, this would never happen.
In my tests I now just terminate the single lines with a semicolon, and I removed the EOF from that rule. Now the only EOF is in the main start rule, which terminates with it.
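The resulting shape can be sketched as follows; only startingRule and newLine are named in this thread, so statement and its body are placeholders:

```antlr
startingRule : statement* EOF ;          // EOF appears only here now
statement    : 'test' (';' | newLine) ;  // placeholder: semicolon or line break ends it
newLine      : '\n'+ ;                   // no EOF alternative anymore
```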
I ran into a weird bug where EOF is failing to match with the error
no viable alternative at input '<EOF>'
Here is a minimal grammar to reproduce it:
grammar test;
startingRule: 'test' newLine EOF;
unusedBadRule: startingRule '}';
newLine: '\n'+ | EOF;
If you feed in simply "test" and start at startingRule, it'll fail with the EOF error. If you remove unusedBadRule, or remove startingRule from it, it works as expected. I'm not sure why, as unusedBadRule isn't even being used. If you replace newLine with the EOF token directly (or remove the '\n' part from the rule), it works. Removing the '}' from the end of unusedBadRule also makes it work. I tested this with 4.12.0 targeting Java and using the IntelliJ plugin.
I'd say this is a fairly low-priority bug, as I don't think it happens if you start at the highest point in the grammar. I ran into it while writing tests for my code.