Closed: jmitchell closed this issue 4 years ago.
The C backend parses aa as expected using the original grammar. Unlike the Haskell and OCaml backends, it also handles this slightly more abstract version:
TwoChars. MyString ::= MyChar MyChar ;
Char_a. MyChar ::= "a" ;
Hi @jmitchell. The inconsistency between the two backends is definitely a problem that we need to look at.
However, in my interpretation of the LBNF reference, I would say that in this case the bug is in the C backend. It shouldn't accept aa for "a" "a", as the lexer is supposed to follow the spacing convention of C-like languages.
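To make that reading concrete: the rule being discussed is not quoted in this excerpt, but from the later comments it was presumably of the form

TwoChars. MyString ::= "a" "a" ;    -- reconstructed from context, not quoted from the report

Under C-like spacing and tokenisation conventions the two terminals are separate tokens, so an input like a a (with whitespace between the keywords) would be expected to parse, while aa would not.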
> I discovered this while trying to implement my own non-terminal for strings without relying on regular expressions or the built-in definitions.
I'm curious why you don't want to use regular expressions. Do you want to implement a lexer that isn't a regular language?
I'm pleased to hear BNFC aims for backend consistency.
What I found in the LBNF reference regarding lexer spacing conventions is somewhat vague:
> The lexer follows rules familiar from languages like Haskell, C, and Java, including longest match and spacing conventions.
It's not clear from this that, by design, whitespace is implicitly interspersed between a production's terms for some backend lexers. I expected the grammar of the backend language wouldn't leak into the grammar specified in the .cf file. For those authoring a BNF grammar, avoiding such grammatical leaks seems important. Is there a way to disable implicit reinterpretations of the hand-authored BNFC grammars?
> I'm curious why you don't want to use regular expressions. Do you want to implement a lexer that isn't a regular language?
Since regular languages are subsumed by context-free grammars, I wanted to see how far I could go using only BNF. In principle I shouldn't need to rely on regular expressions when specifying a context-free grammar, nor should I need to think about lexers. I understand, however, that for historical reasons it's common to rely on regular expressions, and I don't fault BNFC if regular expressions are necessary to define certain grammars.
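As an illustration of the kind of grammar this approach leads to, here is a hypothetical sketch (not the grammar from the original report) that builds strings out of single-character keyword rules and a list category:

Str.    MyString ::= [MyChar] ;
Char_a. MyChar   ::= "a" ;
Char_b. MyChar   ::= "b" ;
terminator MyChar "" ;

Under the spacing convention discussed above, such a grammar would only accept its characters when they are separated by whitespace, which is exactly the limitation this issue runs into.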
There are two things to note:
Lexing of keywords is at the moment implemented very differently in the functional and non-functional backends:
Haskell and OCaml just lex Idents and then decide whether one of the keywords was lexed as an Ident, possibly converting it. Thus whole words are always lexed, and "aa" is lexed as one word irrespective of the grammar.
C/C++/Java put the keywords directly into the lexer definition and can thus lex "aa" as keyword "a" followed by "a". This seems to violate the intuition behind BNFC.
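To make the difference concrete, below is a minimal sketch in plain Haskell of the two strategies just described. It is not BNFC's actual generated code; the token type, keyword set, and function names are invented for this illustration, and the grammar is assumed to have "a" as its only keyword and no identifier category.

-- A minimal sketch, NOT BNFC's generated lexer code: everything below is
-- invented for illustration of the two keyword-lexing strategies.

import qualified Data.Set as Set
import Data.Char (isSpace)

data Token = TKeyword String | TWord String
  deriving Show

keywords :: Set.Set String
keywords = Set.fromList ["a"]

-- Functional-backend style: lex maximal identifier-like words first, then
-- reclassify a word as a keyword only if the whole word matches one.
-- "aa" therefore stays a single word and never becomes the keyword "a".
lexByWords :: String -> [Token]
lexByWords = map classify . words
  where
    classify w
      | w `Set.member` keywords = TKeyword w
      | otherwise               = TWord w

-- C-family style: every keyword is its own lexer rule, so the keyword "a"
-- can be taken directly off the front of the input, even inside what looks
-- like a word. "aa" then lexes as the keyword "a" twice.
lexByKeywords :: String -> [Token]
lexByKeywords [] = []
lexByKeywords s@(c:rest)
  | isSpace c = lexByKeywords rest
  | Just (kw, rest') <- matchKeyword s = TKeyword kw : lexByKeywords rest'
  | otherwise = let (w, rest') = break isSpace s
                in TWord w : lexByKeywords rest'
  where
    matchKeyword str =
      case [ k | k <- Set.toList keywords, take (length k) str == k ] of
        (k:_) -> Just (k, drop (length k) str)
        []    -> Nothing

main :: IO ()
main = do
  print (lexByWords    "aa")   -- prints [TWord "aa"]
  print (lexByKeywords "aa")   -- prints [TKeyword "a",TKeyword "a"]

The comments on main show the diverging results for the input aa, mirroring the behaviour reported in this issue.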
To bring the backends into sync, we could either do something special in the Haskell and OCaml backends if no identifier-like token type is part of the grammar. However, since BNFC is a tool for prototyping programming languages, this is hardly worth the effort, as all PLs have identifiers.
Or we could fix the C-family backends to arbitrarily include Ident in the lexer definition even if it is not used in the grammar. Not sure who would be helped by this either.
In conclusion, I find this issue out of scope for the intended applications of BNFC. We can leave the behavior of BNFC undefined in such corner cases.
I'll close this as wont-fix.
Using the OCaml and Haskell backends, this grammar fails to parse the string aa. Changing the rule to
TwoChars. MyString ::= "aa" ;
makes aa parse as expected.
I discovered this while trying to implement my own non-terminal for strings without relying on regular expressions or the built-in definitions.