Closed: jmitchell closed this issue 4 years ago.
The C backend parses aa as expected using the original grammar. Unlike the Haskell and OCaml backends, it also handles this slightly more abstract version:
TwoChars. MyString ::= MyChar MyChar ;
Char_a. MyChar ::= "a" ;
Hi @jmitchell. The inconsistency between the two backends is definitely a problem that we need to look at.
However, in my interpretation of the LBNF reference, I would say that in this case the bug is in the C backend. It shouldn't accept aa for "a" "a", as the lexer is supposed to follow the spacing convention of C-like languages.
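To make that reading concrete: the rule being discussed is not quoted in this excerpt, but from the later comments it was presumably of the form

TwoChars. MyString ::= "a" "a" ;    -- reconstructed from context, not quoted from the report

Under C-like spacing and tokenisation conventions the two terminals are separate tokens, so an input like a a (with whitespace between the keywords) would be expected to parse, while aa would not.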
> I discovered this while trying to implement my own non-terminal for strings without relying on regular expressions or the built-in definitions.
I'm curious why you don't want to use regular expressions. Do you want to implement a lexer that isn't a regular language?
I'm pleased to hear BNFC aims for backend consistency.
What I found in the LBNF reference regarding lexer spacing conventions is somewhat vague:
> The lexer follows rules familiar from languages like Haskell, C, and Java, including longest match and spacing conventions.
It's not clear from this that, by design, whitespace is implicitly interspersed between a production's terms for some backend lexers. I expected the grammar of the backend language wouldn't leak into the grammar specified in the .cf file. For those authoring a BNF grammar, avoiding such grammatical leaks seems important. Is there a way to disable implicit reinterpretations of the hand-authored BNFC grammars?
> I'm curious why you don't want to use regular expressions. Do you want to implement a lexer that isn't a regular language?
Since regular languages are subsumed by context-free grammars, I wanted to see how far I could go using only BNF. In principle I shouldn't need to rely on regular expressions when specifying a context-free grammar, nor should I need to think about lexers. I understand, however, that for historical reasons it's common to rely on regular expressions, and I don't fault BNFC if regular expressions are necessary to define certain grammars.
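As an illustration of the kind of grammar this approach leads to, here is a hypothetical sketch (not the grammar from the original report) that builds strings out of single-character keyword rules and a list category:

Str.    MyString ::= [MyChar] ;
Char_a. MyChar   ::= "a" ;
Char_b. MyChar   ::= "b" ;
terminator MyChar "" ;

Under the spacing convention discussed above, such a grammar would only accept its characters when they are separated by whitespace, which is exactly the limitation this issue runs into.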
There are two things to note:
Lexing of keywords is at the moment implemented very differently in the functional and non-functional backends:
Haskell and OCaml just lex Idents and then decide whether one of the keywords was lexed as an Ident, possibly converting it. Thus whole words are always lexed, and "aa" is lexed as one word irrespective of the grammar.
C/C++/Java put the keywords directly into the lexer definition and can thus lex "aa" as keyword "a" followed by "a". This seems to violate the intuition behind BNFC.
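To make the difference concrete, below is a minimal sketch in plain Haskell of the two strategies just described. It is not BNFC's actual generated code; the token type, keyword set, and function names are invented for this illustration, and the grammar is assumed to have "a" as its only keyword and no identifier category.

-- A minimal sketch, NOT BNFC's generated lexer code: everything below is
-- invented for illustration of the two keyword-lexing strategies.

import qualified Data.Set as Set
import Data.Char (isSpace)

data Token = TKeyword String | TWord String
  deriving Show

keywords :: Set.Set String
keywords = Set.fromList ["a"]

-- Functional-backend style: lex maximal identifier-like words first, then
-- reclassify a word as a keyword only if the whole word matches one.
-- "aa" therefore stays a single word and never becomes the keyword "a".
lexByWords :: String -> [Token]
lexByWords = map classify . words
  where
    classify w
      | w `Set.member` keywords = TKeyword w
      | otherwise               = TWord w

-- C-family style: every keyword is its own lexer rule, so the keyword "a"
-- can be taken directly off the front of the input, even inside what looks
-- like a word. "aa" then lexes as the keyword "a" twice.
lexByKeywords :: String -> [Token]
lexByKeywords [] = []
lexByKeywords s@(c:rest)
  | isSpace c = lexByKeywords rest
  | Just (kw, rest') <- matchKeyword s = TKeyword kw : lexByKeywords rest'
  | otherwise = let (w, rest') = break isSpace s
                in TWord w : lexByKeywords rest'
  where
    matchKeyword str =
      case [ k | k <- Set.toList keywords, take (length k) str == k ] of
        (k:_) -> Just (k, drop (length k) str)
        []    -> Nothing

main :: IO ()
main = do
  print (lexByWords    "aa")   -- prints [TWord "aa"]
  print (lexByKeywords "aa")   -- prints [TKeyword "a",TKeyword "a"]

The comments on main show the diverging results for the input aa, mirroring the behaviour reported in this issue.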
To bring the backends into sync, we could either do something special in the Haskell and OCaml backends if no identifier-like token type is part of the grammar. However, since BNFC is a tool for prototyping programming languages, this is hardly worth the effort, as all PLs have identifiers.
Or we could fix the C-family backends to arbitrarily include Ident in the lexer definition even if it is not used in the grammar. Not sure who would be helped by this either.
In conclusion, I find this issue out of scope for the intended applications of BNFC. We can leave the behavior of BNFC undefined in such corner cases.
I'll close this as wont-fix.
Using the OCaml and Haskell backends, this grammar fails to parse the string aa. Changing the rule to
TwoChars. MyString ::= "aa" ;
makes aa parse as expected.
I discovered this while trying to implement my own non-terminal for strings without relying on regular expressions or the built-in definitions.