BNFC / bnfc

BNF Converter
http://bnfc.digitalgrammars.com/

Antlr backend - quotation marks in bracket expressions are escaped when they shouldn't #319

Closed: fonfalleh closed this issue 4 years ago

fonfalleh commented 4 years ago

It seems the only characters that should be escaped in bracket expressions in regexes are ], \, and -. I'm not sure if this means that there needs to be different escaping in different contexts. https://github.com/antlr/antlr4/blob/master/doc/lexer-rules.md#lexer-rule-elements

Example token rule that generates broken code (not by any means good or correct; I just noticed that the resulting lexer file doesn't work):

    token NoteToken ["abcdefgr"]({"es"} | {"is"})*["\',"]*(digit)*["."]* ;

This results in the following line in the Lexer.g4 file:

    NoteToken : [abcdefgr]('e''s'|'i''s')*[\',]*DIGIT*'.'*;

which generates the following warning when building:

    warning(156): lily/lilyLexer.g4:83:38: invalid escape sequence \'

The build also complains about the following line:

    STRINGTEXT : ~[\"\\] -> more;

https://github.com/BNFC/bnfc/blob/3ca72116c4a8f541dff2778450efabd72219cf8e/source/src/BNFC/Backend/Java/CFtoAntlr4Lexer.hs#L157

The build works as expected when removing the extra backslashes as follows:

    NoteToken : [abcdefgr]('e''s'|'i''s')*[',]*DIGIT*'.'*;
    ...
    STRINGTEXT : ~["\\] -> more;


Sidenote: I first thought this could be related to the line below, which references RegToJLex.hs instead of RegToAntlrLexer.hs, but the reference seems to be correct, even if the naming is confusing. https://github.com/BNFC/bnfc/blob/3ca72116c4a8f541dff2778450efabd72219cf8e/source/src/BNFC/Backend/Java/CFtoAntlr4Lexer.hs#L150

Export from RegToAntlrLexer: https://github.com/BNFC/bnfc/blob/3ca72116c4a8f541dff2778450efabd72219cf8e/source/src/BNFC/Backend/Java/RegToAntlrLexer.hs#L1

andreasabel commented 4 years ago

It seems that the regular expression printer does not apply special printing rules when printing content in bracketed expressions, but maybe it should, according to the rules you quoted above:

> The following escaped characters are interpreted as single special characters: \n, \r, \b, \t, \f, \uXXXX, and \u{XXXXXX}. To get ], \, or - you must escape them with \.

(From https://github.com/antlr/antlr4/blob/master/doc/lexer-rules.md#lexer-rule-elements)

The problematic lines in BNFC are thus:

https://github.com/BNFC/bnfc/blob/3ca72116c4a8f541dff2778450efabd72219cf8e/source/src/BNFC/Backend/Java/RegToAntlrLexer.hs#L79
https://github.com/BNFC/bnfc/blob/3ca72116c4a8f541dff2778450efabd72219cf8e/source/src/BNFC/Backend/Java/RegToAntlrLexer.hs#L69-L72

There, instead of calling the prt function recursively, a special print function for content inside brackets should be called.
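To illustrate the idea of context-dependent escaping, here is a minimal sketch. It is not the code in RegToAntlrLexer.hs and not necessarily what any fix looks like; the names `escapeInSet`, `escapeInLiteral`, `charSet`, and `literal` are hypothetical. It follows the rule quoted above for bracket sets (only `]`, `\`, and `-` get a backslash) and assumes that quoted literals escape `'` and `\`. Non-printable characters such as newlines are left out of the sketch.

```haskell
module AntlrEscapeSketch where

-- | Escape one character for use inside an ANTLR 4 bracket set such as [abc].
-- Per the quoted lexer-rule documentation, only ']', '\' and '-' are escaped.
escapeInSet :: Char -> String
escapeInSet c
  | c `elem` "]\\-" = ['\\', c]
  | otherwise       = [c]

-- | Escape one character for use inside a quoted literal such as 'abc'.
-- Assumption for this sketch: only the single quote and the backslash are escaped.
escapeInLiteral :: Char -> String
escapeInLiteral c
  | c `elem` "'\\" = ['\\', c]
  | otherwise      = [c]

-- | Render a set of characters as a bracket expression.
charSet :: String -> String
charSet cs = "[" ++ concatMap escapeInSet cs ++ "]"

-- | Render a string as a quoted literal.
literal :: String -> String
literal s = "'" ++ concatMap escapeInLiteral s ++ "'"
```

With these definitions, `charSet "',"` yields `[',]` and `charSet "\"\\"` yields `["\\]`, which match the hand-corrected lines from the report, while `literal "it's"` still escapes the quote as `'it\'s'`.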

andreasabel commented 4 years ago

@fonfalleh : Can you test if PR #321 works for you?

fonfalleh commented 4 years ago

> @fonfalleh : Can you test if PR #321 works for you?

Seems to work, thanks! :+1:

andreasabel commented 4 years ago

Great!

andreasabel commented 3 years ago

My fix wasn't complete, see #329.