Open Korporal opened 1 year ago
This is caused by copying and pasting a grammar, or by manually typing it in from paper. We should be scraping grammars by automatic means. A tool can enforce consistency and repeatability of the scrape.
That said, I don't think Antlr will be changed to detect patterns in LEXER_CHAR_SET that contain character ranges which should use a Minus Sign but instead use En Dash, Em Dash, or Small Em Dash. I have seen this exact problem with the ISO C++ language specs, because the specs are in LaTeX. Antlr does not warn against something like [a-] either. Is that supposed to recognize 'a' or '-'? Or was it meant to be a range, like hex digits? Even worse is something like [ab-.#,]. Is this supposed to be a range (default) or a set containing '-'? I have seen this several times in some of the grammars in grammars-v4. This sounds like a job for a "linter".
Antlr does not warn against something like [a-] either. Is that supposed to recognize 'a' or '-'?
ANTLR treats it as only two chars, 'a' and '-'.
Even worse is something like [ab-.#,]. Is this supposed to be a range (default) or a set containing '-'?
ANTLR treats it as a range from 'b' to '.'. ANTLR creates a range only if the hyphen is surrounded by chars on both sides. It's not obvious, but we can't change this behavior because it would break backward compatibility.
BTW it's described in documentation, see Lexer Rule Elements
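To make the rule in this reply concrete, here is a minimal Python sketch of it. It is not ANTLR's actual implementation: the function name `parse_char_set` is hypothetical, and escape sequences such as `\-` or `\]` are deliberately ignored. It only models the statement that a hyphen forms a range when it has a char on both sides.

```python
def parse_char_set(body):
    """Split the text between [ and ] into single chars and (lo, hi) ranges.

    Simplified model of the rule described above; escapes are not handled.
    """
    items = []
    i = 0
    while i < len(body):
        ch = body[i]
        # A hyphen makes a range only when a char precedes AND follows it.
        if i + 2 < len(body) and body[i + 1] == "-":
            items.append((ch, body[i + 2]))
            i += 3
        else:
            # Otherwise the char (including a leading or trailing '-')
            # is taken literally.
            items.append(ch)
            i += 1
    return items
```

Under this model, `parse_char_set("a-")` yields the two literal chars `'a'` and `'-'`, while `parse_char_set("ab-.#,")` yields `'a'`, the range `('b', '.')`, `'#'`, and `','`. A set body whose middle char is a Unicode dash rather than an ASCII hyphen comes out as three literal chars, because only the ASCII hyphen is treated as a range operator here.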
Discussed in https://github.com/antlr/antlr4/discussions/4012
This strikes me as a weakness in Antlr, especially for newbies copying and pasting examples and samples from the web. This has literally cost me four hours!
I want to suggest that the system be updated to detect the presence of possibly misleading Unicode characters in grammar files. Visually, [0-9] looks the same whether the hyphen is a simple ASCII hyphen or one of the several similar-looking Unicode chars. The antlr tool did not object to the Unicode EN-DASH inside a regular expression, yet at runtime it completely misbehaves (or seems to). So perhaps reporting invalid characters used in regular expressions will help...
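Until the tool reports this itself, a small standalone check can catch it. The following Python sketch is an assumption about what such a check might look like, not part of ANTLR: the names `CHAR_SET`, `is_lookalike_dash`, and `find_lookalike_dashes` are hypothetical, and the regular expression is a naive approximation of a lexer char set (it will also match brackets inside actions or string literals).

```python
import re
import unicodedata

# Naive match for [...] on a single line, allowing backslash escapes inside.
CHAR_SET = re.compile(r"\[((?:\\.|[^\]\\])*)\]")


def is_lookalike_dash(ch):
    """True for dash-like chars other than the ASCII hyphen-minus."""
    if ch == "-":
        return False
    # "Pd" is the Unicode dash-punctuation category (en dash, em dash, ...).
    # U+2212 MINUS SIGN is a math symbol, so it is listed explicitly.
    return unicodedata.category(ch) == "Pd" or ch == "\u2212"


def find_lookalike_dashes(grammar_text):
    """Return (line number, char set text, Unicode name) for each suspect char."""
    findings = []
    for line_no, line in enumerate(grammar_text.splitlines(), start=1):
        for match in CHAR_SET.finditer(line):
            for ch in match.group(1):
                if is_lookalike_dash(ch):
                    findings.append(
                        (line_no, match.group(0), unicodedata.name(ch))
                    )
    return findings
```

Running `find_lookalike_dashes` over the text of a `.g4` file returns an empty list when every char set uses the ASCII hyphen, and one entry per suspicious character otherwise, which could be wired into a pre-commit hook or CI step.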