antlr / antlr4

ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files.
http://antlr.org
BSD 3-Clause "New" or "Revised" License

Antlr g4 files and Unicode, problems. #4013

Open Korporal opened 1 year ago

Korporal commented 1 year ago

Discussed in https://github.com/antlr/antlr4/discussions/4012

Originally posted by **Korporal** December 14, 2022

I'm reading through Terence Parr's book and experimenting as I go. I'm a Windows user. The tools seem to work, everything is set up, and I'm generating Java for now (I will do C# later when I'm more accustomed to all this).

Anyway, I'm stumped: a tiny grammar (literally a tweak of what's in the book) causes me issues. Here is the grammar and a sample input file:

![image](https://user-images.githubusercontent.com/12262952/207656200-f860d6d2-f8f1-46b7-b34b-1707e04b1e30.png)

This is the output from the `grun` command:

![image](https://user-images.githubusercontent.com/12262952/207656480-725fc1f4-1257-46a5-a761-7738afddaf31.png)

I cannot understand why the simplest input, `a=1`, causes a problem! Surely a single digit `1` meets the definition of `INT` in that grammar? Here is the file itself as a raw data dump:

![image](https://user-images.githubusercontent.com/12262952/207657338-51aea50f-108f-4bbf-8c91-8ead7aceeb53.png)

If I change the definition of `INT` to be simply `INT: '1' ;` then it parses that file fine.

## Resolved

OK, it's Unicode again. Here is what's actually in the `g4` file:

![image](https://user-images.githubusercontent.com/12262952/207666015-579103f5-0670-4b63-b28d-459ed42bc2c9.png)

Note that the dash is actually an 'EN DASH' character (U+2013, UTF-8 bytes `E28093`); see [here](https://www.fileformat.info/info/unicode/char/2013/index.htm).
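To make the difference between the two characters concrete, here is a small Python sketch (not part of the original post) that prints the code point, Unicode name, and UTF-8 bytes of the ASCII hyphen and the en dash. The helper name `describe` is made up for this illustration.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return the code point, Unicode name, and UTF-8 bytes (hex) of one character."""
    utf8_hex = ch.encode("utf-8").hex().upper()
    return f"U+{ord(ch):04X} {unicodedata.name(ch)} utf-8={utf8_hex}"


# The ASCII hyphen that a range like [0-9] needs, and the en dash that was pasted in.
for ch in ("-", "\u2013"):
    print(describe(ch))
```

The two characters look nearly identical in most fonts, but the first is a single byte (`2D`) and the second is the three-byte sequence `E28093` seen in the raw dump.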

This strikes me as a weakness in Antlr, especially for newbies copying and pasting examples and samples from the web. This has literally cost me four hours!

I want to suggest that the system be updated to detect the presence of possibly misleading Unicode characters in grammar files. Visually, `[0-9]` looks the same whether the hyphen is a plain ASCII hyphen or one of the several similar-looking Unicode characters.

The antlr tool did not object to the Unicode EN-DASH inside a regular expression, yet at runtime it completely misbehaves (or seems to). So perhaps reporting invalid characters used in regular expressions will help...
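As a rough illustration of the kind of check being suggested, the following Python sketch scans grammar text for dash look-alikes inside bracketed character sets. It is not part of ANTLR, and the names `CHAR_SET`, `is_dash_lookalike`, and `find_suspicious_dashes` are hypothetical. The bracket pattern is deliberately naive: it would also match brackets that appear in actions, rule arguments, or string literals, so a real check would need the grammar's own lexer.

```python
import re
import unicodedata

# Naive pattern for a bracketed character set such as [0-9] or [a-zA-Z_\]].
# Assumption: brackets in actions or string literals are not filtered out.
CHAR_SET = re.compile(r"\[(?:\\.|[^\]\\])*\]")


def is_dash_lookalike(ch: str) -> bool:
    """True for dash-like characters other than ASCII HYPHEN-MINUS (U+002D)."""
    if ch == "-":
        return False
    # Category Pd (dash punctuation) includes EN DASH, EM DASH and SMALL EM DASH;
    # U+2212 MINUS SIGN is a math symbol, so it is checked separately.
    return unicodedata.category(ch) == "Pd" or ch == "\u2212"


def find_suspicious_dashes(grammar_text: str):
    """Yield (line, column, char_set, unicode_name) for each look-alike found."""
    for line_no, line in enumerate(grammar_text.splitlines(), start=1):
        for match in CHAR_SET.finditer(line):
            for offset, ch in enumerate(match.group()):
                if is_dash_lookalike(ch):
                    column = match.start() + offset + 1  # 1-based column
                    yield (line_no, column, match.group(), unicodedata.name(ch))


# Usage: the first rule contains an en dash (written here as an escape), the second is fine.
sample = "INT : [0\u20139]+ ;\nID  : [a-z]+ ;\n"
for hit in find_suspicious_dashes(sample):
    print(hit)
```

On the sample above, only the `INT` rule is reported, with the offending character named as `EN DASH`.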

kaby76 commented 1 year ago

This is caused by copying and pasting a grammar, or by manually typing it in from paper. We should be scraping grammars using automatic means. A tool can enforce consistency and repeatability of the scrape.

That said, I don't think Antlr will be changed to detect a LEXER_CHAR_SET whose character range should use a Hyphen-Minus but instead uses an En Dash, Em Dash, or Small Em Dash. I have seen this exact problem with the ISO C++ language specs, because the specs are in LaTeX. Antlr does not warn against something like `[a-]` either. Is that supposed to recognize 'a' or '-'? Or was it meant to be a range, like hex digits? Even worse is something like `[ab-.#,]`. Is this supposed to be a range (default) or a set containing '-'? I have seen this several times in some of the grammars in grammars-v4. This sounds like a job for a "linter".

KvanTTT commented 1 year ago

> Antlr does not warn against something like `[a-]` either. Is that supposed to recognize 'a' or '-'?

ANTLR treats it as just two characters, `a` and `-`.

> Even worse is something like `[ab-.#,]`. Is this supposed to be a range (default) or a set containing '-'?

ANTLR treats it as a range from `b` to `.`. ANTLR creates a range only if the hyphen is surrounded by characters on both sides. This is not obvious, but we can't change the behavior because that would break backward compatibility.

BTW, this is described in the documentation; see Lexer Rule Elements.
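The rule stated above (a hyphen forms a range only when it has a character on both sides) can be modelled with a short Python sketch. This is only a toy model of the rule as described in this thread, not ANTLR's actual implementation; the function name `parse_char_set` is hypothetical, and escape sequences are ignored.

```python
def parse_char_set(body: str):
    """Split the inside of a [...] set into single chars and (lo, hi) range tuples.

    Toy model: only ASCII '-' between two characters builds a range;
    escape sequences such as \\] or \\u1234 are not handled.
    """
    items = []
    i = 0
    while i < len(body):
        # A range needs a character before AND after the hyphen-minus.
        if i + 2 < len(body) and body[i + 1] == "-":
            items.append((body[i], body[i + 2]))
            i += 3
        else:
            items.append(body[i])
            i += 1
    return items


print(parse_char_set("a-"))        # trailing hyphen: two single characters
print(parse_char_set("0-9"))       # one range
print(parse_char_set("0\u20139"))  # en dash: three single characters, no range
print(parse_char_set("ab-.#,"))    # 'a', a range from 'b' to '.', then '#' and ','
```

Under this model, the en dash case explains the original report: the set holds `0`, the en dash, and `9` as three separate characters, so the input digit `1` matches nothing.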