antlr / antlr4

ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files.
http://antlr.org
BSD 3-Clause "New" or "Revised" License

[Antlr4.7] token recognition error at #1992

Open gaulouis opened 7 years ago

gaulouis commented 7 years ago

Hello,

I'm trying ANTLR 4.7 (Java runtime) with a test grammar in this repository:

$ echo "<?php echo Hell ?>" | java org.antlr.v4.gui.TestRig Php block -tree
line 1:11 token recognition error at: 'H'
line 1:12 token recognition error at: 'el'
line 1:14 token recognition error at: 'l'
(block (prolog <?php) (statement (function_echo echo)) (epilog ?>))

"Hell" is invalid input, so I expect errors. But is it normal to get token recognition error at: 'el', which covers two characters?

gaulouis commented 7 years ago

I tried the same thing with ANTLR 4.6:

gaulouis@gaulouis-desktop:~/local/src/tmp/test/php_antlr$ java org.antlr.v4.gui.TestRig Test block -tree
<?php echo Hell ?>
line 1:11 token recognition error at: 'H'
line 1:12 token recognition error at: 'el'
line 1:14 token recognition error at: 'l'
(block (prolog <?php) (statement (function_echo echo)) (epilog ?>))
gaulouis@gaulouis-desktop:~/local/src/tmp/test/php_antlr$ java org.antlr.v4.gui.TestRig Test block -tree
<?php echo HEll ?>
line 1:11 token recognition error at: 'H'
line 1:12 token recognition error at: 'E'
line 1:13 token recognition error at: 'l'
line 1:14 token recognition error at: 'l'
(block (prolog <?php) (statement (function_echo echo)) (epilog ?>))

I am surprised to get two different errors depending on the letter case: line 1:12 token recognition error at: 'el' in the first run, but line 1:12 token recognition error at: 'E' in the second.

The behaviour is the same with ANTLR 4.5.3/4.5.4.

To get ANTLR 4.6, I did:

$ git clone https://github.com/antlr/antlr4 antlr4.6
$ cd antlr4.6
$ git checkout -b antlr_4-6 4.6
$ mvn -DskipTests install
$ export CLASSPATH="`pwd`/tool/target/antlr4-4.6-complete.jar"
$ java org.antlr.v4.Tool
ANTLR Parser Generator  Version 4.6

Maybe I made a mistake somewhere? Can you help me?

siliconvoodoo commented 4 years ago

It's probably because 'e' matches the beginning of 'echo': the lexer accepts 'e', fails on the following 'l', and reports both characters together. I wouldn't worry about it too much. If the lexer accepts valid code and rejects invalid code, it is doing its job.
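
To make the explanation above concrete, here is a toy model in Python. It is not ANTLR's implementation, only a sketch under assumptions: the literals `<?php`, `?>` and `echo` are guessed from the parse tree in the report, whitespace is assumed to be skipped, and `toy_lex` is a hypothetical name. It follows the input as long as it is still a prefix of some literal, and on failure reports everything from the token start through the character it got stuck on.

```python
# Toy model only: NOT ANTLR's implementation. The literals are guessed from
# the parse tree in the report; whitespace is assumed to be skipped.
# No literal is a prefix of another, so no backtracking is needed here.
LITERALS = ["<?php", "?>", "echo"]


def toy_lex(text):
    """Return (tokens, errors); each error is (column, offending_text)."""
    tokens, errors = [], []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        # Walk forward while the consumed text is still a prefix of a literal.
        end = pos
        while end < len(text) and any(
            lit.startswith(text[pos:end + 1]) for lit in LITERALS
        ):
            end += 1
        if text[pos:end] in LITERALS:
            tokens.append(text[pos:end])
            pos = end
        else:
            # Report from the token start through the character that could
            # not be matched, then resume right after that character.
            stop = min(end + 1, len(text))
            errors.append((pos, text[pos:stop]))
            pos = stop
    return tokens, errors


if __name__ == "__main__":
    for source in ("<?php echo Hell ?>", "<?php echo HEll ?>"):
        tokens, errors = toy_lex(source)
        for column, bad in errors:
            print(f"line 1:{column} token recognition error at: '{bad}'")
        print(tokens)
```

Under this model, `Hell` yields errors at 'H', 'el' and 'l' (the 'e' is consumed as a possible start of 'echo' before the 'l' fails), while `HEll` yields four single-character errors because no literal starts with 'E'. Those are the same columns and texts as in the two runs reported above.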

timmc commented 2 years ago

The problem is that your lexer doesn't cover every character sequence that can appear in the input. There's no token that corresponds to "H", "He", "Hel", etc. What you can do is add an ANY : . ; rule at the very end of the lexer rules, so that it matches whatever nothing else matched. Then use ANY in your parser rules wherever you want to match arbitrary text.
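
The effect of such a catch-all rule can be sketched with a small Python model. Again, this is only an illustration and not how ANTLR works internally: the rule names `PROLOG`, `EPILOG` and `ECHO`, the function `toy_lex_with_any`, and the assumption that whitespace is skipped are all hypothetical. The point is that once a last-resort one-character rule exists, every character ends up in some token, so the lexer has nothing left to complain about.

```python
# Toy model only: NOT ANTLR's implementation. Rule names and literals are
# hypothetical; whitespace is assumed to be skipped.
LITERALS = {"PROLOG": "<?php", "EPILOG": "?>", "ECHO": "echo"}


def toy_lex_with_any(text):
    """Tokenize with a catch-all; returns a list of (rule_name, text) pairs."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        # Prefer a literal that matches at this position; the longest wins.
        matches = [
            (name, lit) for name, lit in LITERALS.items()
            if text.startswith(lit, pos)
        ]
        if matches:
            name, lit = max(matches, key=lambda m: len(m[1]))
        else:
            # Stand-in for a final `ANY : . ;` rule: exactly one character.
            name, lit = "ANY", text[pos]
        tokens.append((name, lit))
        pos += len(lit)
    return tokens


if __name__ == "__main__":
    for token in toy_lex_with_any("<?php echo Hell ?>"):
        print(token)
```

In this model, `Hell` becomes four `ANY` tokens between `ECHO` and `EPILOG`. Deciding whether that stray text is acceptable then becomes the parser's job, which is why the parser rules need to mention ANY wherever arbitrary text is allowed.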