antlr / antlr4

ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files.
http://antlr.org
BSD 3-Clause "New" or "Revised" License

Symbolic token names #238

Closed rensink closed 10 years ago

rensink commented 11 years ago

May I suggest making the "symbolic token name" available to the programmer? Currently, defining

STRUCT: 'struct';
LBRACE: '{';

produces tokens named 'struct' and '{' where I would prefer to have STRUCT and LBRACE. There seems to be no way to access these user-defined names except by reflection on the parser class to retrieve the static field names.

Thanks, Arend
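
For readers who need the workaround mentioned above, here is a minimal sketch of the reflection approach. It assumes the generated parser exposes token types as public static final int fields and rule indices as fields prefixed with RULE_. DemoParser is a hypothetical stand-in rather than real ANTLR output, and SymbolicNames.fromConstants is a helper invented for illustration.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a generated parser class (not real ANTLR output).
class DemoParser {
    public static final int STRUCT = 1, LBRACE = 2;
    public static final int RULE_decl = 0; // rule index constant, not a token type
}

class SymbolicNames {
    // Maps each public static final int constant to its field name,
    // skipping constants whose names start with "RULE_".
    static Map<Integer, String> fromConstants(Class<?> parserClass) {
        Map<Integer, String> names = new TreeMap<>();
        for (Field f : parserClass.getFields()) {
            int m = f.getModifiers();
            boolean constant = Modifier.isStatic(m) && Modifier.isFinal(m)
                    && f.getType() == int.class;
            if (!constant || f.getName().startsWith("RULE_")) {
                continue;
            }
            try {
                names.put(f.getInt(null), f.getName());
            } catch (IllegalAccessException e) {
                throw new IllegalStateException(e);
            }
        }
        return names;
    }
}
```

Calling SymbolicNames.fromConstants(DemoParser.class) maps 1 to STRUCT and 2 to LBRACE. On a real parser class, inherited int constants would be picked up as well and may need extra filtering.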

arifogel commented 10 years ago

I have this exact issue right now. I'm trying to pretty-print parse trees, and only the tokens whose rules are more than a single string literal appear under their symbolic names in the ParserGrammar.getTokenNames() list.

arifogel commented 10 years ago

I should point out that there is a second non-ideal way to access these names: via the generated .tokens file.
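
A rough sketch of the .tokens route follows. It assumes the file holds one NAME=type entry per line, plus literal aliases such as 'struct'=1 that begin with a single quote. TokensFile.symbolicNames is a hypothetical helper, not part of the ANTLR runtime.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TokensFile {
    // Builds a type -> symbolic name map from the lines of a .tokens file.
    // Lines starting with a single quote are literal aliases and are skipped.
    static Map<Integer, String> symbolicNames(List<String> lines) {
        Map<Integer, String> names = new TreeMap<>();
        for (String line : lines) {
            String s = line.trim();
            int eq = s.lastIndexOf('='); // the type follows the last '='
            if (s.isEmpty() || s.startsWith("'") || eq <= 0) {
                continue;
            }
            int type = Integer.parseInt(s.substring(eq + 1).trim());
            names.put(type, s.substring(0, eq));
        }
        return names;
    }
}
```

In practice the lines could be read with java.nio.file.Files.readAllLines from wherever the build places the generated file. That path depends on the project setup.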

parrt commented 10 years ago

Hmm... well, we set it up so that it would give the "display" name in tokenNames, with the idea that it was likely more useful to the end user than things like LBRACE.

arifogel commented 10 years ago

But the text can be extracted from the token anyway. Here's an example of what my current pretty-printer outputs:

Parsing: "/home/arifogel/git/batfish/test_rigs/unit-tests/configs/underscore_variable"...OK, PRINTING PARSE TREE:


(cisco_configuration (stanza (null_stanza (closing_comment COMMENT_CLOSING_LINE:'!\n'))) (stanza (hostname_stanza 'hostname':'hostname' VARIABLE:'underscore_variable' NEWLINE:'\n')) (stanza (null_stanza (closing_comment COMMENT_CLOSING_LINE:'!\n'))) (stanza (route_map_stanza (route_map_named_stanza 'route-map':'route-map' VARIABLE:'JKL_MNO_PQR' (route_map_tail (access_list_action 'permit':'permit') DEC:'100' NEWLINE:'\n' (route_map_tail_tail (rm_stanza (match_rm_stanza (match_ip_prefix_list_rm_stanza 'match':'match' 'ip':'ip' 'address':'address' 'prefix-list':'prefix-list' VARIABLE:'ABC_DEF' VARIABLE:'_GHI' NEWLINE:'\n')))) (closing_comment COMMENT_CLOSING_LINE:'!\n'))))) 'end':'end' NEWLINE:'\n' EOF:)


I would much prefer to output e.g. "MATCH:'match'" instead of "'match':'match'", especially in the cases where the token name does not quite correspond to the literal text. For example, we have a token IP_ADDRESS_LITERAL which matches the literal text 'ip-address'. But we also have a token IP_ADDRESS which matches actual IP addresses. If a user sees 'ip-address':'ip-address', they might think that it somehow came out of the IP_ADDRESS rule, which is not the case.
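
To make the preferred leaf format concrete, here is a small sketch that assumes the symbolic name has already been obtained by some means. LeafLabel.format is a hypothetical helper. It escapes only newlines, to match the sample output above.

```java
class LeafLabel {
    // Renders a leaf as NAME:'text', showing newlines as a visible \n.
    static String format(String symbolicName, String text) {
        return symbolicName + ":'" + text.replace("\n", "\\n") + "'";
    }
}
```

With this, LeafLabel.format("IP_ADDRESS_LITERAL", "ip-address") yields IP_ADDRESS_LITERAL:'ip-address', which cannot be mistaken for a match of the IP_ADDRESS rule.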

As an aside, what happens when two lexer rules in different modes match the same text, e.g. 'text'? Will both of the tokens' tokenNames entries be 'text'? Or will one be 'text' and the other use the symbolic name?

For reference:
Repository: github.com/arifogel/batfish
Commit: c7ba5184766c038bdd5750fdc38053eec4f2b87c
File: projects/batfish/src/batfish/grammar/ParseTreePrettyPrinter.java

On 08/08/2014 05:56 PM, Terence Parr wrote:

Hmm... well, we set it up so that it would give the "display" name in tokenNames, with the idea that it was likely more useful to the end user than things like LBRACE.
