Feature Request: Parser syntactical candy to override tokenization

antlr / antlr4

ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files.

BSD 3-Clause "New" or "Revised" License

It has been years since I used Antlr2 for a project. At that time it was too costly to adopt.

I'm revisiting Antlr4 for a SQL parser using the tsql grammar.

One issue I've been running into is how keywords are employed within SQL grammars. When dumping database schemas, the lightest-weight syntax is usually employed. In some cases, keywords may appear in "user defined" data, table, and column names. Whether a character sequence is treated as a keyword token depends on its grammatical position (positional context).

A similar concept was discussed in closed issue #483.

I have attempted a proof of concept using predicate rewrite rules. At my Antlr skill level, I found them error prone and extremely difficult to debug. IMO, relying on rewrite predicates puts the capability out of reach for most users writing grammars. Plus, I couldn't figure out how to disable a token using rewrite rules. I'm also assuming greedy parsing would need to be disabled until the override is complete.

Currently, Antlr4 supports a simple lexical context mechanism via lexer modes, which provide "broad brush" (domain context) control over tokenization.

Rewrite rules already provide much of the code infrastructure needed for the override capability. The parser syntax could be extended with a "simple", intuitive way to enable and disable token(s). A more advanced mechanism would define named "token sets", similar to lexical modes.
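
To make the "token sets" idea a bit more concrete, here is a minimal sketch in plain Python. It is not Antlr syntax and not an existing Antlr API. The names `TOKEN_SETS`, `ToyLexer`, `enable`, `disable`, and `classify` are all hypothetical, and the sketch only assumes that keyword recognition can be switched on and off per named set.

```python
# Hypothetical named token sets: keyword spelling -> token type.
# These names and contents are illustrative only.
TOKEN_SETS = {
    "ddl_keywords": {"CREATE": "CREATE", "TABLE": "TABLE", "KEY": "KEY"},
    "type_keywords": {"INT": "INT", "DATE": "DATE"},
}


class ToyLexer:
    """Toy classifier whose keyword recognition is switched per token set."""

    def __init__(self):
        # Every token set starts out enabled.
        self.enabled = set(TOKEN_SETS)

    def enable(self, name):
        self.enabled.add(name)

    def disable(self, name):
        self.enabled.discard(name)

    def classify(self, word):
        # Check only the enabled sets; anything else falls back to ID.
        for name in sorted(self.enabled):
            kind = TOKEN_SETS[name].get(word.upper())
            if kind is not None:
                return kind
        return "ID"


lexer = ToyLexer()
print(lexer.classify("key"))    # keyword while ddl_keywords is enabled
lexer.disable("ddl_keywords")   # e.g. while a column name is expected
print(lexer.classify("key"))    # plain identifier now
lexer.enable("ddl_keywords")
```

In the requested feature, the enable/disable calls would be driven by the parser rule being matched rather than by hand.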

Yeah, you bumped into the same issue. SQL dialects are grammars with multiple meanings overlaid on one character set.

Your comments describe an undeniable truth in Token value comparison operator: there are far too many identifiers/keywords in SQL to create grammatical rules for each one. Additionally, dialects are inconsistent, making all those rules unmaintainable.

Maybe some small steps integrating existing capabilities would get the process moving in the main code base. A quick context switch in parser rules, as a staging step prior to greedy engagement, might be a place to start, along with named sets and the lexer and parser syntax to support them.

An awful alternative, until context awareness is implemented, is to just add platform code implementing context management. Since Antlr supports a bunch of target languages, maintaining that platform code will be a laborious task.
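
For illustration, the kind of hand-written context management meant here might look roughly like the sketch below. It is a self-contained Python toy, not code generated by or written against Antlr. The keyword list, the `NAME_EXPECTED_AFTER` set, and the `tokenize` function are assumptions made up for the example, and the "look at the previous token" heuristic is deliberately crude (it would misclassify `PRIMARY KEY` after a comma, for instance).

```python
import re

# Assumed keyword list for the toy example.
KEYWORDS = {"CREATE", "TABLE", "SELECT", "FROM", "KEY", "DATE", "INT"}

# Assumption: directly after these token types a name is expected,
# so a keyword spelling in that position is treated as an identifier.
NAME_EXPECTED_AFTER = {"TABLE", "FROM", "LPAREN", "COMMA", "DOT"}

TOKEN_RE = re.compile(
    r"\s*(?:(?P<WORD>[A-Za-z_]\w*)|(?P<LPAREN>\()|(?P<RPAREN>\))"
    r"|(?P<COMMA>,)|(?P<DOT>\.))"
)


def tokenize(text):
    """Return (type, text) pairs, demoting keywords by position."""
    text = text.rstrip()
    tokens = []
    pos = 0
    prev = None  # type of the previously emitted token
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected input at offset {pos}")
        pos = match.end()
        kind = match.lastgroup
        value = match.group(kind)
        if kind == "WORD":
            upper = value.upper()
            if upper in KEYWORDS and prev not in NAME_EXPECTED_AFTER:
                kind = upper  # keyword in keyword position
            else:
                kind = "ID"   # keyword spelling used as a name
        tokens.append((kind, value))
        prev = kind
    return tokens


for kind, value in tokenize("CREATE TABLE key (date INT, key INT)"):
    print(kind, value)
```

Every target language would need its own copy of this logic, which is exactly the maintenance burden described above.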

Feature Request: Parser syntactical candy to override tokenization #2268