antlr / antlr4

ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files.
http://antlr.org
BSD 3-Clause "New" or "Revised" License

New lexer context option proposal: emitOnMatch=true ? #491

Open Anthony-Breneliere opened 10 years ago

Anthony-Breneliere commented 10 years ago

As I understand it, the lexer chooses the rule that matches the longest string of incoming characters; when several rules match the same number of characters, the first one in the grammar wins.

1st case: The lexer reads HELLO:
 Rule 'HELLO' -> 5 characters matched
 Rule [A-Z]+ -> 5 characters matched
 Rule 1 WINS, because it is the first rule in the lexer

2nd case: The lexer reads HELLOS:
 Rule 'HELLO' -> 5 characters matched
 Rule [A-Z]+ -> 6 characters matched
 Rule 2 WINS, because it consumes more characters
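The two cases above can be reproduced with a minimal Python sketch of the longest-match rule with a first-rule tie-break. This is only an illustration using Python's `re` module, not ANTLR's actual implementation; the rule names `HELLO` and `WORD` and the function `pick_rule` are hypothetical.

```python
import re

# Rules in grammar order. The names are made up for this sketch.
RULES = [
    ("HELLO", r"HELLO"),   # rule 1: the literal 'HELLO'
    ("WORD", r"[A-Z]+"),   # rule 2: [A-Z]+
]

def pick_rule(text):
    """Return (rule_name, matched_length) for the start of text.

    The longest match wins; on a tie, the rule listed first wins,
    because a later rule only replaces the best one if it is strictly longer.
    """
    best = None
    for name, pattern in RULES:
        m = re.match(pattern, text)
        if m and (best is None or m.end() > best[1]):
            best = (name, m.end())
    return best

print(pick_rule("HELLO"))   # ('HELLO', 5): tie, first rule wins
print(pick_rule("HELLOS"))  # ('WORD', 6): rule 2 consumes more characters
```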

It would be useful to have a context option (like greedy) that tells the lexer to stop looking for a longer match. In the second case, that would let rule 1 win.

In my example, I would like the lexer to identify special strings that contain file paths, and to fall back to a default string when no special string is identified.

"C:\share\archive\toto.txt"  => QUOTE VOLUME DIRFILE DIRFILE DIRFILE QUOTE
"\\server\archive\toto.txt"  => QUOTE SERVER DIRFILE DIRFILE QUOTE
"Any text with characters like : or \"  => QUOTE STRING_VALUE QUOTE

The rules would be:

QUOTE : '"';
VOLUME : [a-zA-Z] ':';
DIRFILE : '\\' FILECHARS;
SERVER : '\\\\' FILECHARS;
STRING_VALUE : ~["]+;

The problem is that VOLUME 'C:' and SERVER '\\server' can never be matched, because the default STRING_VALUE rule matches a longer string (C:\toto.txt) than VOLUME (C:). So the parser will always receive the tokens QUOTE STRING_VALUE QUOTE. My proposal is to add the option to the rules that are located BEFORE the STRING_VALUE rule:

VOLUME : ( options emitOnMatch=true : [a-zA-Z] ':' );
DIRFILE : ( options emitOnMatch=true : '\\' FILECHARS );
SERVER : ( options emitOnMatch=true : '\\\\' FILECHARS );

This is one idea for handling the case above. It would also allow the lexer to process some bytecode languages where there are no separators between keywords, where keyword lengths vary, and where data values must be distinguished from keywords.
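To make the intended behavior concrete, here is a rough Python simulation of one possible reading of the proposal: at the start of each token, a rule flagged emitOnMatch wins as soon as it matches, with its own match length, even when an unflagged rule could consume more. This is a sketch under stated assumptions, not ANTLR behavior. FILECHARS is not defined in this post, so the definition below (anything except a backslash, a double quote, or a colon) is an assumption, and the `tokenize` helper is hypothetical.

```python
import re

# Assumption: FILECHARS is not defined in the post; this definition is a guess.
FILECHARS = r'[^\\":]+'

# (name, regex, emit_on_match), in grammar order.
RULES = [
    ("QUOTE", r'"', False),
    ("VOLUME", r'[a-zA-Z]:', True),
    ("DIRFILE", r'\\' + FILECHARS, True),
    ("SERVER", r'\\\\' + FILECHARS, True),
    ("STRING_VALUE", r'[^"]+', False),
]

def tokenize(text):
    """Return the list of token names for text under the simulated option."""
    tokens = []
    pos = 0
    while pos < len(text):
        # Collect every rule that matches at the current position.
        matches = []
        for name, pattern, emit in RULES:
            m = re.compile(pattern).match(text, pos)
            if m:
                matches.append((name, m.end(), emit))
        if not matches:
            raise ValueError(f"no rule matches at position {pos}")
        flagged = [c for c in matches if c[2]]
        if flagged:
            # The first flagged rule that matches wins, whatever its length.
            name, end, _ = flagged[0]
        else:
            # Default behavior: longest match, first rule wins on a tie
            # (max returns the first maximal element).
            name, end, _ = max(matches, key=lambda c: c[1])
        tokens.append(name)
        pos = end
    return tokens

print(tokenize(r'"C:\share\archive\toto.txt"'))
print(tokenize(r'"\\server\archive\toto.txt"'))
print(tokenize(r'"Any text with characters like : or \"'))
```

With these assumptions, the first input yields QUOTE VOLUME DIRFILE DIRFILE DIRFILE QUOTE and the last one yields QUOTE STRING_VALUE QUOTE. Flagged rules are only tried at the start of a token, so a colon or backslash in the middle of a STRING_VALUE does not start a new token.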

sharwell commented 10 years ago

What happens if you have the following input?

"Any text with characters like: or \"

Would the e: within that string match VOLUME or still be part of STRING_VALUE?

Anthony-Breneliere commented 10 years ago

In that case the behavior does not change. The lexer has already read Any text with characters lik as part of STRING_VALUE; the rule VOLUME was discarded when the lexer encountered the 'n' of Any, which is not the ':' that VOLUME requires. In the same way, the input for a SERVER (like \\server) starts with the characters of a DIRFILE, but when the lexer consumes the second \, it discards the rule DIRFILE.

Anthony-Breneliere commented 10 years ago

The case where e: should match VOLUME could be handled by another option => #492

whitten commented 10 years ago

I'm struggling to figure out how you can get a short match given that you are creating a DFA as you parse. Are you supposed to mark states that could be stop states as final?


Anthony-Breneliere commented 10 years ago

I do not know what a stop state is. But take this example:

 The lexer reads HELLOS:
 Rule 'HELLO' -> 5 characters matched
 Rule [A-Z]+ -> 6 characters matched

When the lexer reaches the O of the rule HELLO (a rule marked 'emitOnMatch'), it will not go on to read the S for the rule [A-Z]+. Do you mean adding a stop on the state 'O' of 'HELLO'?
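One way to picture the "stop on the state 'O'" idea is a small hand-built DFA for the two rules, sketched below in Python. By default the simulation remembers the last accepting state and keeps reading; when the accepting state belongs to a rule listed in `emit_on_match`, it stops there instead. The state encoding and the names `step`, `accepting_rule`, `match`, and `WORD` are all hypothetical, and this is not how ANTLR builds its lexer DFA.

```python
KEYWORD = "HELLO"
WORD_STATE = "W"  # state reached once the input can only match [A-Z]+

def step(state, ch):
    """Advance the DFA by one character; None means no transition exists."""
    if not ("A" <= ch <= "Z"):
        return None
    # Integer state n means "the first n letters of HELLO have been read".
    if isinstance(state, int) and state < len(KEYWORD) and ch == KEYWORD[state]:
        return state + 1
    return WORD_STATE

def accepting_rule(state):
    """Name of the rule accepted in this state, or None for the start state."""
    if state == len(KEYWORD):
        return "HELLO"  # both rules accept here; the first rule has priority
    if state == 0:
        return None
    return "WORD"

def match(text, emit_on_match=frozenset()):
    """Return (rule, length) for the token at the start of text."""
    state = 0
    last_accept = None
    for i, ch in enumerate(text):
        state = step(state, ch)
        if state is None:
            break
        rule = accepting_rule(state)
        if rule is not None:
            last_accept = (rule, i + 1)
            if rule in emit_on_match:
                break  # stop state: emit immediately, do not look further
    return last_accept

print(match("HELLOS"))             # ('WORD', 6): default longest match
print(match("HELLOS", {"HELLO"}))  # ('HELLO', 5): stops on the O
```

Note that in this sketch, flagging a rule with a loop such as [A-Z]+ would make it stop after a single character, so the option only seems practical on rules whose first complete match is the intended one.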