antlr / antlr4

ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files.
http://antlr.org
BSD 3-Clause "New" or "Revised" License
17.04k stars 3.27k forks source link

A new lexer command: Less #212

Open massaad opened 11 years ago

massaad commented 11 years ago

It will be helpful if we can do some look-ahead while in the lexer without tying the grammar to a particular target language.

So, I am proposing adding a new lexer command that is almost the opposite of more. Using an example derived from the example in the book (in section 12.4):

SPECIAL_OPEN : '<?' Name -> less, pushMode(PROC_INSTR);

when matched, will switch to PROC_INSTR mode, but then will start matching again starting from <?.

sharwell commented 11 years ago

This is a pretty interesting idea. I was concerned about the semantics when I saw the title, but your explanation is quite clear and it might not be too difficult to implement.

parrt commented 11 years ago

Hi. so do you propose that with the "less", it would merge lexer rules XMLDeclOpen and SPECIAL_OPEN, making it less complex? Can you think of another use case this? I thought at first you are going to suggest something like &foo which succeeds if foo is found next on the input stream but without consuming it; kind of a lookahead operator.

massaad commented 11 years ago

The use case I have is similar to the example in the book, but it is not XML, and there are open tokens but not close tokens. So, I want to lookahead to see the next open token to switch modes.

As for a lookahead operator instead of a lexer command, it does not really matter to me, they will both work equally well. I just proposed a lexer command to follow the style of ANTLR, since it already had lexer commands.

Of course, I know that this can be very easily done with lexer predicates, but I would like to not tie my grammar to a target language.

grosenberg commented 10 years ago

This issue is related to one I raised some time ago (cannot find the issue #, but see https://github.com/antlr/antlr4/blob/master/tool/test/org/antlr/v4/test/PositionAdjustingLexer.g4 and http://antlr.1301665.n2.nabble.com/Lookahead-predicates-in-the-Lexer-td7579180.html).

The issue is simply to have some non-native (or better said, a native-Antlr) way to disambiguate alternatives in the lexer.

mode funny_sequence_with_odd_exit_pattern ;
A : X Y Z { ~Comma }? -> popMode ; // the Comma token is not consumed 
B : .  ;

Not really specific to lexer modes, though it would make defining when to enter/exit much easier. In v3, it was done using a predicate. Being able to write

A : X Y Z ~&Comma ; // consuming only X Y and Z if the next element is not a comma

would work as well.

gagern commented 8 years ago

In the question Syntactic predicates in ANTLR lexer rules on Stack Overflow I've looked at the INT '..' vs. REAL: INT '.' ambiguity from the Antlr v2 manual and the Antlr 3 faq, looking for ways to address this in Antlr 4. The way I see it, the methods suggested here, both less and ~&'.', could be used to address that, too. It would be good to have some way of looking ahead and then consuming only part of the matched text.

Groostav commented 8 years ago

I'm writing a yaml parser, and to determine if the root is a horizontal or vertical style yaml document, I'd like the DEFAULT_MODE to do a quick switch for me:

two equivalent yaml documents:

#horizontal
{ x: { y : z} }
#vertical
x:
  y: z

as you might imagine, this maps to two lexer modes, vertical and horizontal. So what Id like the DEAFAULT_MODE to contain is simply

INDENT_H_INIT : '{' -> type(INDENT_H), mode(HORIZONTAL) ;
INDENT_V_INIT : . -> less, type(INDENT_V_0), mode(VERTICAL);

without that less rule, I'm actually having a really hard time doing this.

so I'm :+1: here.

Groostav commented 8 years ago

huh, and github's own syntax highlighter doesn't like my inline yaml document, looks like they might need to use less also :p

parrt commented 8 years ago

why don't modes alone solve this? can't see it right away.

Groostav commented 8 years ago

my problem is picking the initial mode, which I don't want to do in java, I want it in my lexer. Keeping in mind that in my document I want the x to match a subsequent KEY lexer rule, I need to match the first character as either either empty string or '{' to get myself into the correct initial mode. So far the rule that I'm trying right now is

INITIAL_INDENT_V : {} -> skip, mode(value), pushMode(vertical);

Which appears to do the job, though it of course it causes ANTLR to complain about a rule that can match the empty string.

And I believe both me and @gagern's thinking is that there's probably a way to do it with SDT's that mess around with _input, and, in my case, are empty-string matchers, but that isn't idiomatic to anybody reading ANTLR. I'd infinitely prefer for my rule to be INDENT : . -> less than INDENT : {} or INDENT : . {_input.foolAround()}

parrt commented 8 years ago

Why doesn't the first char, '{' or not, tell you what to do? First mode is default (horiz) but you flip to vert if you see '{'.

Groostav commented 8 years ago

Its a little bit more complicated because value is a valid yaml document, which should be interpreted as a value instead of a key (I guess?), which requires a third entry in this bootstrapping table, but I must confess I've been thinking of horizontal as my default mode, and if I simply assume vertical is my default mode I can probably avoid this problem. I've got a little too attached the idea of an initial 'bootstrapping' default mode, with named modes for everything else.

Still, would be nice if there was a formalism for rules that consume nothing.

frogmonster commented 8 years ago

ANTLR is one of my favorite pieces of software, and it's quite honestly very hard to find anything lacking. So, I feel bad for even posting this :-). But, this is one feature that I would be particularly fond of.

There are times when one would like to match a given character sequence but without consuming the token -- in order to pass the entire sequence to another mode and guide the parser by feeding it a token of a particular type.

As a work around, I've used a bit of a hack, where I override the NextToken() method and rollback the stream when a particular token is found. It works, but it's just not as readable as a "less" command.

It's not that a lexer / parser can't be written without the less command (or without overriding NextToken()); it's just that one can greatly simplify the grammar with the ability to match without consuming in the lexer. I think the "less" command would really round out the feature set of the mode based lexer.

parrt commented 8 years ago

These syntactic predicates are occasionally useful, yep!

KvanTTT commented 8 years ago

It is very natural to have a less command in ANTLR.

Consider the following non-greedy rule for multiline comments:

MultilineComment:    '/*' .? '*/' -> channel(HIDDEN);

At present time modes with more command allows us to implement such comments without non-greedy syntax:

MultilineCommentStart:        '/*'  -> more, mode(COMMENTS);

mode COMMENTS;

MultilineComment:             '*/'  -> mode(DEFAULT_MODE), channel(HIDDEN);
MultilineCommentNotAsterisk: ~'*'+  -> more;
MultilineCommentAsterisk:     '*'   -> more;

With less command it will look like this:

MultilineComment:             '/*' ->  mode(COMMENTS), channel(HIDDEN);

mode COMMENTS;

MultilineCommentEnd:          '*/'  -> less, mode(DEFAULT_MODE);
MultilineCommentNotAsterisk: ~'*'+  -> less;
MultilineCommentAsterisk:     '*'   -> less;

The second version is more intuitive on my opinion because of MultilineComment token starts from the default mode and other tokens appended to him. In the first case MultilineComment token ends the sequence of tokens, that is not very clear.

Also such comments can be implemented without more or less commands. In this case parts of token can be handled in parser. It can be used for JavaDoc parsing for example.

multilineComment
    : MultilineCommentStart (MultilineCommentNotAsterisk | MultilineCommentAsterisk)* MultilineCommentEnd
    ;

MultilineCommentStart:        '/*' -> mode(COMMENTS), channel(HIDDEN);

mode COMMENTS;

MultilineCommentEnd:          '*/'  -> mode(DEFAULT_MODE), channel(HIDDEN);
MultilineCommentNotAsterisk: ~'*'+  -> channel(HIDDEN);
MultilineCommentAsterisk:     '*'   -> channel(HIDDEN);
sharwell commented 7 years ago

@KvanTTT wrote:

The second version is more intuitive on my opinion because of MultilineComment token starts from the default mode and other tokens appended to him.

This is not how the less command would work. After a token is emitted (in your case MultilineComment emits a token), that token cannot be modified. ANTLR lexers are pull-based lexers that match and emit exactly one token on demand, at which point recognition stops until the next token is matched.

If implemented, the less command would work exactly like more, except that it would also reset the input stream position back to the beginning of the token (as though no input were matched so far for the current token).

:bulb: You could implement a less command of the form you described by declaring a LESS token type in your grammar, using -> type(LESS), and creating a custom TokenStream implementation (potentially extending CommonTokenStream) which uses lookahead to conditionally merge tokens at that point. But this would be a stream feature, not a lexer feature.

sharwell commented 7 years ago

💭 Here's an idea for a use for this proposed command. This sample assumes JavaDoc comments are parsed as normal comments initially. Then the complete text of just a single JavaDoc comment is passed to this lexer to break it into tokens. The "decorations" (the /** and */ that delimit the comment, plus leading * on each line) are placed on their own channel, and the rest of the input appears as tokens on the default channel.

JavaDocStart : '/**' -> channel(DECORATOR), pushMode(JavaDoc);

mode JavaDoc;

  JavaDocNewLine : '\r'? '\n' -> type(NEWLINE), pushMode(JavaDocLineStart);
  JavaDocText : ~[*\r\n]+;
  JavaDocAsterisk : '*' -> type(JavaDocText);
  END_COMMENT : '*/' -> channel(DECORATOR), popMode;

mode JavaDocLineStart;

  JavaDocLineEnd : [ \t]* '*/' -> less, popMode;
  JavaDocLineDecorator : [ \t]* '*' -> channel(DECORATOR), popMode;
  JavaDocLineStart_Other : -> less, popMode;

This all assumes less has the following very simple implementation in Lexer:

public void less() {
  _input.seek(this._tokenStartCharIndex);
  _type = MORE;
}
sharwell commented 7 years ago

You could also use it to get rid of some uses of PositionAdjustingLexer. In the following example, the identifier channels is treated as a keyword only when it's followed by a {.

PredicatedChannels
    :   'channels' [ \t\r\n]* '{' -> less, pushMode(ChannelsCommand)
    ;

mode ChannelsCommand;

    CHANNELS : 'channels' -> popMode;
parrt commented 7 years ago

Instead of less command, I think I'd prefer a syntactic predicate, in particular, a "not predicate" that says "match this token but only if the left or right context is not x"

tvkit commented 7 years ago

@parrt why don't modes alone solve this?

I have a child mode that should pop when the parent mode terminator is encountered, but without consuming the terminating token; otherwise, the parent won't pop. And it's not enough to write a lexer rule that consume all but the parent's terminator; i.e., the parser needs to see the individual tokens. As we'd all prefer, the child-mode understands the requirements of the parent-mode.

Consider the following example with an ant-like variable subject to bash-like var manipulation:

"some text with a variable ${myvar::10} substring manipulation" "some text with a variable ${myvar::10:5} substring manipulation" "some text with a variable ${myvar:/this/that} replace manipulation"

The lexer's inital non-default parent-mode is triggered by the '${'. The subsequent child-mode is triggered by the optional colon. The parser needs to see the individual ops and operands for the bash-like manipulation deliniated by the child-mode. Lexer rules that consume all but the parent-mode's terminator do not provide the parser adequate details about the individual manipulator tokens. The manipulator's mode (in my example) should allow for any combination of characters excluding the non-escaped parent-mode terminator, '}'.

Perhaps it's not necessary to mode the manipulator, but it seems like a reasonable capability.

DylanLukes commented 6 years ago

@tvkit I have the same problem you have in implementing a lexer grammar which includes white space delimited (unquoted) strings, which can potentially conflict with keywords.

Here is an example, using the Swift runtime, of an adapted form of PositionAdjustingLexer. Rather than overriding emit and defining handlers (and thus incurring overhead in every emit call), I instead opted to define an action which allows adjusting the accept position. I also take advantage of my mode-pushing rule only matching one character, which means that keyword rules take precedence.

https://gist.github.com/DylanLukes/aa5790554c3b4b453b8741d009838c85

Key excerpt:

open class MyBaseLexer : PositionAdjustingLexer {
    /**
     * Adjusts the accept position, accepts negative numbers.
     */
    func adjustAcceptPosition(by: Int) -> Bool {
        let interp = (getInterpreter() as! PositionAdjustingLexerATNSimulator)

        let (start, stop) = (_tokenStartCharIndex, getCharIndex())
        let textLength = stop - start

        let lastMatchLength = getText().utf8.count

        if (getInputStream()!.index() > start + textLength - lastMatchLength) {
            let offset = textLength + by - 1

            interp.resetAcceptPosition(
                getInputStream()!,
                _tokenStartCharIndex + offset,
                _tokenStartLine,
                _tokenStartCharPositionInLine + offset)
            return true
        }

        return false
    }

    /**
     * Shortcut for ease of writing in a lexer.
     */
    func adjust(by offset: Int) -> Void {
        let _ = adjustAcceptPosition(by: offset)
    }
}
// ...
KEYWORD : 'foo' | 'bar' | 'baz' ;
WS : [ \t\r\n] ;

// KEYWORD will always take precedence over this rule, as it only matches 1 character.
UnquotedString_Start: UNQUOTED_STRING_LEAD_CHAR -> more, pushMode(UnquotedString);

fragment UNQUOTED_STRING_LEAD_CHAR : [a-zA-Z] ;
fragment UNQUOTED_STRING_CHAR : UNQUOTED_STRING_LEAD_CHAR | [0-9] ;

mode UnquotedString;
// Match terminating whitespace, but don't consume it.
UnquotedString_End  : (WS | EOF) { adjust(by: -1) } -> type(UNQUOTED_STRING), popMode ;
UnquotedString_Char : UNQUOTED_STRING_CHAR  -> more ;

This should be pretty easily adaptable to any other host language. Note that you can just define methods directly using @members without the protocol/inheritance hackery I'm using here to define mine outside of the grammar. See here for a canonical example from this repository.


While I am 👍 for a less command, using the above example as illustration: I don't think it's actually enough.

less would be best complemented by another command enough, which is it's opposite: it ends MORE mode, rather than beginning it, but like less, does not consume any input. This could be implemented by marking a position at the beginning of the lexer rule decorated with enough, and rewinding the input stream to that marked position if the rule matches.

Both less and enough fill different niches. They can't stand in for one-another. enough cannot be implemented in terms of less, because less would initiate MORE, but enough would terminate it.

As an example, unquoted strings as described above could be much more simply defined using enough:

UnquotedString_Start: UNQUOTED_STRING_LEAD_CHAR → more, pushMode(UnquotedString); 

mode UnquotedString;

UnquotedString_End  : WS -> enough, type(UNQUOTED_STRING), popMode ;
UnquotedString_Char : UNQUOTED_STRING_CHAR  -> more ;

You could use less as well here if you wanted to transition into distinct modes for the first character and subsequent characters:

UnquotedString_Start: UNQUOTED_STRING_LEAD_CHAR → less, pushMode(UnquotedString_FirstChar); 

mode UnquotedString_FirstChar;
// Emit a token that can be used to provide better error messages.
BAD_PREFIX : ('foo' | 'bar' | 'baz') → popMode ;
UnquotedString_First : UNQUOTED_STRING_LEAD_CHAR  → more, mode(UnquotedString);

mode UnquotedString;
UnquotedString_End  : WS -> enough, type(UNQUOTED_STRING), popMode ;
UnquotedString_Char : UNQUOTED_STRING_CHAR  -> more ;

@parrt I hope this addresses why modes are currently not powerful enough. It's not currently possible to transition in and out of modes without consuming any characters (without custom actions, i.e. in a language-agnostic manner), which limits their utility. You can technically use an empty lexer rule to exit a mode, but it will emit a warning and seems like undefined behavior.

@gagern I don't think that a syntactic predicate like ~&'...' alone would be sufficient either. This doesn't solve the problem of exiting a mode (and terminating a MORE collection) without consuming input. It couldn't be used to write an unquoted string grammar as shown above, for example.

sinaa commented 5 years ago

To revive this discussion, we've faced a similar need to both the OP and @DylanLukes .

Having both less and enough would be tremendously helpful.

tobega commented 5 years ago

I have a situation with dereferencing values that currently involves a fair amount of duplication in several lexer modes.

Consider the case of string interpolation, 'Hello $name;, you look good today!' where the $name; gets replaced by the dereferenced value. Now obviously this is a case where being in a string pretty much requires being in a separate lexer mode, so I need to switch modes at '$' to do dereferencing and then switch back at ';'. So far so good.

Dereferencing can of course be more involved with fields '$person.name;' or arrays '$names(1)' or any chain of such.

But then we have a lot of other places that we dereference, say in an arithmetic expression. There we don't want to use the ';' generally, so any non-identifier character should signal to pop out of dereference mode, but that character then needs to be interpreted in the next context.

If the dereferencing happens in default mode, we would have to accept that a ';' would always pop the mode, to handle the string case, but ';' can mean other things that don't necessarily work with popping the mode.

At the moment, I have to duplicate dereferencing tokens in several lexer modes, which in turn means duplicating parser rules for all those modes, not ideal.

renesugar commented 4 years ago

less() like yyless() would let you match lookahead characters in ANTLR4 without resorting to another programming language.

https://github.com/php/php-src/blob/master/Zend/zend_language_scanner.l

<ST_DOUBLE_QUOTES,ST_HEREDOC,ST_BACKQUOTE>"$"{LABEL}"->"[a-zA-Z_\x80-\xff] {
    yyless(yyleng - 3);
    yy_push_state(ST_LOOKING_FOR_PROPERTY);
    RETURN_TOKEN_WITH_STR(T_VARIABLE, 1);
}

Using less() to leave matched characters in the input stream (lookahead expressed in ANTLR4):

StDoubleQuotes: '$' LABEL '->' [a-zA-Z_\x80-\xff] -> less(3), pushMode(StLookingForProperty), type(Variable);

Using lookahead characters to check for pattern (lookahead expressed in a specific programming language):

StDoubleQuotes: '$' LABEL {(self._input.LA(1) == ord('-')) and (self._input.LA(2) == ord('>') and None != re.match(r'[a-zA-Z_\x80-\xff]', self._input.LA(3))}? -> pushMode(StLookingForProperty), type(Variable);

Implementing less() in lexer:

class LexerWithLess(Lexer):
  def __init__(self):
    super(LexerWithLess, self).__init__()

  def less(self, x):
      return self._input.seek(self._input.index() - x)
StDoubleQuotes: '$' LABEL '->' [a-zA-Z_\x80-\xff] { self.less(3) } -> pushMode(StLookingForProperty), type(Variable);