Lexing does not appear to respect the declared order of lexer rules.

eclipse-langium / langium-website

Source of langium.org

https://langium.org/

MIT License

Lexing does not appear to respect the declared order of lexer rules. #164

Open NigelWSewell opened 1 year ago

NigelWSewell commented 1 year ago

Description

While writing a JavaDoc extractor, I found that the lexer rules do not appear to behave as the documentation describes. The documentation states:

The order in which terminal rules are defined is critical as the lexer will always return the first match.

In the first screenshot, the grammar extracts the correct text into the syntax tree, so the remaining task is to define terminal rules that ignore everything else.

Screenshot from 2023-07-16 15-18-57

After adding the 'IGNORE' rule, the syntax tree no longer contains the earlier matches; the later 'IGNORE' rule has taken precedence instead.

Screenshot from 2023-07-16 15-19-25

This seems to contradict the documented statement about the order of terminal rules.

Grammar Used

grammar JavaDocExtractor

entry Model: (docs+=JDoc)*;

terminal JDoc: ('/**' -> '*/');

hidden terminal CR: '\r'+;
hidden terminal LF: '\n'+;
//hidden terminal IGNORE: /.+?/;

Test Input


/** foo 1 */
person John
person Jane

/* foo 2*/

Hello John!
Hello Jane!

/** foo 4*/

msujew commented 1 year ago

@NigelWSewell It seems like the documentation skipped over the small detail that we move terminals that can potentially match whitespace characters to the front as a performance optimization. See here.
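
The reordering described above can be illustrated with a small standalone model. This is a sketch, not Langium's actual source: the `Terminal` interface, the `canMatchWhitespace` and `reorder` helpers, the sample whitespace characters, and the regular expressions standing in for the grammar's terminals are all assumptions made for illustration.

```typescript
// Simplified, hypothetical model of the optimization; not Langium's implementation.
interface Terminal {
    name: string;
    pattern: RegExp;
}

// Sample whitespace characters used to probe each pattern (an assumption).
const WHITESPACE_SAMPLES: string[] = [' ', '\t', '\r', '\n'];

// True if the pattern can match at least one of the sample whitespace characters.
function canMatchWhitespace(pattern: RegExp): boolean {
    return WHITESPACE_SAMPLES.some(ws => pattern.test(ws));
}

// Moves whitespace-capable terminals ahead of all others,
// keeping the relative order inside each group.
function reorder(terminals: Terminal[]): Terminal[] {
    const front = terminals.filter(t => canMatchWhitespace(t.pattern));
    const rest = terminals.filter(t => !canMatchWhitespace(t.pattern));
    return front.concat(rest);
}

// Terminals in the order declared in the issue's grammar (IGNORE enabled).
const declared: Terminal[] = [
    { name: 'JDoc', pattern: /\/\*\*[\s\S]*?\*\// },
    { name: 'CR', pattern: /\r+/ },
    { name: 'LF', pattern: /\n+/ },
    { name: 'IGNORE', pattern: /.+?/ },
];

const effectiveOrder: string[] = reorder(declared).map(t => t.name);
console.log(effectiveOrder.join(', '));
```

In this model, `IGNORE` can match a space, so it ends up ahead of `JDoc` even though `JDoc` was declared first. That matches the behaviour seen in the second screenshot.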

Note that unlike in Xtext, it's not recommended in Langium to have a catch-all terminal. Langium's underlying lexer implementation (Chevrotain) works quite differently from ANTLR and catch-all terminals will always lead to trouble (even if the order of tokens is correct). A catch-all token will always consume the rest of the input, as even making it non-greedy doesn't work.

Instead, lexer errors are dealt with on a diagnostics level, and unexpected characters are simply omitted from the token stream.
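
The two points above (a catch-all that is tried first shadows every other terminal, and unmatched characters are skipped and reported rather than tokenized) can be sketched with a tiny first-match lexer. This is a simplified model, not Chevrotain's or Langium's implementation: the `lex` function, the `TokenDef` and `Token` types, and the shortened input are hypothetical, and hidden tokens are not modelled.

```typescript
// Hypothetical first-match lexer used only to illustrate the discussion.
interface TokenDef { name: string; pattern: RegExp; }
interface Token { type: string; image: string; }
interface LexResult { tokens: Token[]; errorOffsets: number[]; }

function lex(input: string, defs: TokenDef[]): LexResult {
    // Anchor every pattern so it can only match at the current offset.
    const anchored = defs.map(d => ({
        name: d.name,
        regex: new RegExp('^(?:' + d.pattern.source + ')'),
    }));
    const tokens: Token[] = [];
    const errorOffsets: number[] = [];
    let offset = 0;
    while (offset < input.length) {
        const rest = input.slice(offset);
        let matched = false;
        for (const def of anchored) {
            const m = def.regex.exec(rest);
            if (m !== null && m[0].length > 0) {
                // The first definition that matches wins; later ones are never tried.
                tokens.push({ type: def.name, image: m[0] });
                offset += m[0].length;
                matched = true;
                break;
            }
        }
        if (!matched) {
            // Unexpected character: record its offset and leave it out of the token stream.
            errorOffsets.push(offset);
            offset += 1;
        }
    }
    return { tokens, errorOffsets };
}

function countOf(result: LexResult, type: string): number {
    return result.tokens.filter(t => t.type === type).length;
}

const JDoc: TokenDef = { name: 'JDoc', pattern: /\/\*\*[\s\S]*?\*\// };
const CR: TokenDef = { name: 'CR', pattern: /\r+/ };
const LF: TokenDef = { name: 'LF', pattern: /\n+/ };
const IGNORE: TokenDef = { name: 'IGNORE', pattern: /.+?/ };

const input = '/** foo 1 */\nperson John\n/* foo 2*/\n/** foo 4*/\n';

// No catch-all: unmatched text becomes lexer errors, JDoc blocks survive.
const withoutCatchAll = lex(input, [JDoc, CR, LF]);
// Catch-all tried first: it claims every character before JDoc is considered.
const withCatchAllFirst = lex(input, [IGNORE, CR, LF, JDoc]);

console.log(countOf(withoutCatchAll, 'JDoc'), withoutCatchAll.errorOffsets.length);
console.log(countOf(withCatchAllFirst, 'JDoc'), withCatchAllFirst.errorOffsets.length);
```

In this model, the run without a catch-all still finds both JavaDoc blocks and reports the remaining text as errors, while the run with `IGNORE` tried first produces no `JDoc` tokens at all.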

NigelWSewell commented 1 year ago

@msujew That would explain the behaviour well enough.

Is there a workaround to this? Either:

A way of forcing strict declaration order.
A way of ignoring other syntax errors.
A complete non-whitespace character set to catch other unwanted text.
Something else I've not thought of.

Either way, I'm sure this is a question/mistake many people coming from ANTLR/Xtext will encounter, so this can be a good opportunity to improve the documentation.

p.s.: Thanks for working on Sunday!

msujew commented 1 year ago

Is there a workaround to this?

Not directly in the grammar, though you can override the DefaultTokenBuilder to prevent the behavior. We should probably add a flag to disable the optimization.

Either way I'm sure this is a question/mistake many people from ANTLR/Xtext will encounter so this can be a good opportunity to improve the documentation.

I assume so as well. We should probably mention that in the docs.