eclipse-langium / langium

Next-gen language engineering / DSL framework
https://langium.org/
MIT License
721 stars 65 forks source link

Unreachable rule, changing the order does nothing. #1691

Open stutrek opened 4 hours ago

stutrek commented 4 hours ago

Unreachable rule, flipping the order of the rules does nothing. Worked in 3.1, breaks in 3.2.0

Langium version: 3.2.0 Package name: langium

Steps To Reproduce

grammar EraserDsl

entry Model:
    (Newline* title=Title)?
    (nodes+=(NodeDeclaration) | Newline)*;

Title:
    TitleText text=(ID | EscapedText);

NodeDeclaration:
    name=ID;

hidden terminal WS: /[ \t]+/;

hidden terminal ML_COMMENT: /\/\*[\s\S]*?\*\//;
hidden terminal SL_COMMENT: /(\/\/)[^\n\r]*/;
hidden terminal SL_COMMENT_HASH: /#[^\n\r]*/;

terminal TitleText: 'title';

terminal fragment IdBoundary: /[^:{}\[\]"<>,\n\t \-]/;
terminal ID: (IdBoundary '-'? (IdBoundary+ '-'? | ' '+)* IdBoundary+ '-'?) | IdBoundary '-'?;

terminal EscapedText: /("[^"]*"|'[^']*')/;
terminal Newline: /\n/;

The current behavior

Error: Errors detected in definition of Lexer:
Token: ->TitleText<- can never be matched.
Because it appears AFTER the Token Type ->ID<-in the lexer's definition.
See https://chevrotain.io/docs/guide/resolving_lexer_errors.html#UNREACHABLE

The expected behavior

It compiles

msujew commented 3 hours ago

We extended what constitutes as whitespace in https://github.com/eclipse-langium/langium/pull/1589. It's likely that this leads to an automatic reordering of ID to the "top" of the lexer, which then shadows the Title rule. This is done as a performance optimization, as whitespace-related tokens should be attempted first by the lexer.

I'd be inclined to say that the way you're defining IdBoundary is a bit of an anti-pattern. While it's something that we can likely improve on in the lexer, it matches a bunch of stuff you probably don't want to be matched there. Something like the zero-width space (which is why the token builder identifies this as a "whitespace token").

Note that you can create a workaround by overriding this part of the token builder:

https://github.com/eclipse-langium/langium/blob/335f9a99b1568829cd1c0f5d3bbe33ca754daf90/packages/langium/src/parser/token-builder.ts#L56-L63

stutrek commented 3 hours ago

I see, that makes a lot of sense. Is the antipattern listing characters to ignore rather than characters to match? This particular situation is challenging because this rule is used where people want spaces and punctuation. Additionally, many of our users use non-Roman alphabets. If you have any suggestions, I would like to improve it.

For now, I added some optional whitespace to the title rule so it also gets prepended, but it doesn't fix my out of memory issue :(. I'll make a minimal case for that.

stutrek commented 3 hours ago

Update: I made it so our IDs have to start with a non-whitespace character using your list of whitespace chars and it compiles as expected.