eclipse-langium / langium

Next-gen language engineering / DSL framework
https://langium.org/
MIT License
740 stars 66 forks source link

Tokens: Keyword wins against Identifier, when the keyword is prefix of the actual identifier #742

Closed Lotes closed 2 years ago

Lotes commented 2 years ago

Langium 5.0:

I noticed the following: I am trying to recognize ANTLR4 grammars. Like the WREN.G4 grammar. Here are the critical lines of the ANTLR grammar:

fileAtom
    : classDefinition
    | function
    | importModule
    | statement
    | block
    ;

Now the ANTLR4.langium has a set of keywords, defined as terminals...

terminal IMPORT: 'import';                 //comes before identifiers
terminal UPPER_CASE_ID: UpperCaseId;
terminal LOWER_CASE_ID: LowerCaseId;
//for completeness...
terminal fragment LowerCaseId: LowerNameStartChar NameChar*;
terminal fragment UpperCaseId: UpperNameStartChar NameChar*;
terminal fragment NameChar
   : LowerNameStartChar
   | UpperNameStartChar
   | /[0-9]/
   | /_/
   ;
terminal fragment LowerNameStartChar
   : /[a-z]/
   ;
terminal fragment UpperNameStartChar
   : /[A-Z]/
   ;

So, the actual problem is that importModule is recognized as IMPORT token and not as LOWER_CASE_ID. I was able to solve this using the following code instead:

terminal IMPORT: /import\b/;

\b is a word break within RegExp.

I think this is a bug. But I also have a custom token builder (inheriting from default). But there the build-Keyword-methods are not overwritten.

msujew commented 2 years ago

I think this is a bug. But I also have a custom token builder (inheriting from default). But there the build-Keyword-methods are not overwritten.

This isn't a bug, but by design. buildKeywordToken is only called for keywords, however in your case you build a terminal rule which only contains a keyword. Given that we use the chevrotain lexer, we need to pass the longer_alt into our tokens. Identifying whether a terminal rule is a longer alternative for a keyword is fairly trivial. However, doing the same for terminal rules and other terminal rules isn't as easy and I don't think there's a real 100% solution to this issue. So I'd call this feature out of scope for Langium.

This issue can be easily solved by just defining import as a normal keyword. Note that we transform all terminal rules into regex, even those who are just a keyword.

Lotes commented 2 years ago

So, a terminal IMPORT is converted to a regex and my identifier too, called ID. I saw that keywords K are checked against all terminals t \in T (non-keywords) where {ID, IMPORT} \subsetof T. The check is ˋisPrefix(K, t)ˋ for any t… But actually we need some check for all pairs of tokens whether the first token of this pair is prefix of the second token. If they are all regexes, we could build an automaton to find these relations. Creating such automatons is a bit effort, but I once did it within CSharp. Maybe there is already some implementation. From my computer science study times I know that this problem is solvable.

msujew commented 2 years ago

From my computer science study times I know that this problem is solvable.

Are you talking about intersection automata? While we can prove that two automata do have a common subset of words they can parse, I'm not sure whether this already satisfies the prefix condition. For example, /import/ and /import/ parse the same words, but neither is a prefix of another one. Or is there another well-known solution for this?

Nevertheless, this is still probably solvable. One has to identify whether an accepting state of one automaton is a non-accepting state of another automaton, without this property being symmetrical (I believe this would violate the rules of longer_alt). Maybe intersection automata can help with this. However, the solution will have to be quite fast for that, as we need to perform this calculation on every language-server startup. I believe that the effort is not worth it, given that it is fairly trivial to add the longer_alt property manually to the TokenBuilder and the problem is quite rare (so far no one has complained about that part of the lexer since Langium has existed).

Lotes commented 2 years ago

Yes, automata magic (thx for the right plural xD). You have two automata. Walk every route of the potential prefix automaton that leads to an accepting state. In parallel you go the same route in the other automaton. If you end up in a state that somewhen can reach an accepting state you have a prefix. Otherwise not.

But you are right. It is easier to add it manually or make the terminal a keyword instead.

Lotes commented 2 years ago

Closing the issue, because there are solutions.