Closed Lotes closed 2 years ago
I think this is a bug. But I also have a custom token builder (inheriting from default). But there the build-Keyword-methods are not overwritten.
This isn't a bug, but by design. buildKeywordToken
is only called for keywords, however in your case you build a terminal rule which only contains a keyword. Given that we use the chevrotain lexer, we need to pass the longer_alt
into our tokens. Identifying whether a terminal rule is a longer alternative for a keyword is fairly trivial. However, doing the same for terminal rules and other terminal rules isn't as easy and I don't think there's a real 100% solution to this issue. So I'd call this feature out of scope for Langium.
This issue can be easily solved by just defining import
as a normal keyword. Note that we transform all terminal rules into regex, even those who are just a keyword.
So, a terminal IMPORT is converted to a regex and my identifier too, called ID. I saw that keywords K are checked against all terminals t \in T (non-keywords) where {ID, IMPORT} \subsetof T. The check is ˋisPrefix(K, t)ˋ for any t… But actually we need some check for all pairs of tokens whether the first token of this pair is prefix of the second token. If they are all regexes, we could build an automaton to find these relations. Creating such automatons is a bit effort, but I once did it within CSharp. Maybe there is already some implementation. From my computer science study times I know that this problem is solvable.
From my computer science study times I know that this problem is solvable.
Are you talking about intersection automata? While we can prove that two automata do have a common subset of words they can parse, I'm not sure whether this already satisfies the prefix condition. For example, /import/
and /import/
parse the same words, but neither is a prefix of another one. Or is there another well-known solution for this?
Nevertheless, this is still probably solvable. One has to identify whether an accepting state of one automaton is a non-accepting state of another automaton, without this property being symmetrical (I believe this would violate the rules of longer_alt
). Maybe intersection automata can help with this. However, the solution will have to be quite fast for that, as we need to perform this calculation on every language-server startup. I believe that the effort is not worth it, given that it is fairly trivial to add the longer_alt
property manually to the TokenBuilder
and the problem is quite rare (so far no one has complained about that part of the lexer since Langium has existed).
Yes, automata magic (thx for the right plural xD). You have two automata. Walk every route of the potential prefix automaton that leads to an accepting state. In parallel you go the same route in the other automaton. If you end up in a state that somewhen can reach an accepting state you have a prefix. Otherwise not.
But you are right. It is easier to add it manually or make the terminal a keyword instead.
Closing the issue, because there are solutions.
Langium 5.0:
I noticed the following: I am trying to recognize ANTLR4 grammars. Like the WREN.G4 grammar. Here are the critical lines of the ANTLR grammar:
Now the
ANTLR4.langium
has a set of keywords, defined as terminals...So, the actual problem is that
importModule
is recognized as IMPORT token and not as LOWER_CASE_ID. I was able to solve this using the following code instead:\b
is a word break within RegExp.I think this is a bug. But I also have a custom token builder (inheriting from default). But there the build-Keyword-methods are not overwritten.