antlr / antlr4

ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files.
http://antlr.org
BSD 3-Clause "New" or "Revised" License
17.13k stars, 3.28k forks

Add ability to get all possible next tokens for autocompletion use #1428

Open parrt opened 7 years ago

parrt commented 7 years ago

Bruno LETRESTE points out:

I use your getExpectedTokens after I move to a position in my algorithm. My algorithm checks the tokens and tries to find the real path (in the ATN); it stores the potential terminals it finds along the way (and adds them to the expected tokens).

expr : '(' expr ')'
       | expr '+' expr
       | ID
       ;

For example, in this case, if I have the input ( ID, the expected-token call returns + but it should include ) as well.

parrt commented 7 years ago

Dup of https://github.com/antlr/antlr4/issues/1283

parrt commented 7 years ago

Not a dup, but related to #1283

parrt commented 7 years ago

It turns out this is a feature request, not a bug. ATN.getExpectedTokens() exists for use with error handling in an executing parser with a known, specific context (call stack). In other words, it is not designed to give you the set of all tokens that could follow from any context that is potentially viable (i.e., I don't believe it is the generic global FOLLOW operation). It is always specific to a single context.

This proposed feature is for auto completion. The idea is to do something similar to getLookaheadParseTrees(). Given some input, here ( ID, we want to know what can come next. We don't need to know the ATN position at all, just the input.

@sharwell says: I use a Schrödinger token and take the union of results from getLookaheadTrees (roughly). ... This handles cases like the user typing 'i' (an ID) when they could be in the middle of typing 'if' (a keyword with its own token type). ... Code completion after 'i' changes depending on where you're typing. Is it i or is it i<space>? (The space makes the difference.)

The approach seems to be: rewind the input and inject the special token at the end, or wherever the user wants the expected tokens. Then allow match() to match the special Schrödinger token and record the token it was expecting, making sure not to add an edge on the Schrödinger token to the various decision DFAs. The collection of all tokens matched to the Schrödinger token is the auto-completion set.
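The Schrödinger-token mechanics can be sketched with a toy backtracking parser (plain Java, no ANTLR runtime; all names here are hypothetical, and the no-DFA-edge concern doesn't arise in so simple a setting): match() treats a special caret token as the completion point, recording whatever terminal the parser expected there instead of consuming input.

```java
import java.util.*;

// Toy illustration (hypothetical names, no ANTLR runtime) of the
// Schrödinger-token idea: a special CARET token never matches, but each
// terminal the parser tries to match against it is recorded as a
// completion candidate. Backtracking explores all viable paths, so the
// result is the union over every context the input could have reached.
public class CaretParser {
    public static final String CARET = "<caret>";
    private final List<String> tokens;
    private int pos;
    final Set<String> candidates = new TreeSet<>();

    CaretParser(List<String> tokens) { this.tokens = tokens; }

    private String la() { return pos < tokens.size() ? tokens.get(pos) : "<EOF>"; }

    // match() records the expected terminal when it sees the caret.
    private boolean match(String expected) {
        if (la().equals(CARET)) { candidates.add(expected); return false; }
        if (la().equals(expected)) { pos++; return true; }
        return false;
    }

    // expr : primary ('+' expr)? ;  (left recursion removed for LL parsing)
    void expr() {
        primary();
        int save = pos;
        if (match("+")) expr(); else pos = save;
    }

    // primary : '(' expr ')' | ID ;
    void primary() {
        int save = pos;
        if (match("(")) { expr(); match(")"); return; }
        pos = save;
        match("ID");
    }

    public static Set<String> complete(List<String> input) {
        CaretParser p = new CaretParser(input);
        p.expr();
        return p.candidates;
    }

    public static void main(String[] args) {
        // "( ID <caret>": the parser tries '+' (inside expr) and ')' (closing paren)
        System.out.println(complete(Arrays.asList("(", "ID", CARET))); // prints [), +]
    }
}
```

For the input ( ID this yields exactly the {), +} set Bruno asked for at the top of the issue.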

parrt commented 7 years ago

Some more details. It looks like ParserATNSimulator.getReachableTarget() should be overridden to check for the special token (or perhaps just record the index into the token stream where such a special token would reside, to avoid altering the stream). Override addDFAEdge() to prevent adding edges for the special token. Another way to handle this is simply to add the current input token to our set of possible tokens when we reach a specific index in the stream and then make getReachableTarget() return null; that means we don't have to modify addDFAEdge(). Then we need to override Parser.match() so that we add to the possible-tokens set at the proper index and then throw ParseCancellationException.

It seems to me that either we will get ParseCancellationException, leading to a single possible token, or we will get a set of tokens because the input leads to a decision. The decision-making will collect all of the possible symbols but, since there are no reachable edges (we return null from getReachableTarget() at the relevant index), we get NoViableAltException. Actually, we won't always get that exception. In the following grammar, input A would be augmented to A#, with # as the Schrödinger token. The prediction mechanism would collect {B, <EOF>} but would return a prediction for alternative 2. We need to set a flag so that adaptivePredict() throws an exception if we ever reach the special token anywhere during a prediction.

a : A B
   | A
   ;

Then we just need to use a BailErrorStrategy and a normal parser.
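The a : A B | A situation can be sketched in miniature (hypothetical code, not ANTLR's prediction machinery): matching each alternative against the caret-augmented input and recording the terminal every viable alternative expects at the caret yields the union {B, <EOF>} that a normal prediction would swallow by simply choosing alternative 2.

```java
import java.util.*;

// Toy sketch (hypothetical, no ANTLR) of collecting candidates *during*
// a lookahead decision: for grammar  a : A B | A ;  and input "A <caret>",
// each alternative is walked against the input and the terminal it
// expects at the caret position is recorded.
public class DecisionCandidates {
    public static final String CARET = "<caret>";

    public static Set<String> candidatesAtCaret(List<List<String>> alternatives,
                                                List<String> input) {
        Set<String> out = new TreeSet<>();
        for (List<String> alt : alternatives) {
            int i = 0;
            boolean finished = true;
            for (String expected : alt) {
                String actual = i < input.size() ? input.get(i) : "<EOF>";
                if (actual.equals(CARET)) { out.add(expected); finished = false; break; }
                if (!actual.equals(expected)) { finished = false; break; }
                i++;
            }
            // Alternative fully consumed: if the caret sits right after it,
            // end of rule (here <EOF>) is a candidate too.
            if (finished && i < input.size() && input.get(i).equals(CARET)) out.add("<EOF>");
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> alts = Arrays.asList(
            Arrays.asList("A", "B"),   // a : A B
            Arrays.asList("A"));       //   | A
        System.out.println(candidatesAtCaret(alts, Arrays.asList("A", CARET))); // prints [<EOF>, B]
    }
}
```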

OTOH, it might be better to use the ParserInterpreter so that this can be done without having to generate and compile a parser; that is what one might need in order to use it within a plugin or some other environment. It is more complicated because the interpreter has a big switch statement that we would have to enhance to provide a hook:

case Transition.ATOM:
    match(((AtomTransition)transition).label);
    break;
case Transition.RANGE:
case Transition.SET:
case Transition.NOT_SET:
    if (!transition.matches(_input.LA(1), Token.MIN_USER_TOKEN_TYPE, 65535)) {
        recoverInline();
    }
    matchWildcard();
    break;

@sharwell just pointed out we need the equivalent of -Xforce-atn if we use the generated parser.

mike-lischke commented 7 years ago

What surprises me about this and other similar solutions is that they always target only the next possible tokens, which is only partially useful (mostly for keywords). What do you do with the information that your ID rule is a possible candidate? It doesn't give you any means to determine whether you need a class name, a constant, etc.

I have written a code completion core, which I plan to publish under the GPL within the next few months, that deals with this. It's a single-class solution that uses the ATN to gather the necessary information, together with a rule path that adds context, which can be used to look up symbols in a symbol table.

It could well be that this solution also solves this issue.

parrt commented 7 years ago

@mike-lischke perhaps BSD license? ;)

parrt commented 7 years ago

I thought about this again and need to clarify what getExpectedTokens is and is not capable of. First, it should be able to get {), +} for the above grammar.

Given a specific state and a call stack, getExpectedTokens should most definitely be able to find all possible tokens that can follow.

What it is not meant to do is answer a different question:

Given a specific partial input phrase, return the set of all tokens that can follow the last token in the input phrase.

The big difference here is that the input could land us right in the middle of a lookahead decision. We are not specifying an ATN state or a call stack; that information would indicate that the parser had reached a specific location within the grammar. Instead, we are simply specifying an input phrase and letting the parser figure out where it could reach.

The upshot is that this is a bug in the LL(1) analyzer for the above grammar, so I am going to split the bug out from the feature request. Created separate bug: https://github.com/antlr/antlr4/issues/1480
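The second question, expected tokens from a partial input phrase alone, can be sketched without ANTLR by tracking every viable configuration (a stack of grammar symbols still to match) through the input and unioning the terminals the surviving configurations expect next. All names below are hypothetical, and left recursion is removed from the expr grammar so the toy stays simple; for the input ( ID it yields {), +}:

```java
import java.util.*;

// Toy sketch: a nondeterministic stack-machine simulation over
//   expr    : primary etail ;
//   etail   : '+' expr |  /* empty */ ;
//   primary : '(' expr ')' | ID ;
// Every viable stack of symbols-still-to-match is kept; the answer is
// the union of the terminals on top of the surviving stacks.
public class FollowFromInput {
    static final Map<String, List<List<String>>> RULES = Map.of(
        "expr",    List.of(List.of("primary", "etail")),
        "etail",   List.of(List.of("+", "expr"), List.of()),
        "primary", List.of(List.of("(", "expr", ")"), List.of("ID")));

    static boolean isRule(String sym) { return RULES.containsKey(sym); }

    // Expand the top of a configuration until a terminal (or nothing) is on top.
    static void closure(List<String> stack, Set<List<String>> out) {
        if (stack.isEmpty() || !isRule(stack.get(0))) { out.add(stack); return; }
        for (List<String> rhs : RULES.get(stack.get(0))) {
            List<String> next = new ArrayList<>(rhs);
            next.addAll(stack.subList(1, stack.size()));
            closure(next, out);
        }
    }

    public static Set<String> expectedAfter(List<String> input) {
        Set<List<String>> configs = new HashSet<>();
        closure(new ArrayList<>(List.of("expr")), configs);
        for (String tok : input) {
            Set<List<String>> next = new HashSet<>();
            for (List<String> c : configs)
                if (!c.isEmpty() && c.get(0).equals(tok))
                    closure(c.subList(1, c.size()), next);  // consume tok, re-expand
            configs = next;
        }
        Set<String> expected = new TreeSet<>();
        for (List<String> c : configs)
            expected.add(c.isEmpty() ? "<EOF>" : c.get(0));
        return expected;
    }

    public static void main(String[] args) {
        System.out.println(expectedAfter(List.of("(", "ID"))); // prints [), +]
    }
}
```

Note that no ATN state or call stack is supplied anywhere; the set of surviving stacks plays the role of "every context the parser could have reached".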

myieye commented 7 years ago

@parrt Wouldn't an auto-complete tool also want to take the current state into consideration? I.e., isn't the expected behavior of getExpectedTokens exactly what an auto-complete tool would be looking for?

mike-lischke commented 7 years ago

Auto-completion actually needs more than what getExpectedTokens can deliver. That function only returns tokens as defined in the grammar. A real auto-completion feature wants much more than that, namely all object names allowed in a particular context.

ssdimmanuel commented 6 years ago

@parrt @mike-lischke is there an update on this feature?

mike-lischke commented 6 years ago

Ah, good that you reminded me. The C++ implementation has been on GitHub for a few weeks already; however, there's no readme or other documentation in that repo for this code. But I have a port (or rather a parallel development) of this idea written in TypeScript (and there are ports of that code to Kotlin and Java), and that repo also contains some documentation on how to configure and use the engine. Let's discuss there if you have any questions.

robmartin-scibite commented 4 years ago

Is there a planned solution for Java?

eschava commented 4 years ago

Is there a planned solution for Java?

You can check out the https://github.com/ftomassetti/kanvas project. It's implemented in Kotlin, which is very close to Java, and the autocompletion implemented there works great.

mike-lischke commented 4 years ago

@robmartin-scibite What about this code: https://github.com/mike-lischke/antlr4-c3/tree/master/ports/java?

kaby76 commented 4 years ago

I've noticed that Mike Lischke's implementation does not compute the lookahead sets correctly in some situations. In particular, for ANTLR grammars, if I want the lookahead set after a parser rule's semicolon, I get an incomplete set that also includes epsilon but should not. Starting with Mike's implementation and Parr's LL(*) paper, I derived a similar algorithm (http://codinggorilla.com/?p=2449) and have C# implementations for code completion in an LSP server and for error reporting in a .NET Core ANTLR "Hello World" template (https://github.com/kaby76/Antlr4BuildTasks/blob/master/Templates/templates/AntlrCAProject/LASets.cs). Note that the design is recursive and makes a call per edge/token match, so I don't doubt there are stack overflow issues in the construction of a parse. One step at a time.
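One way to sidestep the recursion-depth issue mentioned above is an explicit worklist. A toy sketch (made-up transition graph, not the LASets code) that collects the terminal edge labels reachable through epsilon edges, safely handling cycles and arbitrary depth:

```java
import java.util.*;

// Toy sketch: collect terminal edge labels reachable from a start state
// through epsilon edges, using an explicit worklist instead of
// recursion so deep or cyclic graphs cannot overflow the call stack.
public class LookaheadClosure {
    // edges: state -> list of {label, targetState}; "eps" marks epsilon edges.
    public static Set<String> firstTerminals(Map<Integer, List<String[]>> edges, int start) {
        Set<String> terminals = new TreeSet<>();
        Set<Integer> seen = new HashSet<>();
        Deque<Integer> work = new ArrayDeque<>();
        work.push(start);
        while (!work.isEmpty()) {
            int s = work.pop();
            if (!seen.add(s)) continue;                  // already visited: cycle-safe
            for (String[] e : edges.getOrDefault(s, List.of())) {
                if (e[0].equals("eps")) work.push(Integer.parseInt(e[1]));
                else terminals.add(e[0]);                // terminal edge: record its label
            }
        }
        return terminals;
    }

    public static void main(String[] args) {
        Map<Integer, List<String[]>> g = new HashMap<>();
        g.put(0, List.of(new String[]{"eps", "1"}, new String[]{"eps", "2"}));
        g.put(1, List.of(new String[]{"ID", "3"}));
        g.put(2, List.of(new String[]{"LPAREN", "4"}, new String[]{"eps", "0"})); // cycle back to 0
        System.out.println(firstTerminals(g, 0)); // prints [ID, LPAREN]
    }
}
```

The real algorithms also carry a follow stack per configuration, but the worklist-plus-visited-set shape is the same.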

martinda commented 4 years ago

This is not limited to auto-completion use, as the issue title suggests.

timboudreau commented 4 years ago

Another use case for this, which I am wrestling with at the moment: clearer ambiguity reporting (and possibly analysis and auto-fixing). Specifically: you know there is an ambiguity in some rule and want to tell the user clearly why the ambiguity exists and where; in a complex grammar, it may involve quite a bit of indirection before you hit the actual root cause. A simple solution: collect all paths from all alternatives in the ambiguity to all possible leading tokens that could match each alternative. Prune out anything with only one path and you have your answer: the locations within the code and the starting point for analyzing whether there's a transform possible on the syntax tree that eliminates the ambiguity without altering semantics.
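The pruning step can be sketched in a few lines (hypothetical data, not tied to ANTLR's ambiguity reports): invert the alternative-to-leading-token map and keep only the tokens reachable from more than one alternative.

```java
import java.util.*;

// Toy sketch of the pruning step: given, per alternative of an
// ambiguous decision, the leading tokens it can reach, keep only the
// tokens reachable from more than one alternative; those mark where
// the ambiguity really lives.
public class AmbiguityPaths {
    public static Map<String, List<String>> ambiguousLeads(Map<String, Set<String>> altToLeads) {
        // Invert: leading token -> alternatives that can reach it.
        Map<String, List<String>> byToken = new TreeMap<>();
        altToLeads.forEach((alt, leads) ->
            leads.forEach(t -> byToken.computeIfAbsent(t, k -> new ArrayList<>()).add(alt)));
        byToken.values().removeIf(alts -> alts.size() < 2);  // prune tokens with only one path
        return byToken;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> alts = new TreeMap<>();
        alts.put("alt1", Set.of("ID", "LPAREN"));
        alts.put("alt2", Set.of("ID"));  // ID reachable from both alternatives
        System.out.println(ambiguousLeads(alts)); // prints {ID=[alt1, alt2]}
    }
}
```

In a real grammar the map values would be full paths through the rule indirection rather than bare alternative names, but the prune-by-count idea is the same.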

parrt commented 4 years ago

I have some really cool visualizations of ambiguities in the plug-in I built, so you should check that out for some ideas :)

mike-lischke commented 4 years ago

@mike-lischke, in case you have any interest, here is a patched version of CodeCompletionCore (https://github.com/timboudreau/ANTLR4-Plugins-for-NetBeans/blob/master/antlr-code-completion/src/main/java/org/nemesis/antlr/completion/grammar/CodeCompletionCore.java) with, in my benchmarks (including 1500 warmup passes), around a 7x performance improvement, using bitset-based, primitive-array-plus-binary-search collections (ignore the references to CCLog; it's used in unit tests to compare against the same log from the original to prove I didn't change the logic).

Very interesting. I guess you cannot backport that to TS? At least you could clean up this code and open a PR so that I can merge it into the antlr4-c3 repo.


HK-SHAO commented 11 months ago

I'm looking to start with a grammar, type in characters one by one, and find all the possibilities for the next character that match the grammar; e.g., in JSON a colon must be followed by a double quote or a whitespace character. This would help match grammar-conforming fragments in a large amount of text, and is also very useful for bootstrapping LLM reasoning.