antlr / antlr4

ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files.
http://antlr.org
BSD 3-Clause "New" or "Revised" License
16.97k stars 3.26k forks source link

Incremental parsing using ANTLR4 #1104

Closed nicksuslov closed 7 years ago

nicksuslov commented 8 years ago

Hi We use ANTLR4(in c#) in custom IDE to parse object-oriented language. Results from parser help us create some set of features (code highlighting, code assistance, region outlining etc.). But for future improvements of usability we need to make working ANTLR4 incrementally. For example if user type some code in method body, we don't want to reparse all file (because it can be very large), we want reparse only block that affected after user typing. Good example of incremental parsing is Papa_Carlo, but it use PEG grammars to generate parser. We want to use ANTLR for this.

Is there any way to do this? It's possible?

parrt commented 7 years ago

it would be easy to make the parser do this but not the lexer. sorry.

sharwell commented 7 years ago

@parrt Don't you mean the other way around? I found incremental lexing very easy to implemented in IDE extensions, but for incremental parsing I had to get ... "crafty", and even then only partially supported.

parrt commented 7 years ago

Nah, incremental top-down parsing is trivial with a simple check at the start of each rule function. Incr Lexing in general is much harder as you don't know much you have to relex. I.e., how far back to go. E.g., see https://www.researchgate.net/publication/2377179_Efficient_and_Flexible_Incremental_Parsing

marcospassos commented 7 years ago

@parrt, in our case, our lexer is pretty simple. We don't use modes, just pattern matching. Could you give us some guidance on how implement it (lexer and parser)? It looks like some people are interested in this feature for the same purpose (IDE assistance). Please, share your extensive know-how with us :)

KvanTTT commented 7 years ago

@marcospassos maybe you need for a scannerless parser? See this issue for detail: #814.

parrt commented 7 years ago

There are papers that describe incremental lexing, but honestly I didn't understand all the details. Incremental parsing is straightforward. you put a gate at the start of every parsing method that checks to see if this combination of input positioning i and parsing method has been seen previously. If so, seek the token input position to where this parsing method left off the last time at i. Then just return. The termination condition is something like if the position where I want to parse for real again is in between the start and stop of the current method, then parse for real and don't return. does that help?

dberlin commented 5 years ago

Hopefully as a way to help get people to stop asking you (it looks like it pops up every so often), here is a working version for the ANTLR4TS target:

https://github.com/dberlin/antlr4ts/tree/incremental

The most annoying/expensive part is changing the old token references in the reused portions of the tree (since they contain line/start pos/etc info that is now wrong) . If you only care about the text being correct, you can drop that part.

I'm happy to push the stuff that should be in core into core (the grammar option and guard rule api, really), if there is any real interest.

The included tests test it on the JavaLR grammar, etc. I've run it on fairly complex grammars. That does not mean there are no bugs, but the underlying "theory of how it works" should be sound.

There is an IncrementalParsing.md file that explains in detail what we do (but is just a an expansion of what Terrence explained)

NickStrupat commented 4 years ago

I believe you need live indexes that map byte locations in the file to lexer tokens and parser nodes. Then when you receive from the editor a text document change, you can look up the tokens which straddle the start and end positions of the change, then grab the beginning byte location of the start of the first token and the last of the second token. Now you have the two boundary locations where you can start the lexer again, then the parser. Update the AST and I think that's it (at least for a naive implementation).

dberlin commented 4 years ago

This is not quite enough, unfortunately.

You also need to know the lookahead/lookback used by each token, and take that into account when deciding what "straddles" the changed boundaries.

See the underlying research papers listed here: https://tree-sitter.github.io/tree-sitter/#underlying-research

This requires tracking the lookahead/lookback used on a per-token basis.

If you are willing to sacrifice optimality for simplicity, you can simplify this greatly, and just track the maximum lookahead/lookback used ever during a given lex/parse, and assume everything used the max.

For most grammars, this will be "1" or "2", and so you will relex maybe an extra token.

(The patches i posted on incremental lexing/parsing only track the maximum lookahead/lookback rather than do it on a per-token basis).

On Sat, May 23, 2020 at 11:03 AM Nick Strupat notifications@github.com wrote:

I believe you need live indexes that map byte locations in the file to lexer tokens and parser nodes. Then when you receive from the editor a text document change, you can look up the tokens which straddle the start and end positions of the change, then grab the beginning byte location of the start of the first token and the last of the second token. Now you have the two boundary locations where you can start the lexer again, then the parser. Update the AST and I think that's it (at least for a naive implementation).

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/antlr/antlr4/issues/1104#issuecomment-633107418, or unsubscribe https://github.com/notifications/unsubscribe-auth/AACPI2ZWIDKOSZ5XX4ET66DRTAFWBANCNFSM4B2M2QNQ .

NickStrupat commented 4 years ago

Ah yes, of course. You're right. Lexer needs to know the starting point of the earliest possible byte that could yield a valid token.

NickStrupat commented 4 years ago

EDIT: My apologies, I haven't yet read the research you linked to.

It seems to me that information is stored in the grammar.

What if, for any given "local" token, the lexer engine searches backward... whittling down the possibilities of earliest token, then stopping there before moving forward again with incremental parsing.

dberlin commented 4 years ago

This would be correct that it was stored mostly in the grammar if the lexer/parser did not allow semantic predicates where you can lookahead/etc.

Because those are written in actual code, you can only tell what they do at runtime.

For example: JavaLetter : [a-zA-Z$_] // these are the "java letters" below 0xFF | // covers all characters above 0xFF which are not a surrogate ~[\u0000-\u00FF\uD800-\uDBFF] {Character.isJavaIdentifierStart(_input.LA(-1))}? | // covers UTF-16 surrogate pairs encodings for U+10000 to U+10FFFF [\uD800-\uDBFF] [\uDC00-\uDFFF]

{Character.isJavaIdentifierStart(Character.toCodePoint((char)_input.LA(-2), (char)_input.LA(-1)))}? ;

Note the lookback with the LA(-1) and LA(-2).

This is all pretty common.

Now, that said, as the papers i linked to describe (and as tree sitter, the system that is part of the docs for), you can do a variant of what you are suggesting.

I also explained how you mght try to drive this for ANTLR over on the antlr4ts issues list (all my work started in antlr4ts and has been backported) and the .md files in the patches there.

The original papers are written for LR, it's much easier in LL because it's top-down.

So you can have the text change -> identify highest level rule to restart on -> ask lexer for tokens as you go, and it re-lexes/tells you if anything changed as you ask.

The token streams in ANTLR are not currently on-demand enough to do this (IE the default is not an on-demand randomly accessible token stream, but a prebuilt token stream)

On Tue, May 26, 2020 at 10:34 AM Nick Strupat notifications@github.com wrote:

It seems to me that information is stored in the grammar.

What if, for any given "local" token, the lexer engine searches backward... whittling down the possibilities of earliest token, then stopping there before moving forward again with incremental parsing.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/antlr/antlr4/issues/1104#issuecomment-634169290, or unsubscribe https://github.com/notifications/unsubscribe-auth/AACPI27I57JJ7F3UYNXFLDDRTP4TNANCNFSM4B2M2QNQ .

ericvergnaud commented 4 years ago

Hi, may I suggest that you move this conversation to the google discussion group? Cheers