qzmfranklin commented 5 years ago

Hey guys,

I love how chevrotain is structured. However, before I can start deploying it to many projects that our company is working on, I'd like to query the intention of the community on the following two features:

Add a streaming lexer API.
Add incremental parser support.

Potential impact on this project:

Greatly extend reach of this project to many use cases where analysis needs to happen while the code is being edited, e.g., in an editor like VSCode.
Allow existing users to option to optimize their existing code for performance using the new API without having to completely rewrite their code that is already built with chevrotain.

For a decent reference, the TypeScript compiler is completely incremental.

Would like to hear the stance on these two issues.

Thanks, Zhongming

bd82 commented 5 years ago

Hi @qzmfranklin

Streaming Lexer

This is something I'd like to have in Chevrotain. That is because:

it would allow parsing languages such as JavaScript that have ambiguities that are easily resolved by taking into account the parser context
- https://stackoverflow.com/questions/5519596/when-parsing-javascript-what-determines-the-meaning-of-a-slash
Additionally, getting rid of the token vector would likely provide a nice performance boost.

I have investigated this in the past and even had a working streaming Chevrotain JSON parser. However I am not sure if all the existing features of Chevrotain could be ported to streaming lexer semantics, so I decided to initially only provide the APIs to allow end users to use custom lexers (which could be streaming).

Attempting to implement that has caused some performance progressions which is why it has halted. I think I now have a new idea how to implement this without suffering performance regressions, so hopefully wiring in (custom) streaming lexers would be possible in the future. But I am not sure if the basic streaming lexer functionality would ever be part of the library itself, it is just too early to tell.

See: #528

bd82 commented 5 years ago

Incremental Parsing (and Lexing)

The building blocks for this functionality already exist.

Lexing

Given a text we produce a Token vector.
The text passed to the tokenizer does not have to be the whole text.
Given a change in a range of the original text we could:
- Re-tokenize only the changed range.
  - Note we will need the changed range + some characters to avoid slicing an existing token in the middle, for example: "changed range + characters until nearest whitespace)
- Update the position information of the tokens in the range and after the range.

So an incremental Lexer could be implemented using the existing building block. Perhaps the existing Lexer APIs could be upgraded to receive starting position values So the newly partial lexed result would be produced with the correct position information without the need for additional modification.

As before I'm not sure this can be generalized enough to be provided as an official feature. As some Chevortain lexers may have more state, for example those using Lexer Modes. So such logic may best be left at the hands of the end user.

But is this really needed?

On my 3 years old laptop Lexing JSON is ~3 million lines per second. Lexing the much more complex CSS grammar is a ~750,000 lines per second.

So assuming a really large file of 10,000 lines in an editor. we are talking about trying to optimize a scenario (for humans) that only takes 3-13ms.

Would a human even notice such a thing?
What if we consider a more reasonable file size of 1,000 line, now we are talking about 0.3-1.3ms...

Partial Parsing

I will add info on this later.

bd82 commented 5 years ago

Incremental Parsing

This is similar to incremental lexing:

The low level building blocks for incremental parsing already exist:
- Each rule can be invoked directly. (no single top level rule).
- Each rule will create a Concrete Syntax Tree (CST).
- The CST can be manipulated by the end user, e.g remove subtrees and replace with different sub trees.
The (human noticeable) performance benefits in the common IDE scenarios may be negligible.

Next Steps

I think that the first thing to do is to try to create an incremental lexer/parser using Chevrotain where the incremental logic would be external to Chevrotain (using the building blocks described above). Then it would be possible to evaluate the actual performance benefits in real world scenarios. And also evaluate how generic these algorithms would be and how orthogonal to other Chevroain capabilities.

Such an incremental parser POC could be used as an "advanced example", may lead to API changes to better facilitate the scenario, or if proven good/generic enough have its logic integrated into chevrotain.

I will open an issue for such an example, But I am not sure when I will get around to investigating it. PR are always welcome if you wish to try playing with these concepts. 😄

bd82 commented 5 years ago

845

qzmfranklin commented 5 years ago

@bd82 Great replies!

Thanks for the information on how the building blocks for implementing incremental lexers/parsers are already there within chevrotain.

Regarding your question on the usefulness of incremental parsing:

So assuming a really large file of 10,000 lines in an editor. we are talking about trying to optimize a scenario (for humans) that only takes 3-13ms.

Would a human even notice such a thing? What if we consider a more reasonable file size of 1,000 line, now we are talking about 0.3-1.3ms...

Some of the most prominent use cases I had in mind were:

Handle more complex programming languages. The performance examples you brought up, namely JSON and CSS, are very simple programming languages to lex and parse. Many others are much more complex and slow to parse. This could very easily include C++, Java, Python, Javascript/TypeScript etc.. Until now, I have not found a Chevrotain based lexer/parser for any such complicated programming language. Unless a reference implementation is found, I would like to reserve my opinion (mostly guesswork) on how fast a Chevrotain based lexer/parser could be for those more complex programming languages.
Fast language services for existing and new programming languages. It has been my belief - a mixture of intuition and guesswork - that a lot of functionalities of language services rely on incrementally parsing the code while the user is editing to function without user noticeable latency. I have not researched this field, could be wrong. The way I have experienced with many exsiting IDE tools, including those with IDEA, Eclipse, Vim, VSCode, etc., could easily become sluggish when many source files are open and the one I am editing is very long (>5k lines).

Regarding your explanation on how incremental parsing (possibly) could not be ported for all features of chevrotain, including performance regression:

I'd like to query more details about the test you did on incremental parser: How much performance regression are we talking about? And for exactly what test set up? I am asking for incremental lexing/parsing, not partial lexing/parsing. Did that test do both or just incremental? Maybe something I could try?
I believe that people focusing on being able to incrementally lex/parse code texts do NOT necessarily feel extremely strongly about having every feature chevrotain has. A basic feature set, possibly including a decent mixture of allowing lexing modes, identifier resolution, LL(k) parsing should be good enough for many. Is it possible to define and port such a 'basic feature set' to be incremental? My guess - a little bit more informed than the other few - is yes.

bd82 commented 5 years ago

Hello again @qzmfranklin

Handle more complex programming languages.

Have a look at this ECMA5 grammar: https://github.com/SAP/chevrotain/blob/master/examples/grammars/ecma5/ecma5_parser.js

It is a parser only, as ECMAScript cannot be lexed without being parsed (easily).

Also I suggest you checkout the chevrotain repo and inspect this folder for the performance benchmark I use when developing (to avoid regressions). https://github.com/SAP/chevrotain/tree/master/benchmark_web

I got around 180K LOC per second for ECMA5 (no lexing, parsing only and creating a CST). (~3K LOC sample, 60 OPs/second).

I believe you could get to situations where the benefits of an incremental parser would be noticeable, but that would require a combination of slow JS runtime (not V8) quite large files, un-optimized grammars, slow CPU...

I may have some time to attempt and bring a Java Grammar of Chevrotain to production quality

https://github.com/thorbenvh8/chevrotain-java That may prove another interesting performance scenario.

Fast language services for existing and new programming languages.

Chevrotain was originally created to build a parser for both a compiler and a language service for a single language. So I certainly agree that incremental parsing is a better general solution than relying on users having sanely sized files and decent hardware 😄.

Regarding your explanation on how incremental parsing (possibly) could not be ported for all features of chevrotain, including performance regression:

I had regressions with streaming parser (which I think I can solve now) I never implemented more than a mini demo for incremental parsers, so I don't know what kind of regressions (if any) there would be.

I believe that people focusing on being able to incrementally lex/parse code texts do NOT necessarily feel extremely strongly about having every feature chevrotain has

I think it would be possible to get some incremental capabilities from Chevrotain with such a basic feature set, incremental lexing for example seems fairly trivial. and for parsing one may not need the most generic algorithm, e.g in ECMAScript re-parsing the most closely containing function may suffice for 95% of use cases.

That is why I think that creating a POC that implements incremental parsing (e.g using ECMA5 grammar?). as an external user is the correct next step, we could use such a POC to evaluate the performance benefits (in real world cases) and consider if any official APIs need be modified to assist.

The more general concept of built in support for incremental parsing in Chevrotain would require much deeper investigation into relevant literature and evaluation if such a feature is orthogonal to the existing feature set.

Its is a great feature to have, I'm just not sure I will have the time to deeply investigate it in the near future, But I will get around to it at some point 😄

qzmfranklin commented 5 years ago

@bd82 Truly glad to have all my questions thoroughly! Thank you sincerely!

I am leading a small team creating a new Python-like DSL. For our use, we need to create various language service features, including but not limited to auto-completion, contextual suggestion, snippets, etc.. The information you brought up in this thread is super valuable. We are short on time too, so probably won't spend time on implementing the incremental ECMAScript parser. But inspecting incremental parsing for our own DSL is particularly relevant. Will keep this thread posted if and when we get there.

Again, thanks for your help. Truly helpful.

bd82 commented 5 years ago

so probably won't spend time on implementing the incremental ECMAScript parser. But inspecting incremental parsing for our own DSL is particularly relevant.

That makes sense, I am not sure of the context/constraints (IP/Legal) you are working in, but if you reach positive results, contributing back even a tiny example of an incremental parser (e.g incremental JSON parser example), would be greatly appreciated.

Also note that because Chevrotain has no code generation phase (the engine is in the super class of your parser), You can easily override internal APIs implementations, for example the Java Parser linked above did such an override to collect comments on the CST... So you are not limited to external APIs when creating your modifications.

Good Luck.

(reopen this if/when needed).

Chevrotain / chevrotain

Add support for streaming Lexer API and incremental parser API #843

Streaming Lexer

Incremental Parsing (and Lexing)

Lexing

But is this really needed?

Partial Parsing

Incremental Parsing

Next Steps

845