Chevrotain / chevrotain

Parser Building Toolkit for JavaScript
https://chevrotain.io
Apache License 2.0
2.46k stars 201 forks source link

Add support for streaming Lexer API and incremental parser API #843

Closed qzmfranklin closed 5 years ago

qzmfranklin commented 5 years ago

Hey guys,

I love how chevrotain is structured. However, before I can start deploying it to many projects that our company is working on, I'd like to query the intention of the community on the following two features:

Potential impact on this project:

For a decent reference, the TypeScript compiler is completely incremental.

Would like to hear the stance on these two issues.

Thanks, Zhongming

bd82 commented 5 years ago

Hi @qzmfranklin

Streaming Lexer

This is something I'd like to have in Chevrotain. That is because:

I have investigated this in the past and even had a working streaming Chevrotain JSON parser. However I am not sure if all the existing features of Chevrotain could be ported to streaming lexer semantics, so I decided to initially only provide the APIs to allow end users to use custom lexers (which could be streaming).

Attempting to implement that has caused some performance progressions which is why it has halted. I think I now have a new idea how to implement this without suffering performance regressions, so hopefully wiring in (custom) streaming lexers would be possible in the future. But I am not sure if the basic streaming lexer functionality would ever be part of the library itself, it is just too early to tell.

bd82 commented 5 years ago

Incremental Parsing (and Lexing)

The building blocks for this functionality already exist.

Lexing

So an incremental Lexer could be implemented using the existing building block. Perhaps the existing Lexer APIs could be upgraded to receive starting position values So the newly partial lexed result would be produced with the correct position information without the need for additional modification.

As before I'm not sure this can be generalized enough to be provided as an official feature. As some Chevortain lexers may have more state, for example those using Lexer Modes. So such logic may best be left at the hands of the end user.

But is this really needed?

On my 3 years old laptop Lexing JSON is ~3 million lines per second. Lexing the much more complex CSS grammar is a ~750,000 lines per second.

So assuming a really large file of 10,000 lines in an editor. we are talking about trying to optimize a scenario (for humans) that only takes 3-13ms.

Partial Parsing

I will add info on this later.

bd82 commented 5 years ago

Incremental Parsing

This is similar to incremental lexing:

Next Steps

I think that the first thing to do is to try to create an incremental lexer/parser using Chevrotain where the incremental logic would be external to Chevrotain (using the building blocks described above). Then it would be possible to evaluate the actual performance benefits in real world scenarios. And also evaluate how generic these algorithms would be and how orthogonal to other Chevroain capabilities.

Such an incremental parser POC could be used as an "advanced example", may lead to API changes to better facilitate the scenario, or if proven good/generic enough have its logic integrated into chevrotain.

I will open an issue for such an example, But I am not sure when I will get around to investigating it. PR are always welcome if you wish to try playing with these concepts. 😄

bd82 commented 5 years ago

845

qzmfranklin commented 5 years ago

@bd82 Great replies!

Thanks for the information on how the building blocks for implementing incremental lexers/parsers are already there within chevrotain.

Regarding your question on the usefulness of incremental parsing:

So assuming a really large file of 10,000 lines in an editor. we are talking about trying to optimize a scenario (for humans) that only takes 3-13ms.

Would a human even notice such a thing? What if we consider a more reasonable file size of 1,000 line, now we are talking about 0.3-1.3ms...

Some of the most prominent use cases I had in mind were:

Regarding your explanation on how incremental parsing (possibly) could not be ported for all features of chevrotain, including performance regression:

bd82 commented 5 years ago

Hello again @qzmfranklin

Handle more complex programming languages.

Have a look at this ECMA5 grammar: https://github.com/SAP/chevrotain/blob/master/examples/grammars/ecma5/ecma5_parser.js

Also I suggest you checkout the chevrotain repo and inspect this folder for the performance benchmark I use when developing (to avoid regressions). https://github.com/SAP/chevrotain/tree/master/benchmark_web

I got around 180K LOC per second for ECMA5 (no lexing, parsing only and creating a CST). (~3K LOC sample, 60 OPs/second).

I believe you could get to situations where the benefits of an incremental parser would be noticeable, but that would require a combination of slow JS runtime (not V8) quite large files, un-optimized grammars, slow CPU...

I may have some time to attempt and bring a Java Grammar of Chevrotain to production quality

Fast language services for existing and new programming languages.

Chevrotain was originally created to build a parser for both a compiler and a language service for a single language. So I certainly agree that incremental parsing is a better general solution than relying on users having sanely sized files and decent hardware 😄.

Regarding your explanation on how incremental parsing (possibly) could not be ported for all features of chevrotain, including performance regression:

I had regressions with streaming parser (which I think I can solve now) I never implemented more than a mini demo for incremental parsers, so I don't know what kind of regressions (if any) there would be.

I believe that people focusing on being able to incrementally lex/parse code texts do NOT necessarily feel extremely strongly about having every feature chevrotain has

I think it would be possible to get some incremental capabilities from Chevrotain with such a basic feature set, incremental lexing for example seems fairly trivial. and for parsing one may not need the most generic algorithm, e.g in ECMAScript re-parsing the most closely containing function may suffice for 95% of use cases.

That is why I think that creating a POC that implements incremental parsing (e.g using ECMA5 grammar?). as an external user is the correct next step, we could use such a POC to evaluate the performance benefits (in real world cases) and consider if any official APIs need be modified to assist.

The more general concept of built in support for incremental parsing in Chevrotain would require much deeper investigation into relevant literature and evaluation if such a feature is orthogonal to the existing feature set.

Its is a great feature to have, I'm just not sure I will have the time to deeply investigate it in the near future, But I will get around to it at some point 😄

qzmfranklin commented 5 years ago

@bd82 Truly glad to have all my questions thoroughly! Thank you sincerely!

I am leading a small team creating a new Python-like DSL. For our use, we need to create various language service features, including but not limited to auto-completion, contextual suggestion, snippets, etc.. The information you brought up in this thread is super valuable. We are short on time too, so probably won't spend time on implementing the incremental ECMAScript parser. But inspecting incremental parsing for our own DSL is particularly relevant. Will keep this thread posted if and when we get there.

Again, thanks for your help. Truly helpful.

bd82 commented 5 years ago

so probably won't spend time on implementing the incremental ECMAScript parser. But inspecting incremental parsing for our own DSL is particularly relevant.

That makes sense, I am not sure of the context/constraints (IP/Legal) you are working in, but if you reach positive results, contributing back even a tiny example of an incremental parser (e.g incremental JSON parser example), would be greatly appreciated.

Also note that because Chevrotain has no code generation phase (the engine is in the super class of your parser), You can easily override internal APIs implementations, for example the Java Parser linked above did such an override to collect comments on the CST... So you are not limited to external APIs when creating your modifications.

Good Luck.

(reopen this if/when needed).