Chevrotain / chevrotain

Parser Building Toolkit for JavaScript
https://chevrotain.io
Apache License 2.0

AT_LEAST_ONE has same automatic lookahead limitations as MANY, OR, OPTION, etc? #930

Closed: dberlin closed this issue 5 years ago

dberlin commented 5 years ago

MANY, OR, OPTION, etc all say "using the short form is recommended as it will compute the lookahead function automatically. however this currently has one limitation: It only works if the lookahead for the grammar is one."

As far as I can tell, this applies to AT_LEAST_ONE as well, but it is undocumented :)
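
For context, here is a minimal sketch of the short form being discussed; the token names and rule are made up for illustration, and it assumes a recent Chevrotain version with CstParser:

    import { createToken, CstParser } from "chevrotain";

    const A = createToken({ name: "A", pattern: /a/ });
    const B = createToken({ name: "B", pattern: /b/ });
    const allTokens = [A, B];

    class ShortFormParser extends CstParser {
      constructor() {
        super(allTokens);
        this.performSelfAnalysis();
      }

      // Short form: no explicit lookahead function is supplied,
      // so Chevrotain computes the lookahead automatically.
      public items = this.RULE("items", () => {
        this.AT_LEAST_ONE(() => {
          this.OR([
            { ALT: () => this.CONSUME(A) },
            { ALT: () => this.CONSUME(B) },
          ]);
        });
      });
    }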

(This bit me today converting an LL(2) ANTLR grammar)

If this is not supposed to be the case, let me know and I'll attach a test grammar and test cases.

bd82 commented 5 years ago

"using the short form is recommended as it will compute the lookahead function automatically. however this currently has one limitation: It only works if the lookahead for the grammar is one."

Where is this written?

If this is not supposed to be the case, let me know and I'll attach a test grammar and test cases.

LL(k) is supported but only within the context of a rule. See: #623

Please attach a small example reproducing the problem.

dberlin commented 5 years ago

Where is this written?

Huh, it looks like Google is only indexing the older versions of the docs well. If you search Google for this phrase you'll see where I pulled it from. I did not notice it was a very old version of the docs. My apologies!

Please attach a small example reproducing the problem.

Here is a grammar, its ANTLR duplicate, and a small test file. The relevant piece is the Stage rule.

If you remove either of the GATE functions in the Stage rule it will fail to parse AAA.src. I believe 2 tokens of lookahead should suffice to resolve this part of the grammar unambiguously without regard to the current rule stack. But it has also been a long while since I was in the parser tool game :).

ANTLR also does fine in SLL-only prediction mode on all of my much larger test cases, and tracking its decision-making shows it believes it only needs 2 tokens as well for this part of the grammar. The initial lookahead set it computes for this rule is the one I've implemented in these gate functions. However, I know it does various forms of grammar transforms, etc., so I'm open to the idea it's me and not Chevrotain.

That said, debugging the grammar shows that Chevrotain definitely appears to be using only one token of lookahead in both cases if you remove the gates.
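
To illustrate, here is a minimal sketch of the gating workaround being described, assuming a stage rule shaped like the one below. The token definitions, the simplified grammar (declarations only), and the exact gate condition are my guesses for illustration, not the contents of the attached grammar:

    import { createToken, CstParser } from "chevrotain";

    const IS = createToken({ name: "IS", pattern: /IS/ });
    const Identifier = createToken({ name: "Identifier", pattern: /[A-Za-z]\w*/ });
    const allTokens = [IS, Identifier];

    class StageParser extends CstParser {
      constructor() {
        super(allTokens);
        this.performSelfAnalysis();
      }

      // stage: Identifier? declaration+
      public stage = this.RULE("stage", () => {
        this.OPTION({
          // Gate using two tokens of lookahead: only treat the leading
          // Identifier as the stage name when the token after it is not
          // IS (otherwise that Identifier begins a declaration).
          GATE: () => this.LA(2).tokenType !== IS,
          DEF: () => this.CONSUME(Identifier),
        });
        this.AT_LEAST_ONE(() => this.SUBRULE(this.declaration));
      });

      // declaration: Identifier IS Identifier
      public declaration = this.RULE("declaration", () => {
        this.CONSUME(Identifier);
        this.CONSUME(IS);
        this.CONSUME2(Identifier); // stand-in for mathExpression
      });
    }

Without the gate, the OPTION's automatic LL(1) decision sees an Identifier in "BadMsgStage IS STG94" and consumes it as the stage name, after which the declaration cannot parse.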

parser.zip

bd82 commented 5 years ago

Given this grammar:

stage: Identifier? (declaration | ifStatement)+;
declaration: Identifier IS mathExpression;
ifStatement:
    IF truthExpression THEN actionExpression (
        ',' actionExpression
    )*;

and this input:

BadMsgStage IS STG94

Do you expect the first optional Identifier to be skipped so the declaration could be parsed successfully? What if the input was the following?

BadMsgStage foo IS STG94

I will try peeking at this more tomorrow (2nd gate). It would speed things up if you can create an executable example that I can easily run and debug. Yours is missing the package.json and all the required dependencies and build flows, due to using TypeScript.

dberlin commented 5 years ago


Given this grammar:

stage: Identifier? (declaration | ifStatement)+;
declaration: Identifier IS mathExpression;
ifStatement:
    IF truthExpression THEN actionExpression (
        ',' actionExpression
    )*;

and this input:

BadMsgStage IS STG94

Do you expect the first optional Identifier to be skipped so the declaration could be parsed successfully?

Yes. If it helps, this grammar is just a common prefix factoring of the following grammar.

program: stage | statements;
stage: identifier (declaration | ifStatement)+;
statements: (declaration | ifStatement)+;

The following would also be fine:

   program: stage | statements;
   stage: identifier statement+;
   statements: (declaration | ifStatement)+;

What if the input was the following?

BadMsgStage foo IS STG94

The compiler for this language does what the ANTLR version does: it ignores whitespace and comments, so it considers this a stage named "BadMsgStage" followed by a declaration named foo.

ANTLR's tree in this case is also that: (plcProgram (stage BadMsgStage (declaration foo IS (singleExpression STG94))) )

I will try peeking at this more tomorrow (2nd gate). It would speed things up if you can create an executable example that I can easily run and debug.

I'll give cutting it down a try in the next few days. I will remove everything that is not relevant to this grammar, etc.

bd82 commented 5 years ago

Let's look at this simplified case:

rule1: A? (rule1 | rule2)+;
rule2: A B;
rule3: C;

Possible Sequences:

The consumption of the optional A terminal is performed independently of the following alternation. There is one edge case, see: https://github.com/SAP/chevrotain/blob/master/packages/chevrotain/src/parse/grammar/lookahead.ts#L87-L99

It seems as though, if the possible tokens after the optional are identical to the sequence of tokens inside the optional, we should look farther ahead to distinguish between the alternatives.
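
In Chevrotain DSL terms, the problematic shape is roughly the following made-up sketch (the recursive rule1 reference is replaced by a plain token, since Chevrotain does not support left recursion):

    import { createToken, CstParser } from "chevrotain";

    const A = createToken({ name: "A", pattern: /a/ });
    const B = createToken({ name: "B", pattern: /b/ });
    const C = createToken({ name: "C", pattern: /c/ });
    const allTokens = [A, B, C];

    class EdgeCaseParser extends CstParser {
      constructor() {
        super(allTokens);
        this.performSelfAnalysis();
      }

      // rule1: A? (rule2 | C)+   where rule2 also begins with A.
      // An LL(1) decision for the OPTION sees an A either way; only the
      // token after it (LA(2)) distinguishes a standalone optional A
      // from the A that starts rule2.
      public rule1 = this.RULE("rule1", () => {
        this.OPTION(() => this.CONSUME(A));
        this.AT_LEAST_ONE(() => {
          this.OR([
            { ALT: () => this.SUBRULE(this.rule2) },
            { ALT: () => this.CONSUME(C) },
          ]);
        });
      });

      // rule2: A B
      public rule2 = this.RULE("rule2", () => {
        this.CONSUME(A);
        this.CONSUME(B);
      });
    }

On input "a b" the OPTION greedily consumes the A, and the following repetition then fails on B, mirroring the failure described above.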

Questions:

dberlin commented 5 years ago

Right. ANTLR et al. do more lookahead here.

I have actually backed off this for now, for a different reason: it turns out antlr4ts is much faster than Chevrotain. I never tried the JavaScript target (and I am not claiming there is anything wrong with your benchmarking; in fact, it looks quite sane!).

I haven't tried it on the JSON grammar, but on my grammars, it's about 5-10x faster.

One note if you decide to benchmark it: ANTLR4TS (which is based on the optimized ANTLR4 runtime, not the reference one) does lazy prediction-cache initialization, so the first parse will be much slower. On the plus side, the prediction cache is globally shared between new instances, so you don't have to do anything else.

To put it simply, if you do this:

parser = new Parser(); parser.program()   <- this first parse will be slow
parser = new Parser(); parser.program()   <- this one will be about 3x faster
parser = new Parser(); parser.program()   <- this one and all subsequent will be about 5-10x faster
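
A runnable sketch of that warm-up measurement with antlr4ts; PlcLexer, PlcParser, and the program entry rule are hypothetical generated names standing in for the real grammar:

    import { CharStreams, CommonTokenStream } from "antlr4ts";
    // Hypothetical names for the ANTLR-generated lexer/parser:
    import { PlcLexer } from "./PlcLexer";
    import { PlcParser } from "./PlcParser";

    function parseOnce(text: string): void {
      const lexer = new PlcLexer(CharStreams.fromString(text));
      const parser = new PlcParser(new CommonTokenStream(lexer));
      parser.program(); // hypothetical entry rule
    }

    const sample = "..."; // some source text to parse
    for (let i = 1; i <= 3; i++) {
      console.time(`parse #${i}`);
      // A fresh parser instance each time; the globally shared
      // prediction cache still warms up across calls.
      parseOnce(sample);
      console.timeEnd(`parse #${i}`);
    }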

This is mostly independent of the text handed to it, so if you parse a directory of files, you will see timings like this:

ts-node src/scripts/timeGrammar.ts testfiles

Parsing Centroid-Lathe-Standard-ALLIN1DC-r2.src: 156.575ms
Parsing Centroid-Lathe-Standard-OAK-r2.src: 56.182ms
Parsing Centroid-Mill-Standard-ALLIN1DC-ATC-Umbrella-Skip-First-Count.src: 27.102ms
Parsing Centroid-Mill-Standard-ALLIN1DC-ATC-Umbrella.src: 19.889ms
Parsing Centroid-Mill-Standard-ALLIN1DC-BP-Boss-r2.src: 27.933ms
Parsing Centroid-Mill-Standard-ALLIN1DC-r2.src: 26.837ms
Parsing Centroid-Mill-Standard-OAK-ACDC-ATC-Umbrella.src: 29.179ms

(These files are all the same size and rough structure)

The identical Chevrotain grammar takes about 100-120ms per file no matter what.

bd82 commented 5 years ago

does lazy prediction cache initialization. So the first parse will be much slower. On the plus side, the prediction cache is globally shared between new instances so you don't have to do anything else.

There used to be a global cache for computations shared between parser instances, but that caused all sorts of potential cache-validity bugs and was removed in favor of an instance-level cache. The recommended approach with Chevrotain is to re-use the same parser instance.
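
A minimal sketch of that recommended reuse pattern (the grammar details are made up): build the lexer and parser once, then feed each file's tokens to the same instance by assigning parser.input, which resets the parser's state:

    import { createToken, Lexer, CstParser } from "chevrotain";

    const WS = createToken({ name: "WS", pattern: /\s+/, group: Lexer.SKIPPED });
    const Identifier = createToken({ name: "Identifier", pattern: /[A-Za-z]\w*/ });
    const allTokens = [WS, Identifier];
    const lexer = new Lexer(allTokens);

    class ReusedParser extends CstParser {
      constructor() {
        super(allTokens);
        this.performSelfAnalysis();
      }
      public idList = this.RULE("idList", () => {
        this.AT_LEAST_ONE(() => this.CONSUME(Identifier));
      });
    }

    const parser = new ReusedParser(); // one shared instance

    function parseFile(text: string) {
      const lexResult = lexer.tokenize(text);
      parser.input = lexResult.tokens; // assigning input resets parser state
      return parser.idList();
    }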

Are you creating a new Chevrotain parser instance for each file? Can you share this benchmark in a way I can reproduce it?

dberlin commented 5 years ago

I was reusing a single parser as per the performance guide. You can get the code I'm testing/using here: https://github.com/dberlin/vscode-centroid-plc

If you npm install it, run "npm run test-compile", and then run ts-node src/scripts/timeGrammar.ts testfiles/, you can reproduce the timings. I removed the Chevrotain grammar; I can find the rev that has it for you, and it should be trivial to plug it back in. I later modified the timing script to not reuse instances, since ANTLR didn't care.

I also slightly reworked the ANTLR grammar to not require the optional consume, but that was just to make the CST cleaner, it did not affect performance.

bd82 commented 5 years ago

OK, I've had a look at your source code but could not get everything working properly in the time I allocated. I would happily look into this, but I would need an easy-to-reproduce benchmark with both parsers to debug.

I've created a small benchmark comparing the CST creation speed for a JSON grammar and a 1,000-line JSON sample in the antlr4ts branch:

- https://github.com/SAP/chevrotain/blob/antlr4ts/examples/grammars/json/benchmark.js

  • You can run it by installing the npm dependencies in the parent folder (examples/grammars)
  • and then executing the benchmark.js file.
  • The generated ANTLR source code is committed to the repository for ease of reproduction.
  • The relevant scripts for code generation and tsc compilation are in the examples/grammars/package.json scripts section.

These are my results on node V11.11.0

Small Json Sample
antlr4TS-SLL x 380 ops/sec ±7.36% (88 runs sampled)
antlr4TS-LL x 393 ops/sec ±1.19% (89 runs sampled)
chevrotain-not fully optimized x 1,272 ops/sec ±2.56% (89 runs sampled)
Fastest is chevrotain-not fully optimized

It would be interesting to examine your scenario, but I need a simple-to-reproduce use case similar to the one I provided here for the JSON grammar. The fact that you are not seeing a speed-up in your case when the parser "warms up" on multiple files is very interesting and requires debugging the scenario.

dberlin commented 5 years ago

I'm happy to try to give you a benchmark.

FWIW: In case you are interested, I just added incremental parsing support to ANTLR4TS - https://github.com/dberlin/antlr4ts/tree/incremental

It should be pretty easy to make this work in Chevrotain if you have any interest. IncrementalParsing.md in that tree explains how it works, and it should be fairly straightforward to follow along. (My use case for it is unusual: parsing very large gcode files that people are making small changes to. These are usually somewhere between 10 and 150 MB, so even the fastest parsing runtimes take a few seconds minimum per parse.)


dberlin commented 5 years ago

https://github.com/tunnelvisionlabs/antlr4ts/compare/master...dberlin:incremental may be an easier way to view the changes.


bd82 commented 5 years ago

I've linked your document in the incremental parsing issue: https://github.com/SAP/chevrotain/issues/844

and yes this is very interesting 😄

bd82 commented 5 years ago

I'm happy to try to give you a benchmark.

If you can provide something I can easily reproduce and debug, it would be interesting 😄

bd82 commented 5 years ago

I will close this for now. If you create a performance benchmark to debug, I will investigate it; re-open this issue or create a new one if/when a benchmark is available.

Regarding the original lookahead question: that is probably part of #623.