Improve performance of canonical collection building and CLR to LALR compression

DmitrySoshnikov commented 6 years ago

Currently building of the canonical collection, and compression from CLR(1) to LALR(1) are the longest operation, and for some grammars can be very slow (up to several minutes). We should investigate better compression algorithms here.

time syntax-cli -g ~/grammar.g -d -m lalr1 -o ~/test.py

DEBUG mode is: ON

[DEBUG] Grammar (bnf) is in JS format
[DEBUG] Grammar loaded in: 1.081ms

Parsing mode: LALR(1).
[DEBUG] Building canonical collection: 46531.131ms
[DEBUG] Number of states in the collection: 5311
[DEBUG] Compressing CLR to LALR: 15057.025ms
[DEBUG] Number of states after compression: 299
[DEBUG] Building LR parsing table: 12.715ms
[DEBUG] Generating parser module: 24.013ms

✓ Successfully generated: /Users/dmitrys/test.py

real  1m2.014s
user  1m1.273s
sys 0m1.768s

RobertSzefler commented 6 years ago

I believe memory usage is a problem as well here:

    Command being timed: "syntax-cli -g parser.g -d -m lalr1 -o parser.py"
    User time (seconds): 163.70
    System time (seconds): 0.69
    Percent of CPU this job got: 101%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 2:41.41
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 1507252
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 542240
    Voluntary context switches: 438
    Involuntary context switches: 611
    Swaps: 0
    File system inputs: 0
    File system outputs: 160
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0

DmitrySoshnikov commented 6 years ago

@RobertSzefler, below are the numbers on my machine with your second grammar (before, and after):

Before, LALR by CLR1:

./bin/syntax -g ~/Downloads/parser2.g -m lalr1 -d -o ~/p.py

DEBUG mode is: ON

[DEBUG] Grammar (bnf) is in JS format
[DEBUG] Grammar loaded in: 1.348ms

Parsing mode: LALR(1).
[DEBUG] Building canonical collection: 75418.586ms
[DEBUG] Number of states in the collection: 6173
[DEBUG] Compressing CLR to LALR: 48564.498ms
[DEBUG] Number of states after compression: 323
[DEBUG] Building LR parsing table: 12.895ms
[DEBUG] Generating parser module: 7.739ms

✓ Successfully generated: /Users/dmitrys/p.py

real    2m4.381s
user    2m16.806s
sys 0m2.588s

After, LALR by SLR1:

./bin/syntax -g ~/Downloads/parser2.g -m lalr1_by_slr1 -d -o ~/p.py

DEBUG mode is: ON

[DEBUG] Grammar (bnf) is in JS format
[DEBUG] Grammar loaded in: 1.261ms

Parsing mode: LALR1_BY_SLR(1).
[DEBUG] Building canonical collection: 105.643ms
[DEBUG] Number of states in the collection: 323
[DEBUG] LALR-by-SLR: Building extended grammar for LALR: 100.019ms
[DEBUG] Building Follow sets: 950.017ms
[DEBUG] LALR-by-SLR: Group extended productions by final sets: 30.003ms
[DEBUG] LALR-by-SLR: Updating item reduce sets: 1.243ms
[DEBUG] Building LALR-by-SLR: 1115.972ms
[DEBUG] Building LR parsing table: 13.902ms
[DEBUG] Generating parser module: 6.018ms

✓ Successfully generated: /Users/dmitrys/p.py

real    0m1.467s
user    0m1.408s
sys 0m0.080s

Currently it uses experimental lalr1_by_slr1 grammar mode. I'll do some more tests, and will make it default lalr1 mode (the old one slow will be renamed to lalr1_by_clr1). Please do the tests as well on your machine, and let me know if you see any issues.

DmitrySoshnikov commented 6 years ago

Reopen automatically closed issue, until the API is stable.

DmitrySoshnikov commented 6 years ago

Default "LALR(1) by SLR(1)" is published in v. 0.1.0, and is now available as just --mode lalr1.

DmitrySoshnikov / syntax

Improve performance of canonical collection building and CLR to LALR compression #52

Before, LALR by CLR1:

After, LALR by SLR1: