EoinDavey / tsPEG

PEG Parser Generator for TypeScript
Mozilla Public License 2.0
192 stars 7 forks source link

Poor performance with simple grammars #21

Closed chancecarey closed 3 years ago

chancecarey commented 3 years ago

I love tsPEG, and was very happily using it to parse a toy programming language, but I ran into some massive performance issues while adding precedence rules.

Not sure what's causing this (perhaps it's my fault, though I'm not sure how I'd be doing anything wrong), but it's a shame, as I was otherwise really enjoying working with the library.

I've added some code to easily reproduce the issue.

Grammar:

---
//@ts-nocheck
---

stmt := 
  a=expression2 'a' b=stmt
| a=expression2 'b' b=stmt
| a=expression2 'c' b=stmt
| a=expression2 'd' b=stmt
| a=expression2 'e' b=stmt
| a=expression2 'f' b=stmt
| a=expression2 'g' b=stmt
| a=expression2 'h' b=stmt
| a=expression2 'i' b=stmt
| expr=expression2

expression2:=
  num = '[0-9]+'
| '\(' expr=stmt '\)'

Test cases:

import { parse } from "./grammar";

const time = (s: string) => {
  const start = new Date();
  parse(s);
  const end = new Date();
  console.log(`Took ${end.getTime() - start.getTime()}ms\tto parse ${s}`);
};

time(`100`);
time(`(100)`);
time(`((100))`);
time(`(((100)))`);
time(`((((100))))`);
time(`(((((100)))))`);
time(`((((((100))))))`);

Timing output:

Took 2ms        to parse 100
Took 1ms        to parse (100)
Took 7ms        to parse ((100))
Took 46ms       to parse (((100)))
Took 264ms      to parse ((((100))))
Took 2342ms     to parse (((((100)))))
Took 15726ms    to parse ((((((100))))))

Thanks!

chancecarey commented 3 years ago

I realize in the above that it's easy to factor out the right side of each statement and it would run much better, but here's another example structured how it would be for precedence rules:

---
//@ts-nocheck
---

e0 := 
  a=e1 op={'a'|'b'} b=e0
| expr=e1

e1 := 
  a=e2 op={'c'|'d'} b=e1
| expr=e2

e2 := 
  a=e3 op={'e'|'f'} b=e2
| expr=e3

e3 := 
  a=e4 op={'g'|'h'} b=e3
| expr=e4

e4 := 
  a=end op={'c'|'d'} b=e4
| expr=end

end :=
  num = '[0-9]+'
| '\(' expr=e0 '\)'
import { parse } from "./grammar";

const time = (s: string) => {
  const start = new Date();
  parse(s);
  const end = new Date();
  console.log(`Took ${end.getTime() - start.getTime()}ms\tto parse ${s}`);
};

time(`100`);
time(`(100)`);
time(`((100))`);
time(`(((100)))`);
time(`((((100))))`);
Took 2ms        to parse 100
Took 6ms        to parse (100)
Took 103ms      to parse ((100))
Took 1983ms     to parse (((100)))
Took 65254ms    to parse ((((100))))
EoinDavey commented 3 years ago

Hi,

This is a result of the recursive descent approach the parser uses. If it matches a prefix of a rule and then fails to match further, it must backtrack all the way back to the beginning, which can lead to exponentially bad parse times with certain grammars. The trick to avoid this is to design your grammar so it doesn't have to backtrack so much.

There is a feature request to add support for memoisation (packrat parsing) that would mean you wouldn't have to worry about things like this (but it would use a bit more memory than it does). I haven't been motivated to implement it, but maybe I should get around to it. Luckily some work I did recently to support left recursion will make it simple (I think).

One thing you can do is to use the optional operator (?) so that the work done matching a prefix doesn't have to be re-done multiples times if there's no binary operator found. For example:

e0 := a=e1 op={'a'|'b'} b=e0
    | expr=e1

can be replaced by

e0 := a=e1 tail={ op='a|b' b=e0}?

and the performance will be much better. The whole grammar (from your second comment) can be replaced by:

e0 := a=e1 tail={op='a|b' b=e0}?
e1 := a=e2 tail={op='c|d' b=e1}?
e2 := a=e3 tail={op='e|f' b=e2}?
e3 := a=e4 tail={op='g|h' b=e3}?
e4 := a=end tail={op='c|d' b=e4}?

end := num='[0-9]+'
    | '\(' expr=e0 '\)'

The times with this grammar are much better:

Took 1ms        to parse 100
Took 0ms        to parse (100)
Took 1ms        to parse ((100))
Took 0ms        to parse (((100)))
Took 0ms        to parse ((((100))))

tsPEG is used to generate the syntax for the Setanta programming language, the grammar for that language is here. In the Setanta grammar I use a different approach for precedence, where I use the * operator instead of recursion. Here's an example of how that would work with the grammar you gave in your second comment:

e0 := a=e1 tail={op='a|b' b=e1}*
e1 := a=e2 tail={op='c|d' b=e2}*
e2 := a=e3 tail={op='e|f' b=e3}*
e3 := a=e4 tail={op='g|h' b=e4}*
e4 := a=end tail={op='c|d' b=end}*

end := num='[0-9]+'
    | '\(' expr=e0 '\)'

The parse times for this are also better:

Took 1ms        to parse 100
Took 1ms        to parse (100)
Took 0ms        to parse ((100))
Took 0ms        to parse (((100)))
Took 1ms        to parse ((((100))))

If you don't like either of those 2 approaches, you can actually use left recursion. Because of a quirk in how left recursion is implemented, you can write the same rules left-recursively and it will be quick.

For example: e0 := e1 'a|b' e0 | e1 becomes e0 := e0 'a|b' e1 | e1.

The full grammar is here:

e0 := a=e0 op='a|b' b=e1
    | e1 
e1 := a=e1 op='c|d' b=e2
    | e2 
e2 := a=e2 op='e|f' b=e3
    | e3 
e3 := a=e3 op='g|h' b=e4
    | e4 
e4 := a=e4 op='c|d' b=end
    | end 

end := num='[0-9]+'
    | '\(' expr=e0 '\)'

Once again the parse times are much better:

Took 2ms        to parse 100
Took 1ms        to parse (100)
Took 1ms        to parse ((100))
Took 0ms        to parse (((100)))
Took 0ms        to parse ((((100))))

I hope this helps!

chancecarey commented 3 years ago

That's really helpful, I've applied the way you mention and it runs far faster.

Thank you so much for taking the time to write out this explanation, totally fixed my issue :-).