com-lihaoyi / fastparse

Writing Fast Parsers Fast in Scala
https://com-lihaoyi.github.io/fastparse
MIT License
1.09k stars 164 forks source link

How to fix an exponential slowness due to backtracking? #300

Closed winitzki closed 10 months ago

winitzki commented 11 months ago

I'm struggling to fix an exponential slowness in fastparse 3.0.2 (Scala 2.13) used for an expression grammar. There is too much backtracking going on, and I have been adding cuts but the exponential slowness remains.

I tried to add memoization to the parsers but this failed with strange error messages about unboxing Unit into Integer.

A small working example of a grammar that shows exponential slowness:

    import fastparse.NoWhitespace._
    import fastparse._
    // Integer calculator program: 1+2*3-(4-5)*6 and so on. No spaces, for simplicity.
    def program[$: P]: P[Int] = P(expr ~ End)
    def expr[$: P]: P[Int]    = P(x_minus | x_plus)
    def x_minus[$: P]         = P(x_times ~ "-" ~ expr)
        .map { case (x, y) => x - y }
    def x_plus[$: P]          = P(x_times ~ ("+" ~ expr).rep)
        .map { case (i, is) => i + is.sum }
    def x_times[$: P]         = P(x_other ~ ("*" ~ x_other).rep)
        .map { case (i, is) => i * is.product }
    def x_other[$: P]         = P(number | ("(" ~ expr ~ ")"))
    def number[$: P]          = P(CharIn("0-9").rep(1))
        .!.map(_.toInt)
    // Verify that this works as expected.
    assert(parse("123*(1+1)", program(_)).get.value == 246)
    assert(parse("123*1+1", program(_)).get.value == 124)
    assert(parse("123*1-1", program(_)).get.value == 122)
    assert(parse("123*(1-1)", program(_)).get.value == 0)

    // Parse an expression of the form `(((((...(1)...)))))`.
     val n = 25
     assert(parse("(" * (n - 1) + "1" + ")" * (n - 1), program(_)).get.value == 1)

The code works but parsing takes about 10 seconds. I found that parsing this expression with n parentheses takes about $2^{n-20}$ seconds. The reason for the exponential slowness is the backtracking. It tries to parse expr after ( and there are two possibilities (minus and plus). Each possibility will need to be fully explored before returning to parse the next (. To explore means again to parse expr recursively. So, there is a 2x increase of parsing work for each parenthesis. To visualize the slowness, consider this function:

def fibonacci(n: Int): Int = if (n <= 1) 1 else fibonacci(n-1) + fibonacci(n-2)

It takes $2^n$ function calls to compute fibonacci(n) by this program.

I tried adding cuts to the grammar at various places but the exponential slowness remains.

So, I have two questions:

1) Can we somehow mitigate this problem by adding cuts, or adding more grammar rules, or in some other way?

2) Memoization could help here: if a parser rule already failed at a certain location, it is not necessary to try again to parse with the same parser rule at the same location. If a parser rule succeeded at a certain location, and a parse is unique, it is not necessary to parse again at the same location. This may eliminate the exponential slowness if implemented correctly. Is there a way to implement that?

winitzki commented 10 months ago

I tried memoization like this:


    val cache_minus = mutable.Map[Int, P[Int]]()
    val cache_plus  = mutable.Map[Int, P[Int]]()

    def x_minus[$](implicit p: P[$]): P[Int] = cache_minus.getOrElseUpdate(p.index, P(x_times ~ "-" ~ expr).map { case (x, y) => x - y })
    def x_plus[$](implicit p: P[$]): P[Int]  = cache_plus.getOrElseUpdate(p.index, P(x_times ~ ("+" ~ expr).rep).map { case (i, is) => i + is.sum })
   // Other grammar rules remain unchanged.

This fails whenever the input contains parentheses:

parse("123*(1+1)", program(_))
// 
java.lang.ClassCastException: class scala.runtime.BoxedUnit cannot be cast to class java.lang.Integer (scala.runtime.BoxedUnit is in unnamed module of loader 'app'; java.lang.Integer is in module java.base of loader 'bootstrap')

    at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:99)

But I was unable to find the code that does this unboxing to Int. Probably, this code is generated by a macro. In any case, I don't know how to fix this problem.