com-lihaoyi / fastparse

Writing Fast Parsers Fast in Scala
https://com-lihaoyi.github.io/fastparse
MIT License
1.09k stars 164 forks source link

StringIn takes a very long time with longer strings #33

Closed tannerezell closed 8 years ago

tannerezell commented 9 years ago

Tested on 0.2.1:

sealed trait Val
case class Var(value : String) extends Val
val startTime = new Date()
val vars = List("\r\n", "\n", "hl", "SA", "avxavxavxavxavxavx").toSeq 
val variables : Parser[Var] = P ( StringIn(vars:_*).! ).map(Var)
val Result.Success(myVars, _) = variables.parse("SA")
val endTime = new Date()
val dateDiff = new SimpleDateFormat("mm:ss").format(new Date(endTime.getTime - startTime.getTime))
println (s"myVars = $myVars")
println (s"took $dateDiff to parse")

results in the following output:

myVars = Var(SA)
took 03:11 to parse

This only happens on long strings, shorter ones process almost instantly. I've tried various combinations of the long string and it always results in an exceedingly long parsing time.

lihaoyi commented 9 years ago

I wonder if it's a consequence of the way StringIn is implemented? A lot of our bitsets are implemented as Char predicates, including those used in StringIn, and those initialize a 65000-pointer long lookup array the first time they're initialized. This may be causing the slowdown, but I can't imagine it taking 3 minutes...

Could you repeat the benchmark moving the initialization (construction of the parser object) outside of the timing block? I thought the initialization would be relatively fast, but maybe it's not, and we should reconsider this approach.

The lookup array implementation lives in the trie here:

https://github.com/lihaoyi/fastparse/blob/master/utils/shared/src/main/scala/fastparse/Utils.scala#L138-L183

On Fri, Aug 7, 2015 at 12:00 AM, tannerezell notifications@github.com wrote:

Tested on 0.2.1:

sealed trait Valcase class Var(value : String) extends Valval startTime = new Date()val vars = List("\r\n", "\n", "hl", "SA", "avxavxavxavxavxavx").toSeq val variables : Parser[Var] = P ( StringIn(vars:*).! ).map(Var)val Result.Success(myVars, ) = variables.parse("SA")val endTime = new Date()val dateDiff = new SimpleDateFormat("mm:ss").format(new Date(endTime.getTime - startTime.getTime)) println (s"myVars = $myVars") println (s"took $dateDiff to parse")

results in the following output:

myVars = Var(SA) took 03:11 to parse

This only happens on long strings, shorter ones process almost instantly. I've tried various combinations of the long string and it always results in an exceedingly long parsing time.

— Reply to this email directly or view it on GitHub https://github.com/lihaoyi/fastparse/issues/33.

tannerezell commented 9 years ago

It looks like when parse is called, rather than when the parser is setup:

sealed trait Val
case class Var(value : String) extends Val

def time[R](unit : => R) : R = {
    val startTime = new Date()
    val result = unit
    val endTime = new Date()
    val dateDiff = new SimpleDateFormat("mm:ss:SS").format(new Date(endTime.getTime - startTime.getTime))
    println (s"took $dateDiff to parse")
    unit
  }

  val vars = List("\r\n", "\n", "hl", "SA", "avxavxavxavxavxavx").toSeq
  val variables : Parser[Var] = time { P ( StringIn(vars:_*).! ).map(Var) }
  val Result.Success(myVars, _) = time { variables.parse("SA") }
  println (s"myVars = $myVars")

and the result:

took 00:00:01 to parse
took 03:27:331 to parse
myVars = Var(SA)

That seems crazy right?

lihaoyi commented 9 years ago

Initialization is lazy IIRC; what if you call parse over and over?

On Fri, Aug 7, 2015 at 5:14 AM, tannerezell notifications@github.com wrote:

It looks like when parse is called, rather than when the parser is setup:

sealed trait Valcase class Var(value : String) extends Val def time[R](unit : => R) : R = { val startTime = new Date() val result = unit val endTime = new Date() val dateDiff = new SimpleDateFormat("mm:ss:SS").format(new Date(endTime.getTime - startTime.getTime)) println (s"took $dateDiff to parse") unit }

val vars = List("\r\n", "\n", "hl", "SA", "avxavxavxavxavxavx").toSeq val variables : Parser[Var] = time { P ( StringIn(vars:*).! ).map(Var) } val Result.Success(myVars, ) = time { variables.parse("SA") } println (s"myVars = $myVars")

and the result:

took 00:00:01 to parse took 03:27:331 to parse myVars = Var(SA)

That seems crazy right?

— Reply to this email directly or view it on GitHub https://github.com/lihaoyi/fastparse/issues/33#issuecomment-128510908.

tannerezell commented 9 years ago
  val vars = List("\r\n", "\n", "hl", "SA", "avxavxavxavxavxavx").toSeq
  val variables : Parser[Var] = time { P ( StringIn(vars:_*).! ).map(Var) }
  val Result.Success(myVars, _) = time { variables.parse("SA") }
  val Result.Success(myVars2, _) = time { variables.parse("SA") }
  val Result.Success(myVars3, _) = time { variables.parse("SA") }
  val Result.Success(myVars4, _) = time { variables.parse("SA") }
  println (s"myVars = $myVars")
took 00:00:00 to parse
took 03:08:530 to parse
took 00:00:00 to parse
took 00:00:00 to parse
took 00:00:00 to parse
myVars = Var(SA)

Seems like only the first time, then its pretty fast

tannerezell commented 9 years ago

I tried my hand at writing an alternate 'StringIn' implementation:

case class ContainsIn(str : Seq[String]) extends Parser[Unit] {
    val sortedList = str.sortBy(_.length).reverse
    override def parseRec(cfg: ParseCtx, index: Int): Mutable[Unit] = {
      def length : Int = { sortedList foreach { p => if (cfg.input.startsWith(p, index)) return p.length }; -1 }
      if (length != -1) success(cfg.success, (), index + length + 0, Nil, cut = false)
      else fail(cfg.failure, index)
    }
    override def toString = {
      s"ContainsIn(${sortedList.map(literalize(_)).mkString(", ")})"
    }
  }

For the most part it works and is very very fast. For grins and giggles I tried to swap it in place of StringIn and ran the tests, it was very unhappy to say the least. I ran it against the examples in the documentation, ran it against a couple variations and it seems to work like I would expect. Maybe you can provide some insight as to why this implementation breaks on the tests.

lihaoyi commented 9 years ago

I don't know why it wouldn't work off the top of my head; you'll have to go and minimize the problem

Alain-Bearez commented 8 years ago

I have been experimenting with a mixed approach, on pull request #54. However this is not ideal, as detailed in this review comment.

The real challenge is to have a solution which is efficient for long running parsers and not to costly on initialisation for one-shot parsers.

lihaoyi commented 8 years ago

I don't understand the problem in detail and I don't understand what your solution is trying to do. For this sort of non-completely-trivial patch, you're going to need to explain what's going on =P

shawjef3 commented 8 years ago

90 seconds to compile StringIn with a 17 character string is way too long.

println("string length,time to compile StringIn (ms)")

for (subset <- "aaaaaaaaaaaaaaaaa".tails.toSeq.reverse) {

  val start = System.currentTimeMillis()

  StringIn(subset)

  val end = System.currentTimeMillis()

  println(s"${subset.size},${end - start}")
}
string length,time to compile StringIn (ms)
0,16
1,5
2,1
3,2
4,4
5,10
6,16
7,17
8,23
9,52
10,79
11,155
12,438
13,1274
14,3264
15,10069
16,29726
17,90797
lihaoyi commented 8 years ago

Yes it is. It's totally screwed and someone needs to fix it.

Anyone want to chip in with an implementation of a Double Array Trie? I've never used one but it seems compact, fast to initialize and fast to query

shawjef3 commented 8 years ago

I'm not so sure you need a fresh implementation. How about the one from Apache commons?

lihaoyi commented 8 years ago

That's plausible. We'd need to...

Nothing fundamentally difficult about this, someone just needs to do it. And I'm spending all day/week/month/year refactoring legacy python code so that person probably won't be me ^_^