djspiewak / parseback

A Scala implementation of parsing with derivatives
http://parseback.io
Apache License 2.0
197 stars 22 forks source link

Implement compaction #32

Open djspiewak opened 7 years ago

djspiewak commented 7 years ago

The original Might, Darias and Spiewak paper reported a 90x speed-up when implementing compaction. Adams, Hollenbeck and Might validated this, as well as providing some improvements to the grammar traversals introduced by repeated compaction.

djspiewak commented 7 years ago

Compaction for everything aside from Union is now implemented in master. The benefits were measurable, but not massive. I saw a roughly 2x improvement.

Two things remain to try:

djspiewak commented 7 years ago

I was able to find a way to implement Union compaction by way of deferred delegation. This is essentially the lazy version of what is done in the Adams Racket implementation (which uses tag mutation instead). I also attempted to implement even more conservative Union compaction: only compacting unions generated by Sequence derivation (since these are guaranteed not to be recursive).

In both cases, I found a serious problem with the whole concept of Union compaction: errors. Aggressive Union compaction makes this very clear with grammars like the following:

lazy val p: Parser[Any] = p ~ "a" | "a"

Apply p to an input string of "b". The error message should be "expected 'a', got 'b'", or something of that sort. Unfortunately, compaction makes this impossible. The right side of the alternate will reduce to a Failure, which will in turn result in a compaction which leaves only the derivative of p ~ "a" w.r.t. b, which in turn will be a compaction of a unionized failure, and thus is simply p ~ "a". The problem is that this is a degenerate form, and the only error it can produce is UnboundedRecursion. The expectation of the next character is completely lost; the information no longer exists in the degenerate form, nor can it be recovered!

More conservative compaction (of just Sequence derivatives) has the same problem, because the form above (p ~ "a" | "a") is in fact derivable from other valid forms. Thus, we cannot safely compact Union in general without throwing away significant error information, which is a non-starter. There may be some very, very specific circumstances wherein it is safe, but I doubt those circumstances form a particularly significant performance optimization. Similarly, it may be possible to carry along "compacted error" information. So rather than compacting to just the right/left of a nulled union, you would compact to the "hypothesized error" of the right/left, where the hypothetical error is merged with any produced errors in the event of failure, or ignored if success. This is unlikely to be substantially more efficient than just Union of a Failure, though it's probably slightly better.

So basically, Union compaction is out. Filter bubbling is still on the table, but unlikely to make a significant impact.

weihsiu commented 7 years ago

does this mean that the goal of coming within the ballpark of parboiled performance is not attainable anymore?

djspiewak commented 7 years ago

@weihsiu Sorry for the delay in responding…  It's unclear actually what the attainable performance is. Efforts here have stalled out on two extremely stupid factors: obtaining a large (millions of lines) set of C sources to test a realistic grammar, and simply my own lack of spare time over the past few months. It's entirely possible that my current benchmarking (which is based on a simple expr grammar) is hitting an inherent pathology in the algorithm. It's certainly quite similar to a known pathological case, so this seems like a real possibility and not just wishful thinking. The goal would be to benchmark with a real grammar (in this case, ANSII C) to see how things actually work in practice, but that has been held up on fiddling with the C preprocessor in order to generate an appropriate corpus.

In the worst case, the performance will remain bad, and we won't be able to replicate the results from the Adams paper due to the constraints of our problem space (thread safety, producing errors, and parsing rather than recognition). If that is indeed the case, then it calls into question the practical applicability of the algorithm, though more implementation work will be needed to really make that assertion.