Performance of (c/plus (c/alt ...)) combinator vs (c/regexp #"[...]+")

theronic commented 4 years ago

I'm seeing an 80x-120x performance slowdown using the (c/plus (c/alt ...)) combinator to parse a string of special characters as compared to regex, (c/regexp #"[\a\b\c...]+").

In comparison, matching on the same string using Clojure Spec2 Alpha with (s/+ #{\a \b \c ...}) is only 20x slower than the regexp.

The combinators are so much more readable, composable and less error-prone than writing string-based regular expressions, so I'd love to use them, but the performance difference is significant.

Is there a way to speed up a set match, or write a manual matcher? It would be great if I could pass Instaparse any function and it eats whatever the function returns, unless it's a special :instaparse/invalid value.

Here are my Criterium benchmarks (done on i9 MBP16):

(let [test-input        "~|'[@:];|;{{}=:;<];~[^|{?"
       charset           #{\: \; \< \= \> \? \@,
                            \[ \] \\ \^ \_ \',
                            \{ \| \} \~}
      test-spec         (s/+ charset)                     ;; (aside: Spec2 will moan about fully-qualified symbol #bug)
      combinator-parser (insta/parser
                            {:S (c/plus (->> charset
                                          (map (comp c/unicode-char int))
                                          (apply c/alt)))}
                            :start :S)
      regex-parser      (insta/parser
                            {:S (regexp #"[\:\;\<=\>\?\@\[\]\\\^\_\'\{\|\}\~]+")} :start :S)]
    (crit/quick-bench                                       ;; uncomment ones we don't care about
      (s/conform test-spec (seq test-input)) ;; mean: 117.3us
      (insta/parse combinator-parser test-input) ;; mean: 614us
      (insta/parse regex-parser test-input))) ;; mean: 5.43us

(I also tried using the defparser macro, with similiar results.)

Engelberg commented 4 years ago

Generally anything that can be expressed as a regular expression will be way faster as a regular expression.

The regular expression has a huge advantage in that the correct interpretation can be determined locally by what immediately follows and nothing beyond.

Instaparse is aiming to be truly general across all types of grammars, so that means the correct alternative right now could potentially be determined by some character at the very end of the string. So instaparse has to keep open all the other possibilities so it can backtrack to this point later if need be.

Consider an alternative "a" or "aa" in a context of "aaabc". The correct match depends entirely on what kinds of rules make the rest of the string make sense. If there's a rule afterwards to match "abc", that's different than if there is a rule that matches "aabc".

Regexes just make a decision (and you have some control over whether it makes greedy or non-greedy choices with certain annotations), and can do some backtracking within its own local context, but it definitely doesn't have to worry about the context of the other rules in your grammar.

So, short answer is: use a regexp whenever that's appropriate, if you want best performance. Usually regexps are the best choice for the "tokens" in your grammar.

There's a deeper concept to be explored that maybe it would be interesting to allow the user to delimit some sort of scope for backtracking, so once you match up to a certain point, instaparse can throw away all the accumulated backtracking spots generated from unexplored alternatives. This would improve performance considerably, allowing instaparse's alternatives to compete more favorably with regexps, but it's complicated to figure out what the interface for that would look like, and complicated to implement. An interesting research project, but not one I have time to pursue in the near future.

stereobooster commented 4 months ago

I have feature request related to this - provide some kind of syntax ({} maybe) to collapse sub-tree to string. So it would be similar to regex behavior:

S = {'a'+}

would produce the same parse tree as

S = #'a+'

There's a deeper concept to be explored that maybe it would be interesting to allow the user to delimit some sort of scope for backtracking, so once you match up to a certain point, instaparse can throw away all the accumulated backtracking spots generated from unexplored alternatives.

Usually regexps are the best choice for the "tokens" in your grammar.

In Spoofax it is called Lexical syntax

Engelberg / instaparse

Performance of (c/plus (c/alt ...)) combinator vs (c/regexp #"[...]+") #207