Some compilation results of `irregex` cannot be cached (write/read)

svenha commented 6 months ago

I have a growing base of 50 complex irregex strings. I precompile all of them, store them in a file, and use the result. This works very nicely, but 1 of the irregex results contains a procedure, so that precompiling cannot work here.

Which expressions lead to procedures in the irregex result?

Can this behavior be influenced by options? (The test is run on version 0.9.10 under bigloo; if it helps, I can repeat it under Chicken or Gauche or ...)

Thanks for this excellent tool for the Scheme community!

sjamaan commented 6 months ago

Hi Sven!

Which expressions lead to procedures in the irregex result?

Typically this happens when it needs to resort to backtracking. The backtracking matcher is a closure that performs the matching (see sre->procedure in the code). There are two reasons compilation will resort to backtracking:

The first case is when you're using something that can't be handled by the NFA (see sre->nfa, anything not handled there will result in #f, which triggers the backtracking). Typically, that's "advanced" and non-regular things like (neg-)look-ahead and (neg-)look-behind, atomic, if, commit and posix-string, but also bos, eos, bol, eol, bow, eow, nwb (which are all a form of one-character lookahead/lookbehind), and finally **, = and >= (see next point for these last three). Note that posix-string is not inherently uncompilable - this could be converted to an SRE and handled recursively in sre->nfa just like we do in sre->procedure, or even better it could be done by a preprocessing step. I just made #45 to track this.

The other reason it would resort to backtracking is when the NFA to DFA compilation (see nfa->dfa) hits a threshold of DFA states. When too many states would be generated, we abort compilation and fall back to the backtracking behaviour. This limit is a bit fuzzy and depends on the small or fast arguments of sre->irregex, see this line. Things like **, = and >= cause a big blowup of states, so you'd want to avoid those (actually, currently their NFA compilation code is commented out completely so even if the repetition count is small it'll still use backtracking).

This is a bit of crude way of doing it, maybe this could be improved.

Also, ideally we should really use an NFA matcher like the SRFI-115 code offers.

Note that irregex comes with a (somewhat hacky) draw-graph.scm script to visualize the NFA or DFA that is compiled. If there's an NFA but no DFA that means we couldn't compile one. You might find that helpful to get a better understanding of how irregex understands your expression.

tl;dr: Avoid nonregular constructions (and also avoid posix-string, for now until #45 is fixed), try the fast option to the irregex constructor to increase the DFA state size limit. If you do hit the state size, try splitting the regex into smaller subregexes, if possible.

svenha commented 6 months ago

@sjamaan Thanks Peter for the explanations. I will try to simplify the regex string ...

ashinn commented 6 months ago

Note we could consider some rewriting options to remove bos, eos, bol, eol, bow, eow, nwb in cases. As a contrived example, consider a regex to extract "key: value" headers, fed one line at a time:

  (: bos (=> key word) ":" (* space) (=> value word))

where word expands to:

  (: bos (=> key (seq bow (+ (or alphanumeric #\_)) eow)) ":" (* space) (=> value (seq bow (+ (or alphanumeric #\_)) eow)))

In both cases, the bow and eow are guaranteed from context, so this can be simplified to:

  (: bos (=> key (+ (or alphanumeric #\_))) ":" (* space) (=> value (+ (or alphanumeric #\_))))

and the DFA compiler can omit the bos at the start of the match, so this can can be compiled.

More generally, something like bow could be allowed in all locations by effectively storing the word/non-word state from all incoming nodes, but this is a lot of work and can lead to a large explosion in DFA size.

ashinn / irregex

Some compilation results of `irregex` cannot be cached (write/read) #44