Open svenha opened 6 months ago
Hi Sven!
Which expressions lead to procedures in the irregex result?
Typically this happens when it needs to resort to backtracking. The backtracking matcher is a closure that performs the matching (see sre->procedure
in the code). There are two reasons compilation will resort to backtracking:
The first case is when you're using something that can't be handled by the NFA (see sre->nfa
, anything not handled there will result in #f
, which triggers the backtracking). Typically, that's "advanced" and non-regular things like (neg-
)look-ahead
and (neg-
)look-behind
, atomic
, if
, commit
and posix-string
, but also bos
, eos
, bol
, eol
, bow
, eow
, nwb
(which are all a form of one-character lookahead/lookbehind), and finally **
, =
and >=
(see next point for these last three). Note that posix-string
is not inherently uncompilable - this could be converted to an SRE and handled recursively in sre->nfa
just like we do in sre->procedure
, or even better it could be done by a preprocessing step. I just made #45 to track this.
The other reason it would resort to backtracking is when the NFA to DFA compilation (see nfa->dfa
) hits a threshold of DFA states. When too many states would be generated, we abort compilation and fall back to the backtracking behaviour. This limit is a bit fuzzy and depends on the small
or fast
arguments of sre->irregex
, see this line. Things like **
, =
and >=
cause a big blowup of states, so you'd want to avoid those (actually, currently their NFA compilation code is commented out completely so even if the repetition count is small it'll still use backtracking).
This is a bit of crude way of doing it, maybe this could be improved.
Also, ideally we should really use an NFA matcher like the SRFI-115 code offers.
Note that irregex comes with a (somewhat hacky) draw-graph.scm
script to visualize the NFA or DFA that is compiled. If there's an NFA but no DFA that means we couldn't compile one. You might find that helpful to get a better understanding of how irregex understands your expression.
tl;dr: Avoid nonregular constructions (and also avoid posix-string
, for now until #45 is fixed), try the fast
option to the irregex
constructor to increase the DFA state size limit. If you do hit the state size, try splitting the regex into smaller subregexes, if possible.
@sjamaan Thanks Peter for the explanations. I will try to simplify the regex string ...
Note we could consider some rewriting options to remove bos, eos, bol, eol, bow, eow, nwb in cases. As a contrived example, consider a regex to extract "key: value" headers, fed one line at a time:
(: bos (=> key word) ":" (* space) (=> value word))
where word
expands to:
(: bos (=> key (seq bow (+ (or alphanumeric #\_)) eow)) ":" (* space) (=> value (seq bow (+ (or alphanumeric #\_)) eow)))
In both cases, the bow and eow are guaranteed from context, so this can be simplified to:
(: bos (=> key (+ (or alphanumeric #\_))) ":" (* space) (=> value (+ (or alphanumeric #\_))))
and the DFA compiler can omit the bos at the start of the match, so this can can be compiled.
More generally, something like bow could be allowed in all locations by effectively storing the word/non-word state from all incoming nodes, but this is a lot of work and can lead to a large explosion in DFA size.
I have a growing base of 50 complex irregex strings. I precompile all of them, store them in a file, and use the result. This works very nicely, but 1 of the irregex results contains a procedure, so that precompiling cannot work here.
Which expressions lead to procedures in the irregex result?
Can this behavior be influenced by options? (The test is run on version 0.9.10 under bigloo; if it helps, I can repeat it under Chicken or Gauche or ...)
Thanks for this excellent tool for the Scheme community!