Closed by jbclements 3 years ago
This should follow the leftmost longest rule, i.e. the first match should be returned. It's not very intuitive to pick out a single instance of the matches in a kleene star, which is why I suspect this hasn't come up before.
Note it works correctly in the backtracking path so is specific to the tNFA construction. The tNFA should preserve the first end position after the submatch has matched once. @sjamaan might be the better person to look at this.
I will look into this when I find some quiet time
Thanks! And I don't mean to put you on the spot - I'll look at it eventually if it remains unfixed :)
Of course, no worries! I have to get started with $DAYJOB now, but after some quick checking, I've discovered:
a) if I disable command reordering (i.e., I comment out the second argument to the `or` in `find-reorder-commands`), the code works perfectly. As the comments say, I expected bugs in that code. Now I just need to re-grok how it works and is supposed to work ;)
b) Either way, the dfa looks quite large. This probably has to do with the way the tag and memory slots operate, but it's a bit strange to watch a relatively small nfa explode into a much bigger dfa.
Ah, it turns out the bug is in the reordering commands themselves: they must be ordered in such a way that swaps still work. For example, if two states are identical except that memory slots 0 and 1 are swapped, we currently emit commands like this:
p[0] = p[1]
p[1] = p[0]
Of course this won't work. We'll need to read all slots and then write them. I have committed a trivial patch for it which memoizes all the old values in a closure before executing all the updates. It's not pretty but should work fine. Please have a look at the latest version!
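For anyone reading along, here is a minimal Python sketch of why the sequential commands fail and why snapshotting the old values first fixes it. It is not the irregex source (which is Scheme); the names `apply_sequential` and `apply_parallel` and the `(dst, src)` pair encoding are made up for illustration.

```python
def apply_sequential(slots, moves):
    """Apply (dst, src) copies one at a time, like the buggy command list."""
    p = list(slots)
    for dst, src in moves:
        # A later move may read a slot that an earlier move already overwrote.
        p[dst] = p[src]
    return p


def apply_parallel(slots, moves):
    """Read every source slot first, then perform all the writes."""
    p = list(slots)
    old = [p[src] for _, src in moves]  # snapshot of the values to be copied
    for (dst, _), value in zip(moves, old):
        p[dst] = value
    return p


swap = [(0, 1), (1, 0)]  # p[0] = p[1]; p[1] = p[0]
print(apply_sequential([10, 20], swap))  # [20, 20] -- slot 0's old value is lost
print(apply_parallel([10, 20], swap))    # [20, 10] -- a proper swap
```

The snapshot costs one extra list per transition. Ordering the commands cleverly could avoid that in some cases, but a pure swap has no valid sequential order without a temporary.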
In the following example that uses SRE syntax, a named sub-pattern, and a kleene star (I think the `or` might be necessary too), the string returned by `irregex-match-substring` looks like it goes from the beginning of the first match to the end of the last match (including all chars in between).