Closed PhilipHazel closed 1 month ago
FYI, it changed behavior somewhere between 5.32 and 5.34:
--- perl5.30.3 ---
no
--- perl5.32.1 ---
no
--- perl5.34.3 ---
panic: regrepeat() called with unrecognized node type 98='OPFAIL' at -e line 1.
--- perl5.36.3 ---
panic: regrepeat() called with unrecognized node type 98='OPFAIL' at -e line 1.
--- perl5.38.2 ---
panic: regrepeat() called with unrecognized node type 99='OPFAIL' at -e line 1.
--- perl5.39.9 ---
panic: regrepeat() called with unrecognized node type 99='OPFAIL' at -e line 1.
It bisects to 4f0d304ec835f478a4dd9b4ab7af01f5b826c6d7.
bad - non-zero exit from ./perl -Ilib -e "" =~ /[^\S\W]{6}/
4f0d304ec835f478a4dd9b4ab7af01f5b826c6d7 is the first bad commit
commit 4f0d304ec835f478a4dd9b4ab7af01f5b826c6d7
Author: Hugo van der Sanden <hv@crypt.org>
Date: Tue Apr 21 11:50:18 2020 +0100
regexec: disallow zero-width nodes in regrepeat
GH #17594: the logic here expects the node to have width 1 (except for
LNBREAK), it is not expected to do the right thing on zero-width nodes.
regexec.c | 19 -------------------
1 file changed, 19 deletions(-)
This is very weird. There shouldn't be a zero width node from this pattern.
I think this is the bug (from regcomp.c):
/* All possible optimizations below still have these characteristics.
* (Multi-char folds aren't SIMPLE, but they don't get this far in this
* routine) */
*flagp |= HASWIDTH|SIMPLE;
[^\W\S] is an empty set, so the optimizer rewrites it to OPFAIL, which is no longer SIMPLE.
Oh, I see: it is the empty set because not-space includes word chars, and not-word includes space chars. So [\S\W] includes all codepoints, and thus its inverse contains none. Nice. I didn't catch on to that at first; I was wondering why it doesn't match "word and space chars", i.e., why it wasn't the same as /[\s\w]/.
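For anyone who wants to convince themselves of the set logic, here is a small check. It uses Python's `re` module as a stand-in for Perl's engine, which is an assumption made purely for illustration. It is restricted to ASCII (via `re.ASCII`), where `\s` and `\w` are simple to reason about. The names `UNION`, `INVERSE`, `uncovered`, `accepted` and `space_or_word` are made up for this sketch.

```python
import re

# Assumption: Python's `re` stands in for Perl's regex engine here; only the
# character-class set logic is being illustrated, not Perl's optimizer.
UNION = re.compile(r"[\S\W]", re.ASCII)     # not-space OR not-word
INVERSE = re.compile(r"[^\S\W]", re.ASCII)  # i.e. space AND word

ascii_chars = [chr(cp) for cp in range(128)]

# Characters that the union fails to cover (expected: none).
uncovered = [c for c in ascii_chars if not UNION.fullmatch(c)]

# Characters that the negated class accepts (expected: none).
accepted = [c for c in ascii_chars if INVERSE.fullmatch(c)]

# By contrast, /[\s\w]/ is "space OR word", which is clearly non-empty.
space_or_word = [c for c in ascii_chars if re.fullmatch(r"[\s\w]", c, re.ASCII)]

print(len(uncovered), len(accepted), len(space_or_word))
```

The point is only that a character would have to be both a space and a word character to match `[^\S\W]`, and no such character exists in this range.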
So I guess the question here is: should OPFAIL be handled specially in regrepeat? It is not zero-width in the same way that most other zero-width patterns are. It always fails, so it wouldn't matter that it doesn't match 1 character.
I almost think that treating OPFAIL as simple is fine, as you can say that it could match anything (including 1 character) if it were to match, but it always fails so it never matches anything. (Yes, that is a twisted thought, but it also makes sense at the same time.)
I suspect we should just tweak 4f0d304ec835f478a4dd9b4ab7af01f5b826c6d7 by keeping OPFAIL in the list. It is not the same as a truly zero-width assertion like \b or ^ or $ or whatnot.
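To make the "always fails, so width does not matter" argument concrete, here is a toy model of a repeat loop. It is not the code in regexec.c: the functions `regrepeat`, `opfail` and `match_quantified` below are hypothetical Python stand-ins, written only to show that a node which never matches yields a repeat count of 0, so a `{6}` quantifier simply fails instead of needing a panic.

```python
def regrepeat(matches_one, text, pos, max_count):
    """Toy model: count consecutive chars from pos accepted by matches_one.

    Hypothetical sketch only; the real regrepeat() dispatches on node type.
    """
    count = 0
    while (count < max_count
           and pos + count < len(text)
           and matches_one(text[pos + count])):
        count += 1
    return count


def opfail(_ch):
    """Stand-in for an OPFAIL node: rejects every character."""
    return False


def match_quantified(matches_one, text, pos, min_count, max_count):
    """True if at least min_count repetitions are found at pos."""
    return regrepeat(matches_one, text, pos, max_count) >= min_count


# Analogue of: "" =~ /[^\S\W]{6}/ once the class has been reduced to OPFAIL.
print("yes" if match_quantified(opfail, "", 0, 6, 6) else "no")
```

Under this model the one-liner reports a plain non-match, which is what 5.30 and 5.32 printed above.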