Closed p5pRT closed 13 years ago
Try this:
$_ = "1x"; /^(.*)(?=x)x/ ? print "match :$1:\n" : print "no match\n"'
It should print "match :1:" but it actually prints "no match". Now this:
$_ = "1x"; /(.*)(?=x)x/ ? print "match :$1:\n" : print "no match\n"'
It should print "match :1:" but it actually prints "match ::".
What's going wrong? You can see from the following "use re 'debug'" output: The "floating substr" optimisation is broken:
Guessing start of match\, REx `^(.*)(?=x)x' against `1x'...
Found floating substr `x' at offset 1...
By STCLASS: moving 0 --> 1
Guessed: match at offset 1
Matching REx `^(.*)(?=x)x' against `x'
Setting an EVAL scope\, savestack=3
1 \<1> \
It is equally broken with regexes of the form /^(.*(?=x))x/\, /^(.*)(?=[x])x/ and /^(.*)(?=[a-z])x/. /^(.*)(?=[^\n])x/\, /^(.*)(?=[1x])x/ and /^(.*)(?=x)./ are fine.
Have a nice day\, Thanks\, -- Jamie
Jamie Lokier (via RT) \perlbug\-followup@​perl\.org wrote: :Try this: : : $_ = "1x"; /^(.*)(?=x)x/ ? print "match :$1:\n" : print "no match\n"' : :It should print "match :1:" but it actually prints "no match". :Now this: : : $_ = "1x"; /(.*)(?=x)x/ ? print "match :$1:\n" : print "no match\n"' : :It should print "match :1:" but it actually prints "match ::".
Thanks for the report\, I can confirm this bug still exists with the latest development version of perl. It was actually reported before\, and there are example tests (marked as 'expected to fail') in the test suite for exactly this type of case.
The original discussion is at: http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2002-01/msg00650.html
:What's going wrong? You can see from the following "use re 'debug'" :output: The "floating substr" optimisation is broken:
It isn't the floating substr but the stclass optimisation that's broken; quoting from another part of the -Dr output: floating `x' at 0..2147483647 (checking floating) stclass `ANYOF[x]' minlen 1
That's saying: - we must find a string "x" somewhere\, at any position - the matched text must start with something matching character class [x] - the minimum length we will match is 1
Now\, the stclass logic for /.*x/ should be: must start with ( [^\n] or [x] ) .. and for /.*(?=x)x/: must start with ( [^\n] or ([x] and [x]) )
I believe what happens is that regcomp.c:S_study_chunk() uses flags SCF_DO_STCLASS_AND and SCF_DO_STCLASS_OR to signal what we should do\, but because it is working forward through the compiled regexp\, it goes something like: (start) [any] AND .* [^\n] OR (?=x) ([^\n] or [x]) AND x (([^\n] or [x]) and [x]) .. but study_chunk() is quite\, er\, challenging to read through\, and I think I got stuck last time I tried to track this down.
I'll give it another go. :)
Hugo
The RT System itself - Status changed from 'new' to 'open'
hv@crypt.org wrote: :I believe what happens is that regcomp.c:S_study_chunk() uses flags :SCF_DO_STCLASS_AND and SCF_DO_STCLASS_OR to signal what we should do\, :but because it is working forward through the compiled regexp\, it :goes something like: : (start) [any] AND : .* [^\n] OR : (?=x) ([^\n] or [x]) AND : x (([^\n] or [x]) and [x]) :.. but study_chunk() is quite\, er\, challenging to read through\, and :I think I got stuck last time I tried to track this down.
So\, if we have an expression like: a?(?=b)c?(?=d)e (where a..e represent possibly overlapping character classes)\, we need to be able to construct a start class like: a | (b & (c | (d & e)))
I think we can achieve this by passing two classes around\, both an 'or' class and an 'and' class. Initialise to [none] and [any] respectively\, and then: - for each subpattern\, combine its two classes with OR - if the subpattern class should be ANDed\, AND it with our 'and' class - if the subpattern should be ORed\, OR (it AND our 'and' class) with our 'or' class
For the above example that would give: where style or-class and-class ------------------------------------------ (start) and none any a? or a any (?=b) and a b c? or a | (b & c) b (?=d) and a | (b & c) b & d e and a | (b & c) b & d & e .. giving a final class of: (a | (b & c)) | (b & d & e) .. which I think is correctly equivalent to the original: a | (b & (c | (d & e)))
If anyone can follow this\, and can either see flaws or confirm its general correctness that would be very useful.
Hugo
\hv@​crypt\.org writes:
hv@crypt.org wrote: :I believe what happens is that regcomp.c:S_study_chunk() uses flags :SCF_DO_STCLASS_AND and SCF_DO_STCLASS_OR to signal what we should do\, :but because it is working forward through the compiled regexp\, it :goes something like: : (start) [any] AND : .* [^\n] OR : (?=x) ([^\n] or [x]) AND : x (([^\n] or [x]) and [x]) :.. but study_chunk() is quite\, er\, challenging to read through\, and :I think I got stuck last time I tried to track this down.
So\, if we have an expression like: a?(?=b)c?(?=d)e (where a..e represent possibly overlapping character classes)\, we need to be able to construct a start class like: a | (b & (c | (d & e)))
I think we can achieve this by passing two classes around\, both an 'or' class and an 'and' class. Initialise to [none] and [any] respectively\, and then: - for each subpattern\, combine its two classes with OR - if the subpattern class should be ANDed\, AND it with our 'and' class - if the subpattern should be ORed\, OR (it AND our 'and' class) with our 'or' class
For the above example that would give: where style or-class and-class ------------------------------------------ (start) and none any a? or a any (?=b) and a b c? or a | (b & c) b (?=d) and a | (b & c) b & d e and a | (b & c) b & d & e .. giving a final class of: (a | (b & c)) | (b & d & e) .. which I think is correctly equivalent to the original: a | (b & (c | (d & e)))
If anyone can follow this\,
Not at 2nd reading\, but it smells of De Morgan
A&B = ~(~A|~B)
and can either see flaws or confirm its general correctness that would be very useful.
Hugo
On Mon Mar 08 09:11:14 2004\, jamie wrote:
This is a bug report for perl from jamie@shareable.org\, generated with the help of perlbug 1.34 running under perl v5.8.0.
----------------------------------------------------------------- [Please enter your report here]
Try this:
$\_ = "1x"; /^\(\.\*\)\(?=x\)x/ ? print "match :$1​:\\n" : print "no
match\n"'
It should print "match :1:" but it actually prints "no match".
perl-5.8.8: no-match perl-5.8.9-tobe: no-match perl-5.10.0: no-match perl-blead: no-match
Now this:
$\_ = "1x"; /\(\.\*\)\(?=x\)x/ ? print "match :$1​:\\n" : print "no
match\n"'
It should print "match :1:" but it actually prints "match ::".
perl-5.8.8: match :: perl-5.8.9-tobe: match :1: perl-5.10.0: match :1: perl-blead: match :1:
What's going wrong? You can see from the following "use re 'debug'" output: The "floating substr" optimisation is broken:
Guessing start of match\, REx \`^\(\.\*\)\(?=x\)x' against \`1x'\.\.\. Found floating substr \`x' at offset 1\.\.\. By STCLASS​: moving 0 \-\-> 1 Guessed​: match at offset 1 Matching REx \`^\(\.\*\)\(?=x\)x' against \`x' Setting an EVAL scope\, savestack=3 1 \<1> \<x> | 1​: BOL failed\.\.\. Match failed no match
It is equally broken with regexes of the form /^(.*(?=x))x/\, /^(.*)(?=[x])x/ and /^(.*)(?=[a-z])x/. /^(.*)(?=[^\n])x/\, /^(.*)(?=[1x])x/ and /^(.*)(?=x)./ are fine.
This now works as of 5.13.11. --Karl Williamson
@khwilliamson - Status changed from 'open' to 'resolved'
Migrated from rt.perl.org#27515 (status was 'resolved')
Searchable as RT27515$