Closed p5pRT closed 21 years ago
While trying to help out on a problem in clpm, I encountered a regexp bug. If I run the following program:

    print "foobarbar" =~ /^(.{3,3})(.+?)(\2+)$/  ? "Yes\n" : "No\n";
    print "foobarbar" =~ /^(.{3,4})(.+?)(\2+)$/  ? "Yes\n" : "No\n";
    print "foobarbar" =~ /^(.{3,4}?)(.+?)(\2+)$/ ? "Yes\n" : "No\n";
    print "foobarbar" =~ /^(.{2,3}?)(.+?)(\2+)$/ ? "Yes\n" : "No\n";

I get with 5.8.0 and 5.8.1-RC2 the output:
Yes No Yes No
5.005, 5.6.0 and 5.6.1 however give the expected:
Yes Yes Yes Yes
Abigail
On Wed 30 Jul 2003 15:22, "abigail@abigail.nl (via RT)" <perlbug-followup@perl.org> wrote:
While trying to help out on a problem in clpm, I encountered a regexp bug. If I run the following program:

    print "foobarbar" =~ /^(.{3,3})(.+?)(\2+)$/  ? "Yes\n" : "No\n";
    print "foobarbar" =~ /^(.{3,4})(.+?)(\2+)$/  ? "Yes\n" : "No\n";
    print "foobarbar" =~ /^(.{3,4}?)(.+?)(\2+)$/ ? "Yes\n" : "No\n";
    print "foobarbar" =~ /^(.{2,3}?)(.+?)(\2+)$/ ? "Yes\n" : "No\n";

I get with 5.8.0 and 5.8.1-RC2 the output:
Yes No Yes No
5.005, 5.6.0 and 5.6.1 however give the expected:
Yes Yes Yes Yes
Confirmed:
    /pro/bin/perl5.00503  Yes Yes Yes Yes
    /pro/bin/perl5.6.1    Yes Yes Yes Yes
    /pro/bin/perl5.8.0    Yes No Yes No
    /pro/bin/perl5.9.0    Yes No Yes No
--
H.Merijn Brand Amsterdam Perl Mongers (http://amsterdam.pm.org/)
using perl-5.6.1, 5.8.0 & 633 on HP-UX 10.20 & 11.00, AIX 4.2, AIX 4.3,
WinNT 4, Win2K pro & WinCE 2.11. Smoking perl CORE: smokers@perl.org
http://archives.develooper.com/daily-build@perl.org/ perl-qa@perl.org
send smoke reports to: smokers-reports@perl.org, QA: http://qa.perl.org

"abigail@abigail.nl (via RT)" <perlbug-followup@perl.org> wrote:
[...]
: print "foobarbar" =~ /^(.{3,4})(.+?)(\2+)$/ ? "Yes\n" : "No\n";

I'm working with the slightly simpler:

    /^.{3,4}(.+)\1\z/s

What happens is that the .{4} branch successively sets $1 to "arbar" through to "a", fails and backtracks. However it fails to mark $1 as unmatched; after the REGCP_UNWIND() at (blead) regexec.c:3817 we have:

    (gdb) p ((int*)PL_regstartp)[1]
    $31 = 4
    (gdb) p ((int*)PL_regendp)[1]
    $32 = 5
    (gdb) p *PL_reglastparen
    $33 = 1
    (gdb)

This then causes an optimisation to kick in just after the 'repeat:' label in S_regmatch():

    if (PL_regkind[(U8)OP(text_node)] == REF) {          yes, it is
        I32 n, ln;
        n = ARG(text_node);  /* which paren pair */      pair 1
        ln = PL_regstartp[n];                            index 4
        /* assume yes if we haven't seen CLOSEn */
        if (
            (I32)*PL_reglastparen < n ||                 wrong reglastparen, so this fails
            ln == -1 ||                                  open index still marked valid, so this fails
            ln == PL_regendp[n]                          not an empty match (we tried .+, not .*), so this fails
        ) {
            c1 = c2 = -1000;
            goto assume_ok_easy;
        }                                                so we assume we can optimise this
        s = (U8*)PL_bostr + ln;
    }

Andreas, could you find out at what patchlevel this started failing? That should help point us to the best fix.
Hugo
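To make the failure mode above concrete, here is a minimal Perl model of the three-way test quoted from S_regmatch(). It is only an illustrative sketch: the sub name `backref_state_unknown` and its argument order are invented for this example and are not part of perl's source. It is fed the stale values from the gdb session (start 4, end 5, lastparen 1) and, for comparison, the values one would expect if the group had been reset on backtracking (the -1 end index for a reset group is an assumption).

```perl
use strict;
use warnings;

# Hypothetical model of the "assume yes" test; not perl's real code.
# Returns true when the state of paren pair $n cannot be trusted,
# i.e. when the C code would jump to assume_ok_easy.
sub backref_state_unknown {
    my ($lastparen, $start, $end, $n) = @_;
    return $lastparen < $n      # CLOSEn not seen yet
        || $start == -1         # group marked as unmatched
        || $start == $end;      # group matched the empty string
}

# Stale values left behind after backtracking (from the gdb session).
my $stale = backref_state_unknown(1, 4, 5, 1);

# Values one would expect had $1 been marked unmatched (assumed).
my $reset = backref_state_unknown(0, -1, -1, 1);

printf "stale state: %s\n", $stale ? "skip optimisation" : "optimise using old \$1";
printf "reset state: %s\n", $reset ? "skip optimisation" : "optimise using old \$1";
```

With the stale values none of the three clauses is true, so the model takes the optimising path and relies on a $1 that no longer reflects the current match attempt.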
13002
----Program----
use strict;
use warnings;
$\="\n";
print "foobarbar" =~ /^.{3,4}(.+)\1\z/s ? "ok" : "not ok"

----Output of .../pZH7kgr/perl-5.7.2@13001/bin/perl----
ok

----EOF ($?='0')----
----Output of .../pcU5e7z/perl-5.7.2@13002/bin/perl----
not ok

----EOF ($?='0')----
-- andreas
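The "ok" result from the 13001 build can be cross-checked without involving the regex engine at all. The sketch below enumerates by hand the splits that /^.{3,4}(.+)\1\z/s asks for: a prefix of 3 or 4 characters followed by two identical, non-empty halves. The helper name `reference_splits` is made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: list every way to write $str as
#   prefix (3 or 4 chars) . group . group
# which is what /^.{3,4}(.+)\1\z/s requires.
sub reference_splits {
    my ($str) = @_;
    my @found;
    for my $plen (3, 4) {
        my $rest_len = length($str) - $plen;
        # The remainder must split into two equal, non-empty halves.
        next if $rest_len < 2 || $rest_len % 2;
        my $half   = $rest_len / 2;
        my $first  = substr($str, $plen, $half);
        my $second = substr($str, $plen + $half, $half);
        push @found, [ substr($str, 0, $plen), $first ] if $first eq $second;
    }
    return @found;
}

for my $split (reference_splits("foobarbar")) {
    print "prefix '$split->[0]', group '$split->[1]'\n";
}
```

For "foobarbar" the only valid split is the prefix "foo" with the group "bar", so a correct engine has to backtrack from the 4-character prefix to the 3-character one and then match.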
Andreas J Koenig <andreas.koenig@anima.de> wrote:
:13002
Thank you very much. This is the patch that introduced the optimisation, so here is a choice of patches to fix things.
There are three patches below. The first introduces 16 new tests, aiming to exercise all the relevant code paths. Only one of the other two should be applied: the first of those takes the simple approach, simply removing this aspect of the optimisation entirely; the other attempts to fix the problem by ensuring that prog->lastparen is appropriately reset on backtracking through repeats.
My main worry about that last patch is that I don't understand why of the 6 obvious places to do that resetting, precisely 5 are necessary to allow the new tests to pass, and the 6th (commented out) causes a completely different test failure if left in. However all tests pass with the patch as given, and perhaps that is enough.
Hugo
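The patches themselves are not reproduced in this thread, so the 16 new tests cannot be shown here. As a rough stand-in, the four patterns from the original report can be turned into a small regression check with Test::More. This is only a sketch of what such a check might look like, not the tests from the patch.

```perl
use strict;
use warnings;
use Test::More tests => 4;

# The four patterns from the original report; each should match "foobarbar".
my @patterns = (
    qr/^(.{3,3})(.+?)(\2+)$/,
    qr/^(.{3,4})(.+?)(\2+)$/,
    qr/^(.{3,4}?)(.+?)(\2+)$/,
    qr/^(.{2,3}?)(.+?)(\2+)$/,
);

# Record each result first so it can be inspected separately if needed.
my @matched = map { "foobarbar" =~ $_ ? 1 : 0 } @patterns;

ok($matched[$_], "foobarbar matches $patterns[$_]") for 0 .. $#patterns;
```

On an affected perl (5.8.0 per the report) the second and fourth checks would be expected to fail; on a perl with either fix applied all four should pass.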
Earlier I wrote:
:There are three patches below. The first introduces 16 new tests,
:aiming to exercise all the relevant code paths. Only one of the
:other two should be applied: the first of those takes the simple
:approach, simply removing this aspect of the optimisation entirely;
:the other attempts to fix the problem by ensuring that prog->lastparen
:is appropriately reset on backtracking through repeats.
On second thoughts, I think the second patch should be preferred to the third - the latter (marginally) slows down the matching of every fixed-width quantifier [0] to benefit a relatively rare case, that of a fixed-width quantifier followed by a backreference. I guess the commonest case that would benefit from the optimisation is the pattern to match simple quoted strings:

    /(["'])(.*?)\1/

but the more complex cases (eg that handle escaping) would usually require CURLYX rather than CURLY or CURLYM, in which case the optimisation does not apply in any case.
It may be that at a later date we'll discover that other cases can benefit (in either speed or correctness) from relying on an accurate prog->lastparen, in which case the alternative patch should be revisited.
Hugo

[0] that is, anything that matches a constant-width pattern a variable (or specified constant) number of times: /.*/, /[abc]?/ and /(ab|cd){3}/ all qualify, but not /./ (too simple) nor /(a|bc)*/ (too complex).
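For readers unfamiliar with the quoted-string pattern mentioned above, here is a short usage sketch. The sub name `quoted_body` and the sample strings are invented for this example; the pattern itself is the one from the message.

```perl
use strict;
use warnings;

# Hypothetical helper: return the body of the first simply-quoted string
# in $text, or undef if no opening quote has a matching closing quote.
sub quoted_body {
    my ($text) = @_;
    # $1 is the opening quote character; \1 requires the same closing quote.
    return $text =~ /(["'])(.*?)\1/ ? $2 : undef;
}

for my $text (q{say "hello" now}, q{say 'ok' here}, q{"mixed'}) {
    my $body = quoted_body($text);
    print defined $body ? "body: $body\n" : "no quoted string in: $text\n";
}
```

Here a variable-count quantifier over a fixed-width item (.*?) is followed directly by a backreference, which is the shape of pattern the optimisation was aimed at.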
@rgs - Status changed from 'new' to 'resolved'
Migrated from rt.perl.org#23171 (status was 'resolved')
Searchable as RT23171$