Perl / perl5

🐪 The Perl programming language
https://dev.perl.org/perl5/
Other
1.97k stars 560 forks source link

Optimization bug preventing a regex containing (*SKIP)(*FAIL) to work correctly #21534

Closed florian-pe closed 1 year ago

florian-pe commented 1 year ago

The following regex doesn't match while it should.

"dir/file.mp3" =~ m{ ([^/]+) (?: / (*SKIP)(*FAIL) | \z ) }x

Disabling the optimization (I don't know which one causes this) with (?{}) makes the match succeed as it should.

say $& if "dir/file.mp3" =~ m{ ([^/]+) (?: (?{}) / (*SKIP)(*FAIL) | \z ) }x

Additionally, Using a possessive quantifier makes the bug go away.

$ perl -E 'say $& if "dir/file.mp3" =~ m{ ([^/]++) (?: / (*SKIP)(*FAIL) | \z ) }x'
file.mp3

When running the faulty regex with use re "debug"; the message Contradicts stclass... [regexec_flags] appears.

$ perl -E 'use re "debug"; say $& if "dir/file.mp3" =~ m{ ([^/]+) (?: / (*SKIP)(*FAIL) | \z ) }x'
Compiling REx " ([^/]+) (?: / (*SKIP)(*FAIL) | \z ) "
Final program:
   1: OPEN1 (3)
   3:   PLUS (6)
   4:     NANYOFM[/] (0)
   6: CLOSE1 (8)
   8: BRANCH (buf:1/1) (16)
  10:   EXACT </> (12)
  12:   SKIP (14)
  14:   OPFAIL (20)
  16: BRANCH (buf:1/1) (FAIL)
  18:   EOS (20)
  19: TAIL (20)
  20: END (0)
stclass NANYOFM[/] plus minlen 1
Matching REx " ([^/]+) (?: / (*SKIP)(*FAIL) | \z ) " against "dir/file.mp3"
Matching stclass NANYOFM[/] against "dir/file.mp3" (12 bytes)
   0 <> <dir/file.m>         |   0| 1:OPEN1(3)
   0 <> <dir/file.m>         |   0| 3:PLUS(6)
                             |   0| NANYOFM[/] can match 3 times out of 2147483647...
   3 <dir> </file.mp3>       |   1|  6:CLOSE1(8)
   3 <dir> </file.mp3>       |   1|  8:BRANCH (buf:1/1)(16)
   3 <dir> </file.mp3>       |   2|   10:EXACT </>(12)
   4 <dir/> <file.mp3>       |   2|   12:SKIP(14)
   4 <dir/> <file.mp3>       |   3|    14:OPFAIL(20)
                             |   3|    failed...
                             |   2|   failed...
Contradicts stclass... [regexec_flags]
Match failed
Freeing REx: " ([^/]+) (?: / (*SKIP)(*FAIL) | \z ) "
demerphq commented 1 year ago

I will try to find some time for this, but I don't know when. I agree that it is a bug. It is surprising to me that the stclass is getting in the way here, also that the possessive quantifier changes the behavior.

demerphq commented 1 year ago

@khwilliamson you created the related patches: 21d1ed54f05 and 56ff0609361.

I think there is a thinko in these patches, as they update previous_occurence_end after executing reg_try(), which seems to partly or mostly defeat the point of checking if s == previous_occurrenc_end in the first place. Unfortunately this logic is inside of a set of macros (as the fundamental logic is repeated over and over with different functions), so it is a bit hard to see.

In the following logic the macro parameter 'f' is a function call which does a scan of the string to find how many characters match a given class:

        REXEC_FBC_UTF8_FIND_NEXT_SCAN(
                        (char *) find_span_end_mask((U8 *) s, (U8 *) strend,
                                                    (U8) ARG1u(c), FLAGS(c)));

The macros are as follows:

/* We keep track of where the next character should start after an occurrence
 * of the one we're looking for.  Knowing that, we can see right away if the
 * next occurrence is adjacent to the previous.  When 'doevery' is FALSE, we
 * don't accept the 2nd and succeeding adjacent occurrences */
#define FBC_CHECK_AND_TRY                                           \
        if (   (   doevery                                          \
                || s != previous_occurrence_end)                    \
            && (   reginfo->intuit                                  \
                || (s <= reginfo->strend && regtry(reginfo, &s))))  \
        {                                                           \
            goto got_it;                                            \
        }

#define REXEC_FBC_UTF8_FIND_NEXT_SCAN(f)                    \
    while (s < strend) {                                    \
        s = (char *) (f);                                   \
        if (s >= strend) {                                  \
            break;                                          \
        }                                                   \
                                                            \
        FBC_CHECK_AND_TRY                                   \
        s += UTF8SKIP(s);                                   \
        previous_occurrence_end = s;                        \
    }

I believe that the update to previous_occurrence_end should be directly after f() has returned, not after reg_try(), which may advance s considerably itself.

EDIT: actually not /directly/ after f() has returned, but before regtry() is executed.

khwilliamson commented 1 year ago

On 9/30/23 08:23, Yves Orton wrote:

@khwilliamson https://github.com/khwilliamson you created the related patches: 21d1ed5 https://github.com/Perl/perl5/commit/21d1ed54f05b0cc903eaa25cbc4575df1a5a89ed and 56ff060 https://github.com/Perl/perl5/commit/56ff0609361466f7eb706d56bdaf69e44342c2e1.

I think there is a thinko in these patches, as they update |previous_occurence_end| after executing reg_try(), which seems to partly or mostly defeat the point of checking if |s == previous_occurrenc_end| in the first place. Unfortunately this logic is inside of a set of macros (as the fundamental logic is repeated over and over with different functions), so it is a bit hard to see.

I still think it is right. But more comments and maybe renaming the variable would be helpful. 'previous_occurrence_end' is set by the macros to point 1 beyond the current occurrence end, in preparation for the next iteration when it will indeed mean the end of the previous occurrence.

I also bisected the failing test. It did not pass on the initial commit that added SKIP, nor in any official release so far. (FAIL was added earlier than *SKIP was)

In the following logic the macro parameter 'f' is a function call which does a scan of the string to find how many characters match a given class:

|REXEC_FBC_UTF8_FIND_NEXT_SCAN( (char ) find_span_end_mask((U8 ) s, (U8 *) strend, (U8) ARG1u(c), FLAGS(c))); |

The macros are as follows:

|/ We keep track of where the next character should start after an occurrence of the one we're looking for. Knowing that, we can see right away if the next occurrence is adjacent to the previous. When 'doevery' is FALSE, we don't accept the 2nd and succeeding adjacent occurrences / #define FBC_CHECK_AND_TRY \ if ( ( doevery \ || s != previous_occurrence_end) \ && ( reginfo->intuit \ || (s <= reginfo->strend && regtry(reginfo, &s)))) \ { \ goto got_it; \ } #define REXEC_FBC_UTF8_FIND_NEXT_SCAN(f) \ while (s < strend) { \ s = (char ) (f); \ if (s >= strend) { \ break; \ } \ \ FBC_CHECK_AND_TRY \ s += UTF8SKIP(s); \ previous_occurrence_end = s; \ } |

I believe that the update to |previous_occurrence_end| should be directly after |f()| has returned, not after |reg_try()|, which may advance s considerably itself.

— Reply to this email directly, view it on GitHub https://github.com/Perl/perl5/issues/21534#issuecomment-1741777189, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAA2DHYWFINNYJ34S2XDW6TX5ATNXANCNFSM6AAAAAA5L3RS3U. You are receiving this because you were mentioned.Message ID: @.***>

demerphq commented 1 year ago

My original theory on how to fix this issue was wrong, or at least more complex than I expected. So I went with a different solution. @khwilliamson it would be nice if you reviewed the previous_occurrence_end logic in light of PREGf_SKIP and the fact that regtry() may modify the string pointer as a side-effect, the existing logic doesn't entirely make sense to me.

Consider a similar case to the one in this bug report:

./perl -Ilib -Mre=debug -le'"AAAAAAABBBBBBBAAAC"=~/([Aa]+(B+(*SKIP)(*FAIL)|[CD]))/ and print $1'

When we start we start matching at pos 0, and match ""AAAAAAABBBBBBB" before we hit the /(SKIP)(FAIL)/, and which causes regtry() to set 's' to point at the last 'B', which is then incremented to point at the first 'A' following the 'B's. We then loop around in the FBC macro, and compute f() again, which returns an 's' pointing at the same spot, and since do_every is false, due to PREGf_SKIP being true, we do not try the pattern again when we should. With the fix in this patch we notice that reginfo->cutpoint is set, thus 's' was updated by regtry() and so we DO try again. But it feels to me like we should not need to inspect reginfo->cutpoint like that, and that we should have noticed in some other way that we were no longer in the original sequence of 'A's, the check on previous_occurrence_end seems just a little simple for what we are trying to do here.