PCRE2Project / pcre2

PCRE2 development is now based here.
Other
921 stars 194 forks source link

`pcre2_jit_compile` returns `PCRE2_ERROR_NOMEMORY` on patterns compiled with `PCRE2_AUTO_CALLOUT` #437

Closed siegel closed 3 months ago

siegel commented 3 months ago

Version info from pcre2.h:

/* The current PCRE version information. */

#define PCRE2_MAJOR           10
#define PCRE2_MINOR           45
#define PCRE2_PRERELEASE      -DEV
#define PCRE2_DATE            2024-06-09

Platform: macOS (Darwin), 14.6, arm64. (The OS version is inconsequential; I've not yet tested on Intel but it was happening there too, as of a long time ago.)

At runtime, I programmatically generate this pattern:

(?(?=\C|\z)(?=\C|\z)(?=\C|\z)|(?P<comment>#.*$)(?P<string>(?x:
            (?> "   (?s: \\. | [^"] )*?     (?: " | $)  )   |
            (?> '   (?s: \\. | [^'] )*?     (?: ' | $)  )   |
            (?> `   (?s: \\. | [^`] )*?     (?: ` | $)  )
        ))(?P<number>^\.PHONY:[^\\\r]*(\\\r[^\\\r]*)*))(?x:
            (?P>comment)    |
            (?P>number) |
            (?P>string)
        )

The code which does this is fairly intertwined, but what gets run looks like this (NB: I am using pcre2_16):

    pcre2_compile_context
                        *compileContext = NULL;
    pcre2_code          *pcre = NULL;

    compileContext = pcre2_compile_context_create(NULL);
    verify_noerr(pcre2_set_compile_extra_options(compileContext, PCRE2_EXTRA_ESCAPED_CR_IS_LF));

    pcre = pcre2_compile(pattern, patternLength, options, &errornum, &erroroffset, compileContext);
    if (NULL != pcre)
    {
        int jitError = 0;

        jitError = pcre2_jit_compile(pcre, PCRE2_JIT_COMPLETE);
    }

At the time this runs, options may vary, but in this case its value is 0xc00a0404. Notably, this includes PCRE2_AUTO_CALLOUT.

When compiling the generated pattern above, the call to pcre2_jit_compile returns PCRE2_ERROR_NOMEMORY (-48). This isn't a new issue, and most of the time life goes on just fine without JIT. But it's been bothering me that I can't JIT these patterns, so I started to debug it.

To make a very long story short, I found that on line 14283 in pcre2_jit_compile.c it returns PCRE2_ERROR_NOMEMORY if check_opcode_types() returns FALSE:

if (!check_opcode_types(common, common->start, ccend))
  {
  SLJIT_FREE(common->optimized_cbracket, allocator_data);
  return PCRE2_ERROR_NOMEMORY;
  }

So I set breakpoints in the places in that function where there's a return FALSE;. And this one fires, on line 1167:

    case OP_COND:
    case OP_SCOND:
    /* Only AUTO_CALLOUT can insert this opcode. We do
       not intend to support this case. */
    if (cc[1 + LINK_SIZE] == OP_CALLOUT || cc[1 + LINK_SIZE] == OP_CALLOUT_STR)
      return FALSE;
    cc += 1 + LINK_SIZE;
    break;

My pattern is indeed compiled with PCRE2_AUTO_CALLOUT, so the callout opcodes will be in the compiled pattern.

This strikes me as odd, because pcre2_jit_compile() does support ordinary callouts:

    case OP_CALLOUT:
    case OP_CALLOUT_STR:
    if (common->capture_last_ptr == 0)
      {
      common->capture_last_ptr = common->ovector_start;
      common->ovector_start += sizeof(sljit_sw);
      }
    cc += (*cc == OP_CALLOUT) ? PRIV(OP_lengths)[OP_CALLOUT] : GET(cc, 1 + 2*LINK_SIZE);
    break;

Would it be possible to relax the restriction, so that pcre2_jit_compile() allows auto-callouts?

Additionally, the documentation isn't quite in agreement:

It doesn't mention that pcre2_jit_compile(3) will also return PCRE2_ERROR_NOMEMORY for other specific reasons, such as if the pattern uses auto-callouts.

This suggests that perhaps the behavior I'm currently encountering is a bug; but if it's intentional the documentation needs to clearly state that PCRE2_AUTO_CALLOUT may cause JIT compiles to fail.

So:

  1. I think it makes sense for pcre2_jit_compile() return specific codes for specific failures, not just "can't allocate memory" (which is a lie in this case; there was no allocation failure, just a bad opcode).

  2. It appears that pcre2_jit_match() supports callouts generally. So is it possible that the auto-callout test in check_opcode_types() is erroneous, and that auto-callouts ought to be allowed at that point?

  3. If auto-callouts are not to be supported in JIT-ed patterns, I'd like to have a flag to pass to pcre2_jit_compile() that says "I don't care whether this contains auto-callouts; compile it anyway and NOP them." (I don't know how practical that is, though.)

Thanks for reading. :-)

zherczeg commented 3 months ago

I don't remember why the code returns with no memory for unsupported features. It does not make sense. Maybe there was no better error code. Adding new error codes will be a new feature.

Auto callouts are supported in general, but for conditions, we probably need to save/restore several registers. Implementing this would be a new feature as well.

siegel commented 3 months ago

I don't remember why the code returns with no memory for unsupported features. It does not make sense. Maybe there was no better error code. Adding new error codes will be a new feature.

Should I create a new issue for that specific feature? (I'm happy to make an argument that it's a bug, but if you're doing the work and consider it a feature, I will write it up as a feature. :-) )

Auto callouts are supported in general, but for conditions, we probably need to save/restore several registers. Implementing this would be a new feature as well.

Can you elaborate on this a bit? I'm not clear on what exactly is occuring at that point in the pattern compilation.

(Also, if you consider support for autocallouts after conditionals to be a feature to be added, should I likewise write up a new issue for it? How should I title it?)

zherczeg commented 3 months ago

You can open reports if you wish, but it is not necessary.

The only thing I remember is that the conditionals with callouts was a complex case. Something was needed to do for conditionals before and after the callout happens. The interpreter also does some specialization:

https://github.com/PCRE2Project/pcre2/blob/master/src/pcre2_match.c#L5627

zherczeg commented 3 months ago

441

zherczeg commented 3 months ago

I have checked the callout problem. The main issue is that the callout can escape from conditional before the condition is checked. Supporting cond/scond is already requires a lot of specialization in the code, and supporting iterated conditional with an extra escape is a rather complex case. Introducing bugs in the code is very likely.

siegel commented 3 months ago

I appreciate you taking the time to look into the feasibility.

I ended up resolving this issue by turning off PCRE2_AUTO_CALLOUT for these patterns and instead generating the pattern with explicit (?C) callouts (which are fine for my needs), and further by reworking the pattern to use (?s).|\z instead of \C (the latter which fails to JIT compile when the pattern was compiled with PCRE2_UTF, as mine are).

So I think this issue can probably be closed leaving #441. :-)

PhilipHazel commented 3 months ago

I'll be reviewing #441 later today. Closing this issue now.