PCRE2Project / pcre2

PCRE2 development is now based here.
Other
918 stars 191 forks source link

Non-recursive scan prefix in JIT #560

Closed zherczeg closed 1 week ago

zherczeg commented 1 week ago

The algorithm is rewritten. It should cover more cases and non-recursive.

Testing is difficult though. When it covers cases that it should not, the matching should fail, so that can be tested at some level. However, the opposite, when it does not cover cases, that it should, the effect is just slower matching.

Fixes #558 since it is non-recursive.

NWilson commented 1 week ago

I didn't have time to look at this yesterday; but I will today.

zherczeg commented 1 week ago

I will try to explain the algorithm using examples. The prefix is a simplified pattern, which has a fixed length, and it is a string of small character sets and dots in regexp terms.

Example: /(abc|xbyd)/ prefix is [ax]b[cy] (3 characters long) /a[a-z]b+c/ prefix is a.b (3 characters long) (I decided that + may introduce too many combinations, and not worth to continue. Never proved that it is true or false, just my feelings.) /ab?cd/ prefix is a[bc][cd] (3 characters long) /(ab|cd)|(ef|gh)/ prefix is [aceg][bdfh] (2 characters long)

Why this prefix is collected: jit can generate a simd code, which can search ...[ab]...[cd] patterns very fast (currently disabled on Windows though). So in the prefix we search for maximum two character sets (prefer single characters if possible), where the two sets has no common character. In my experiences the chance of such pairs are very low in a real text, so an ab is quite rare, but a "`" (double space) is frequent. We even prefer an " [ax]" over " `", although the latter is "simpler".

The scan_prefix scans the pattern in a simple loop. However, if an | or character? is encountered, we save its location on the stack, and continue to the ending ')'. This way we can parse /a(bc|cb|xy)e/ as a[bcx][cby]e.

I hope this explains how the algorithm works. In jit, everything is rather complex unfortunately. This gives its speed, but quite challenging for understanding.

Let me know if you need more explanation.

NWilson commented 1 week ago

Ouch! I knew my understanding of the code's intentions didn't match what the code was doing, but I couldn't work out its purpose from just reading the loop.

That's really helpful explanation. I should be able to match that behaviour to the code now.

zherczeg commented 1 week ago

I have added the examples as a comment to the function. Let me know if more changes are needed.

zherczeg commented 1 week ago

I plan to land this patch soon, then #559, then the rest. Please let me know if this needs more changes.

PhilipHazel commented 1 week ago

Most of these choices are made based on how it is easier / faster to process them in the interpreter. Philip probably could tell more about it.

When I first wrote PCRE in 1997, Perl regular expressions were a lot simpler than they are today. I'm actually quite gratified to know that we've managed to keep on upgrading the code over nearly 30 years. The original set of opcodes was invented "off the top of my head" but has been changed over the years. For example, at first I had a "string" opcode rather than OP_CHAR. That had its problems; this is from the ChangeLog for 2.07:

  1. Fixed bug: a zero repetition after a literal string (e.g. /abcde{0}/) was
    causing the entire string to be ignored, instead of just the last character.

It was not until release 5.0 (2004) that the string op was abolished in favour of OP_CHAR. (Reading the PCRE1 ChangeLog can be quite instructive.) Note that JIT support didn't come along until 2011 (release 8.20), a little bit after I got rid of tracking options at runtime (there used to be OP_OPT) and introduced OP_CHARI and other caseful/caseless pairs in 8.13.

So yes, it was the interpreter that influenced all these early decisions.

@zherczeg is also correct in saying that not much attention has been paid to the exact values of error offsets.