PCRE2Project / pcre2

PCRE2 development is now based here.
Other
919 stars 191 forks source link

Fix OP_REFI for caseless_restrict #516

Closed NWilson closed 1 month ago

NWilson commented 1 month ago

When you introduced caseless_restrict earlier in the year, it looks like you forgot to add it to the OP_REFI (and OP_DNREFI) comparison.

It's actually a bit tricky to add, because it seems the "flags" that are tracked during compile are not written out to the bytecode.

So, I added one extra field to OP_REFI to record the caseless flag. (Currently, it's just "caseless_restrict", but as I mentioned, I'll be adding "turkish_casing" in a future PR.)

I think there's no other way to do it? I hope the flags aren't stored in the bytecode already and I just missed it.

It seems a bit wasteful, perhaps, to spend a uint8 (or uint32 in the case of the 32-bit library), but I think you basically forced this outcome back when you added the (?r) flag allowing caseless_restrict to be varied through the pattern.

NWilson commented 1 month ago

@zherczeg Zoltan, I've half-heartedly updated the JIT code here. It hardcodes the length of the opcodes in many places, so after extending OP_REFI and OP_DNREFI I had to find all references I could.

The JIT code seems to be working on all the existing tests, but fails on the one new test I added, which exercises the caseless_restrict behaviour in a backref.

I'd be grateful if you'd be able to help me out on that.

zherczeg commented 1 month ago

Nice catch! I am not sure which one is better, adding a new opcode (only the case insensitive variant may use restrict), or adding an argument. I would prefer a new opcode. @PhilipHazel what do you think?

NWilson commented 1 month ago

When I add PCRE2_EXTRA_TURKISH_CASING it would double the number of possibilities, so it would end up as 4 opcodes. That's not terrible, so I suppose either is possible.

zherczeg commented 1 month ago

I didn't know about that. Can restricted ascii and Turkish combined? I don't how that casing is working. Another option is using global options in the interpreter, if possible.

NWilson commented 1 month ago

Unfortunately, all four options are possible, in theory (-r -turk, +r -turk, -r +turk, +r +turk).

And it can't use global options because Philip already added an inline option /(?r: ... )/ rather than making it a pre-pattern option instead.

zherczeg commented 1 month ago

I completely forgot that it is not a pre-pattern. I simply used pattern flags here: https://github.com/PCRE2Project/pcre2/blob/master/src/pcre2_compile_class.c#L448

No test failed. Anyway it could be fixed by passing extra options.

zherczeg commented 1 month ago

JIT fix https://github.com/PCRE2Project/pcre2/commit/ae118788448e669fef8997ab9ac3f9dbb20daf4c

Added some test as well. It turned out that DNREFI support was not added to jit :(

Feel free to use the commit, no need to mention me in the patch.

PhilipHazel commented 1 month ago

I see the CI tests are failing.

Historical note: the original PCRE did keep track of the options during matching, and there was (for example) only OP_CHAR, not OP_CHARI. However, all the saving and restoring got complicated and here is a ChangeLog entry for 8.13:

  1. Some internal refactoring has changed the processing so that the handling
    of the PCRE_CASELESS and PCRE_MULTILINE options is done entirely at compile time (the PCRE_DOTALL option was changed this way some time ago: version
    7.7 change 16). This has made it possible to abolish the OP_OPT op code,
    which was always a bit of a fudge.

Doing all the options handling at compile time should also in theory make matching a little bit faster. PCRE1 still uses the stack for backtracking, and the above change reduced the number of arguments to the recursive function, which helped with stack usage. This is not relevant for PCRE2 since the 10.30 refactoring.

I added (?r) because it seemed right to make it the same as other options, but also so that pattern creators who have no access to the calling code can set it for the whole pattern in the same way as (?i) etc., though it could have been (*CASELESS_RESTRICT).

As for whether to add a new opcode or an argument, either will require checking every reference to the original, so much the same amount of work. We are not short of opcodes so it may make sense to add four but isn't it eight? Four versions of OP_REFI and four versions of OP_DNREFI, or am I missing something? Eight is rather a lot; if I'm right about that, then perhaps an argument makes more sense. But I don't really mind which you do, though multiple opcodes might execute faster? BUT there are actually quite a lot of opcodes that end in ...I. Do they all need looking at now there are (will be) four different ways of doing a caseless match?

carenas commented 1 month ago

Unfortunately, all four options are possible, in theory (-r -turk, +r -turk, -r +turk, +r +turk)

IMHO the new "turk" flag would be incompatible with +r, because by definition what the later does is IGNORE any othercase that crosses the ASCII/UTF boundary (see Perl's /aa mode).

of course I might be wrong, since I haven't seen the "turk" flag implementation but I am assuming it implies the 2 entries in CaseFolding.txt we skip:

0049; T; 0131; # LATIN CAPITAL LETTER I
0130; T; 0069; # LATIN CAPITAL LETTER I WITH DOT ABOVE
zherczeg commented 1 month ago

Oh, a /* Fall through. */ is missing from the patch.

carenas commented 1 month ago

Oh, a /* Fall through. */ is missing from the patch.

Fixed in ac76507bf515628446a5ace942ca503d03e5dca5; took me a while to find the "standard" spelling though since it seems only sljit has those until now.

NWilson commented 1 month ago

Historical note: the original PCRE did keep track of the options during matching, and there was (for example) only OP_CHAR, not OP_CHARI. However, all the saving and restoring got complicated and here is a ChangeLog entry for 8.13:

Thank you Philip, that's really useful and interesting!

I added (?r) because it seemed right to make it the same as other options, but also so that pattern creators who have no access to the calling code can set it for the whole pattern in the same way as (?i) etc., though it could have been (*CASELESS_RESTRICT).

There's no reason to regret it now, it's most-flexible and works well.

We are not short of opcodes so it may make sense to add four but isn't it eight? Four versions of OP_REFI and four versions of OP_DNREFI, or am I missing something?

Yes, it would be four each, you've counted correctly.

BUT there are actually quite a lot of opcodes that end in ...I. Do they all need looking at now there are (will be) four different ways of doing a caseless match?

Aha, no! Because the other OP_...I opcodes have quite specific behaviours tied to the default case-folding relationship. For example, OP_CHARI is fixed to match exactly two characters (the one that follows it in the bytecode, and that char's "other case"). The notion of "other case" isn't affected by caseless-restrict, so no change is needed.

For caseless-restrict, it really is just the REFI and DNREFI code that needs to care about the specifics of the case equivalence.

NWilson commented 1 month ago

IMHO the new "turk" flag would be incompatible with +r, because by definition what the later does is IGNORE any othercase that crosses the ASCII/UTF boundary (see Perl's /aa mode).

I think they're compatible, logically. (?r) removes the case-folding rules that cross the ASCII/non-ASCII boundary (which is actually just two characters). And Turkish adds two case-folding rules for the Turkish alphabet. They aren't touching the same characters, so they both make sense.

Maybe a user doesn't like the "Kelvin sign"? Not a problem - you can remove that, and you still get to choose whether to have the Turkish mappings or not.

To put it another way: if caseless-restrict is useful, why shouldn't Turkish users be able to select it?

NWilson commented 1 month ago

Thank you very much all three of you for your help, on this very fussy little detail fix!

carenas commented 1 month ago

To put it another way: if caseless-restrict is useful, why shouldn't Turkish users be able to select it?

because it is useful only in a specific context (see #11, although the Perl link I posted earlier is probably clearer IMHO), where the user intentionally wants to make sure that a caseless match is only within (or not) ASCII to avoid "surprises".

a "turk" flag user wants to have Unicode characters, and indeed will add 2 more cross between ASCII and Unicode when doing caseless matches, which is what (?r) is meant to prevent.

you are correct it doesn't impact the same characters, and your interpretation of how it could work together is valid, but the point I was trying to make is that (?r) restricts ALL characters (eventhough they are currently only 2 pairs affected in our case tables). as a plus we would only need 3 versions of each affected opcode.

zherczeg commented 1 month ago

@carenas I see /* Fall through */ in the other places as well, if you prefer to use /* fallthrough */, change it everywhere (in another patch). I also prefer one patch for one PR.

zherczeg commented 1 month ago

I hope Perl does not plan to use 'r' for something different.

NWilson commented 1 month ago

Rebased, and fixed to "Fall through" for consistency.

zherczeg commented 1 month ago

@carenas if you are ok with the patch, I will land it. @NWilson please update the test conflicts and merge commits

NWilson commented 1 month ago

Rebased, squashed, and fixed the conflicts with the PR you just merged.

carenas commented 1 month ago

Usually Philip follows up with some documentation in ChangeLog, which this change (and most others) were missing.

PhilipHazel commented 1 month ago

Previously, when I wanted a new (?some-letter I used an upper case letter (e.g. (?J)) to try to keep away from potential Perl changes, but (?R) was already in use so I must have decided to take a chance on (?r).

Re ChangeLog: I've been thinking about this. Since the move to Git the contents of ChangeLog haven't really kept in step the way they used to, though as @carenas says, I've done some post hoc additions from time to time. Over the years, I've found it useful as a way of remembering what changed when, but perhaps others are less interested. (Reading the PCRE1 ChangeLog from the start - version 0.91 - is historically instructive as it reminds one of how much has changed since 1997.) The other think I've used ChangeLog for is for updating the NEWS file from it just before a release. I think there are perhaps three possible ways to go:

  1. Close ChangeLog and in future rely only on Git log entries.
  2. From time to time, probably just before a release, go through the Git log and update ChangeLog appropriately.
  3. Keep hassling everybody to update ChangeLog for every change.

What do people think?

carenas commented 1 month ago

I didn't meant from my comment to be a policy discussion, just a reminder for a task that needed to be done and indeed an excuse for myself to do it if it wasn't tackled independently (as shown in #519).

I do think though that keeping the ChangeLog in good shape is important, and also that is done in a timely way so that it could be described properly (specially considering that commits and PR descriptions as of now are not consistently used to work as the source of an automated replacement).

Updating them with the committed changes is not ideal though (as they will likely result in conflicts), so probably 2 might be the only reasonable short term option?