Add a compile flag for Python-style octal disambiguation

NWilson commented 2 months ago

PCRE2 follows Perl's rules for disambiguating octals:

\0 or \0x or \0xx are always octals
\1 to \9 are always backrefs, as are \8x or \9x
Other values, which lead with a non-zero octal digit, are interpreted based on how many capture groups are defined elsewhere in the regex

I have added here PCRE2_EXTRA_NO_AMBIG_BSOCT to allow following Python's saner rules. These might be a better choice for new applications without backwards compatibility concerns. Basically, with Python you only get octals if you really ask for them. An octal escape has to have either a leading zero (\0 or \0x or \0xx, or three octal digits \xxx). The ambiguous cases such as \12 are always backreferences. The disambiguation is unaffected by how many patterns are defined elsewhere.

This greatly reduces the risk a user will accidentally write an octal escape that they didn't intend (for example, an off-by-one referencing the last capture group, leading to an octal escape with no error indication).

I have also added the flag PCRE2_EXTRA_NO_BS0 to disable the \0 NUL escape, which presumably Perl added to mimic C.

I have also fixed an apparent bug in the handling of octals: the escape \88888888888888 was being treated as a literal 8. The code to handle out-of-range backreferences was not reached if read_number() returned false. I have confirmed that this is a deviation from Perl's behaviour.

TODO

[ ] Add documentation
[x] Add tests... (just wrestling with the test infra, I'll do it on Monday)

PhilipHazel commented 2 months ago

I see some checks are failing... I have marked this PR as draft.

Do we need two options here? Perhaps it could be made explicit by calling the option PCRE2_EXTRA_PYTHON_OCTAL and then applying both your changes. What does Python do for \0? Does it fault it?

carenas commented 2 months ago

What does Python do for \0? Does it fault it?

maps it to NUL, just like Javascript, I suspect at least this part of the change should be done independently and probably also include disabling \x (which Perl/PCRE also map to NUL, Python and mostly everyone else faults but Javascript treats as a literal "x", as it does PCRE in "BSUX" mode of course).

PhilipHazel commented 2 months ago

Oh, OK, if Python is the same as Perl for \0 then indeed it needs a separate option for disabling. It might be better to add a clearer error message because "unrecognized character follows \" could be confusing because "0" is recognized. Maybe "octal digit must follow \0" would be clearer. Incidentally, all the build test actions on GitHub are failing. I haven't investigated why.

PhilipHazel commented 2 months ago

Oops, sorry for repeating myself about the failing tests...

carenas commented 2 months ago

I have also fixed an apparent bug in the handling of octals: the escape \88888888888888 was being treated as a literal 8. The code to handle out-of-range backreferences was not reached if read_number() returned false. I have confirmed that this is a deviation from Perl's behaviour.

This should be also on its own PR (that this change could be built upon) IMHO, so the bugfix can be backported as needed.

I suspect the proposed fix might not be correct though (at least when also considering BSUX mode, where an invalid escape is indeed meant to be treated as a literal), but it is also difficult to see it around all other changes and that feature is already partially broken too (an example below from the codepaths that work):

PCRE2 version 10.44 2024-06-07 (8-bit)
  re> /\888888888/B,alt_bsux
------------------------------------------------------------------
        Bra
        888888888
        Ket
        End
------------------------------------------------------------------
data> 888888888
 0: 888888888

% node -e "const re = /\8/; console.log('888888888'.match(re))"
[ '8', index: 0, input: '888888888', groups: undefined ]

NWilson commented 2 months ago

This should be also on its own PR (that this change could be built upon) IMHO, so the bugfix can be backported as needed

OK, that's perfectly sensible. I've opened PR #472 solely for the fix.

NWilson commented 2 months ago

I don't really like the naming, Philip might have suggesting something better.

Please, any suggestions welcome! I'm never any good with names.

Python (and most dialects) allow \0 as NUL, but it's really quite unfortunate. It leaves you with (in pcre2_substitute) $1 and $0 meaning "ref 1, whole pattern" but \1 and \0 meaning "ref 1, NUL escape".

Usually, PCRE2 just allows all the syntax for Perl/Python/Ruby since it's harmless to parse extras. But in this case, because there are good usability reasons against \0 I think it makes sense to have the ability to disable it.

Java follows Python's octal disambiguation rules - but bans \0. So there's one major dialect out there with this behaviour.

I see no reason to ban \x00. When the user wants NUL, they should be allowed to specify it using a hex escape.

It might be better to add a clearer error message because "unrecognized character follows " could be confusing because "0" is recognized

:+1: Thanks, will do

NWilson commented 2 months ago

Also, PHP and Ruby unbelievably interpret \0 to mean the same as $0 inside substitution (replacement) strings! That's seriously incompatible with Perl, Python, and all the rest. Let's not even think about that :(

carenas commented 2 months ago

I see no reason to ban \x00. When the user wants NUL, they should be allowed to specify it using a hex escape

it is not '\x00' what I was proposing banning, but '\x', probably incorrectly assuming you were more concerned about NUL than octals, since there was no comment explaining why an Python allows \0 as NUL.

Thanks for the explanations, but it seems from it that it should be restricted only in the replacement string?

PhilipHazel commented 2 months ago

I quite like the name PCRE2_EXTRA_PYTHON_OCTAL because it suggests what it actually does - use Python octal rules.
You say numbers above \377 are rejected, but surely that's only in 8-bit, non-UTF mode?
The build fails are caused by minor changes in error offsets in the tests; updating the test data should fix.

carenas commented 2 months ago

I quite like the name PCRE2_EXTRA_PYTHON_OCTAL because it suggests what it actually does - use Python octal rules.

the only problem is that \0 is still allowed in Python, maybe JAVA as described above?

PhilipHazel commented 2 months ago

I now think the banning of \0 (and \x) should be a separate option.

NWilson commented 2 months ago

Thank you both for the feedback!

it is not '\x00' what I was proposing banning, but '\x', probably incorrectly assuming you were more concerned about NUL than octals, since there was no comment explaining why an Python allows \0 as NUL.

? I'm confused. What use of \x are you thinking of, other than \xHH (or \x{HH}? Or did you mean "x" as "any digit" rather than the actual letter X?

Thanks for the explanations, but it seems from it that it should be restricted only in the replacement string?

It was deliberate. The PCRE2_EXTRA_NO_BS0 excludes \0 in both patterns and replacements. In the dialects I looked at, those which use \ in the replacement string (rather than just $) seem to match their escaping behaviour between the pattern and the replacement handling.

PhilipHazel commented 2 months ago

I think the suggestion is that \x followed by a non-hex digit, should give an error rather than be treated as \x00.

NWilson commented 2 months ago

I quite like the name PCRE2_EXTRA_PYTHON_OCTAL because it suggests what it actually does - use Python octal rules.

Whew. I'll keep it at PCRE2_EXTRA_PYTHON_OCTAL and PCRE2_EXTRA_NO_BS0. We are agreed they make sense as two different options.

You say numbers above \377 are rejected, but surely that's only in 8-bit, non-UTF mode?

Python rejects octal values above \377 unconditionally. That's probably because of the debacle of Perl's handling... in old Perl versions any higher bits were accepted by discarded, as mentioned by your comment: The original code used just to take the least significant 8 bits of octal numbers.... So now we have PCRE2 that's changed to follow the new Perl behaviour, while other major engines like .NET follow the old Perl behaviour.

I did some quick archaeology:

The .NET source code has a comment that says Any larger and the behavior that Perl follows is simply to truncate the high bits (https://source.dot.net/#System.Text.RegularExpressions/System/Text/RegularExpressions/RegexParser.cs,1450). So their behaviour there is quite deliberately following Perl.
The Python re engine source I know quite well... and it's not commented at all. Who knows what motivated them.

What a mess. I don't blame Python for just banning values above \377, if there's no consensus between engines on how they should behave.

PhilipHazel commented 2 months ago

Fine, if Python bans anything above \377 unconditionally, then we should do the same when the PYTHON_OCTAL option is set.

NWilson commented 2 months ago

I think the suggestion is that \x followed by a non-hex digit, should give an error rather than be treated as \x00.

!!!

:exploding_head:

OK, I didn't know that. Wow. The handling of \x{...} seems OK: it matches \\x\{[ \t]*[:hex:]+[ \t]*\}, enforcing at least one hex digit, and well-closed {...}. But the handling for \xHH is indeed... surprising.

That's a good candidate for fixing up with the PCRE2_EXTRA_NO_BS0 option. I don't know what we'd call it though.

The .NET behaviour after \x is to require exactly two hex digits (if no {). Java on the other hand requires 1 hex digit, then leniently peeks for a second one. Yuck.

PhilipHazel commented 2 months ago

It is all extremely yuck indeed. How about PCRE2_EXTRA_NO_IMPLICIT_NUL ? Not a particularly brilliant name, I admit.

zherczeg commented 2 months ago

I would add that it refers to hex construct. PCRE2_EXTRA_HEX_REQUIRE_CHAR, because we request a character after \x.

PhilipHazel commented 2 months ago

No, because it is also locking out \0 (implicit octal NUL).

NWilson commented 2 months ago

\0 isn't really an "implicit" NUL, it's the normal ANSI C escape for NUL. But the \x thing is super weird.

We could handle them with the same option, even though they are strictly separate. But equally, PCRE2 could probably justify simply fixing the \x behaviour to require one-or-more hex characters, without any user complaint. If there's a concern about diverging from Perl behaviour, maybe the Perl team would fix it too if we asked!

(Anyway, as a newcomer I'll just do whatever the consensus is, or Philip's decision.)

carenas commented 2 months ago

We could handle them with the same option, even though they are strictly separate.

They are separate, but I think are similar in the fact that they are both inconsistent between engines, so maybe PCRE2_EXTRA_CONSISTENT_NUL?.

Agree that "\x" is weird (maybe even an implementation bug) but even if Perl also considers it eventually a bug and fix it, all we need to do is remove the extra check that we are going to add here to fix it with the "extra" flag, so IMHO there is no problem including both in your "consistency" quest.

NWilson commented 2 months ago

I've opened a separate PR for the \x issue, with the proposal to simply ban it. One day (surely!), PCRE2 will have to swallow the backwards-compatibility cost of the fix, so why delay?

I will pause on updating this PR until the other ones affecting octal handling are merged.

carenas commented 2 months ago

I've opened a separate PR for the \x issue, with the proposal to simply ban it. One day (surely!), PCRE2 will have to swallow the backwards-compatibility cost of the fix, so why delay?

Introducing a known incompatibility with Perl is not what the "Perl Compatible Regular Expression" library is known to do though, but doing so with an EXTRA flag is fine, specially when it improves in compatibility as was proposed by this yet to be named flag.

NWilson commented 2 months ago

I've rebased and added tests. I'm still getting my head around which modes/builds are used for each of the test files, so we'll see if it's all OK.

PCRE2Project / pcre2

Add a compile flag for Python-style octal disambiguation #470

TODO