Closed NWilson closed 2 months ago
I see some checks are failing... I have marked this PR as draft.
Do we need two options here? Perhaps it could be made explicit by calling the option PCRE2_EXTRA_PYTHON_OCTAL and then applying both your changes. What does Python do for \0? Does it fault it?
What does Python do for \0? Does it fault it?
maps it to NUL, just like Javascript, I suspect at least this part of the change should be done independently and probably also include disabling \x
(which Perl/PCRE also map to NUL, Python and mostly everyone else faults but Javascript treats as a literal "x", as it does PCRE in "BSUX" mode of course).
Oh, OK, if Python is the same as Perl for \0 then indeed it needs a separate option for disabling. It might be better to add a clearer error message because "unrecognized character follows \" could be confusing because "0" is recognized. Maybe "octal digit must follow \0" would be clearer. Incidentally, all the build test actions on GitHub are failing. I haven't investigated why.
Oops, sorry for repeating myself about the failing tests...
I have also fixed an apparent bug in the handling of octals: the escape
\88888888888888
was being treated as a literal 8. The code to handle out-of-range backreferences was not reached ifread_number()
returned false. I have confirmed that this is a deviation from Perl's behaviour.
This should be also on its own PR (that this change could be built upon) IMHO, so the bugfix can be backported as needed.
I suspect the proposed fix might not be correct though (at least when also considering BSUX mode, where an invalid escape is indeed meant to be treated as a literal), but it is also difficult to see it around all other changes and that feature is already partially broken too (an example below from the codepaths that work):
PCRE2 version 10.44 2024-06-07 (8-bit)
re> /\888888888/B,alt_bsux
------------------------------------------------------------------
Bra
888888888
Ket
End
------------------------------------------------------------------
data> 888888888
0: 888888888
% node -e "const re = /\8/; console.log('888888888'.match(re))"
[ '8', index: 0, input: '888888888', groups: undefined ]
This should be also on its own PR (that this change could be built upon) IMHO, so the bugfix can be backported as needed
OK, that's perfectly sensible. I've opened PR #472 solely for the fix.
I don't really like the naming, Philip might have suggesting something better.
Please, any suggestions welcome! I'm never any good with names.
Python (and most dialects) allow \0 as NUL, but it's really quite unfortunate. It leaves you with (in pcre2_substitute) $1 and $0
meaning "ref 1, whole pattern" but \1 and \0
meaning "ref 1, NUL escape".
Usually, PCRE2 just allows all the syntax for Perl/Python/Ruby since it's harmless to parse extras. But in this case, because there are good usability reasons against \0
I think it makes sense to have the ability to disable it.
Java follows Python's octal disambiguation rules - but bans \0
. So there's one major dialect out there with this behaviour.
I see no reason to ban \x00
. When the user wants NUL, they should be allowed to specify it using a hex escape.
It might be better to add a clearer error message because "unrecognized character follows " could be confusing because "0" is recognized
:+1: Thanks, will do
Also, PHP and Ruby unbelievably interpret \0
to mean the same as $0
inside substitution (replacement) strings! That's seriously incompatible with Perl, Python, and all the rest. Let's not even think about that :(
I see no reason to ban
\x00
. When the user wants NUL, they should be allowed to specify it using a hex escape
it is not '\x00' what I was proposing banning, but '\x', probably incorrectly assuming you were more concerned about NUL than octals, since there was no comment explaining why an Python allows \0
as NUL.
Thanks for the explanations, but it seems from it that it should be restricted only in the replacement string?
- I quite like the name PCRE2_EXTRA_PYTHON_OCTAL because it suggests what it actually does - use Python octal rules.
the only problem is that \0 is still allowed in Python, maybe JAVA as described above?
I now think the banning of \0 (and \x) should be a separate option.
Thank you both for the feedback!
it is not '\x00' what I was proposing banning, but '\x', probably incorrectly assuming you were more concerned about NUL than octals, since there was no comment explaining why an Python allows \0 as NUL.
? I'm confused. What use of \x
are you thinking of, other than \xHH
(or \x{HH}
? Or did you mean "x" as "any digit" rather than the actual letter X?
Thanks for the explanations, but it seems from it that it should be restricted only in the replacement string?
It was deliberate. The PCRE2_EXTRA_NO_BS0
excludes \0
in both patterns and replacements. In the dialects I looked at, those which use \
in the replacement string (rather than just $
) seem to match their escaping behaviour between the pattern and the replacement handling.
I think the suggestion is that \x followed by a non-hex digit, should give an error rather than be treated as \x00.
I quite like the name PCRE2_EXTRA_PYTHON_OCTAL because it suggests what it actually does - use Python octal rules.
Whew. I'll keep it at PCRE2_EXTRA_PYTHON_OCTAL
and PCRE2_EXTRA_NO_BS0
. We are agreed they make sense as two different options.
You say numbers above \377 are rejected, but surely that's only in 8-bit, non-UTF mode?
Python rejects octal values above \377
unconditionally. That's probably because of the debacle of Perl's handling... in old Perl versions any higher bits were accepted by discarded, as mentioned by your comment: The original code used just to take the least significant 8 bits of octal numbers...
. So now we have PCRE2 that's changed to follow the new Perl behaviour, while other major engines like .NET follow the old Perl behaviour.
I did some quick archaeology:
Any larger and the behavior that Perl follows is simply to truncate the high bits
(https://source.dot.net/#System.Text.RegularExpressions/System/Text/RegularExpressions/RegexParser.cs,1450). So their behaviour there is quite deliberately following Perl.re
engine source I know quite well... and it's not commented at all. Who knows what motivated them.What a mess. I don't blame Python for just banning values above \377
, if there's no consensus between engines on how they should behave.
Fine, if Python bans anything above \377 unconditionally, then we should do the same when the PYTHON_OCTAL option is set.
I think the suggestion is that \x followed by a non-hex digit, should give an error rather than be treated as \x00.
!!!
:exploding_head:
OK, I didn't know that. Wow. The handling of \x{...}
seems OK: it matches \\x\{[ \t]*[:hex:]+[ \t]*\}
, enforcing at least one hex digit, and well-closed {...}
. But the handling for \xHH
is indeed... surprising.
That's a good candidate for fixing up with the PCRE2_EXTRA_NO_BS0
option. I don't know what we'd call it though.
The .NET behaviour after \x
is to require exactly two hex digits (if no {
). Java on the other hand requires 1 hex digit, then leniently peeks for a second one. Yuck.
It is all extremely yuck indeed. How about PCRE2_EXTRA_NO_IMPLICIT_NUL ? Not a particularly brilliant name, I admit.
I would add that it refers to hex construct. PCRE2_EXTRA_HEX_REQUIRE_CHAR
, because we request a character after \x.
No, because it is also locking out \0 (implicit octal NUL).
\0
isn't really an "implicit" NUL, it's the normal ANSI C escape for NUL. But the \x
thing is super weird.
We could handle them with the same option, even though they are strictly separate. But equally, PCRE2 could probably justify simply fixing the \x
behaviour to require one-or-more hex characters, without any user complaint. If there's a concern about diverging from Perl behaviour, maybe the Perl team would fix it too if we asked!
(Anyway, as a newcomer I'll just do whatever the consensus is, or Philip's decision.)
We could handle them with the same option, even though they are strictly separate.
They are separate, but I think are similar in the fact that they are both inconsistent between engines, so maybe PCRE2_EXTRA_CONSISTENT_NUL?.
Agree that "\x" is weird (maybe even an implementation bug) but even if Perl also considers it eventually a bug and fix it, all we need to do is remove the extra check that we are going to add here to fix it with the "extra" flag, so IMHO there is no problem including both in your "consistency" quest.
I've opened a separate PR for the \x
issue, with the proposal to simply ban it. One day (surely!), PCRE2 will have to swallow the backwards-compatibility cost of the fix, so why delay?
I will pause on updating this PR until the other ones affecting octal handling are merged.
I've opened a separate PR for the
\x
issue, with the proposal to simply ban it. One day (surely!), PCRE2 will have to swallow the backwards-compatibility cost of the fix, so why delay?
Introducing a known incompatibility with Perl is not what the "Perl Compatible Regular Expression" library is known to do though, but doing so with an EXTRA flag is fine, specially when it improves in compatibility as was proposed by this yet to be named flag.
I've rebased and added tests. I'm still getting my head around which modes/builds are used for each of the test files, so we'll see if it's all OK.
PCRE2 follows Perl's rules for disambiguating octals:
\0
or\0x
or\0xx
are always octals\1
to\9
are always backrefs, as are\8x
or\9x
I have added here
PCRE2_EXTRA_NO_AMBIG_BSOCT
to allow following Python's saner rules. These might be a better choice for new applications without backwards compatibility concerns. Basically, with Python you only get octals if you really ask for them. An octal escape has to have either a leading zero (\0 or \0x or \0xx
, or three octal digits\xxx
). The ambiguous cases such as\12
are always backreferences. The disambiguation is unaffected by how many patterns are defined elsewhere.This greatly reduces the risk a user will accidentally write an octal escape that they didn't intend (for example, an off-by-one referencing the last capture group, leading to an octal escape with no error indication).
I have also added the flag
PCRE2_EXTRA_NO_BS0
to disable the\0
NUL escape, which presumably Perl added to mimic C.I have also fixed an apparent bug in the handling of octals: the escape
\88888888888888
was being treated as a literal 8. The code to handle out-of-range backreferences was not reached ifread_number()
returned false. I have confirmed that this is a deviation from Perl's behaviour.TODO