PCRE2Project / pcre2

PCRE2 development is now based here.
Other
894 stars 186 forks source link

Add test and fix for too-large octals starting with \8 or \9 #472

Closed NWilson closed 2 weeks ago

NWilson commented 2 weeks ago

For backreferences of the form \8000000 (larger than GROUP_MAX but smaller than INT_MAX) the regex was rejected as invalid, due to an out-of-range backreference.

However, for those of the form \80000000000 (larger than INT_MAX) it would treat it as a escape for the literal character '8', which is inconsistent with Perl (confirmed by test).

This appears to be just an oversight in the code, and there's no test case so this condition wasn't covered.

NWilson commented 2 weeks ago

revert max_value change in read_number to make result consistent with further check for GROUP_MAX (when needed), the following is a valid test that would be broken otherwise:

You test passes on my branch. I've added your test case to the PR.

There's no need to reference INT_MAX anywhere in this code, because the behaviour does not change on reaching 2,000,000,000 (values below and above this should be treated the same).

PhilipHazel commented 2 weeks ago

So how do we decide between #472 and #476 if they both do the same job?

NWilson commented 2 weeks ago

That's your job as project leader, I guess :) :)

Personally, I prefer my fix for the reasons I've commented - smaller diff compared to the original code; one less duplicated copy of the "char >= 8" logic; and thirdly avoids behaviour around INT_MAX which isn't relevant to the user.

But, of course I won't be offended if you prefer Carlo's version for any reason. I don't really care, as long as it works the same.

carenas commented 2 weeks ago

Personally, I prefer my fix for the reasons I've commented - smaller diff compared to the original code; one less duplicated copy of the "char >= 8" logic; and thirdly avoids behaviour around INT_MAX which isn't relevant to the user.

My fix is indeed much smaller (compared with the alternative fix) and complete (as it also includes optionally fixing the bogus erroroffset).

I was originally working on top of this fix but soon realized by looking at the history of the changes, that the bug was probably introduced while working around previous bugs and so minimizing the amount of change was wiser.

Agree that duplicating the 'char >= 8' logic is "unclean" but THAT is part of the bugfix. Resetting the ptr was done for a fallback so we can read the number as an octal, but obviously and early exit was needed if the number wasn't and that combination was fatal when combined with the lack of error reporting.

Agree that moving the limit out of INT_MAX will be better as well which is why that change is included as an additional optional commit, which can't be cleanly backported though as asserts are only available in master.

One additional concern I have about this proposal is that setting s to INT_MAX on error instead of just stop ignoring the error is unusual for callers of read_number() and a little more risky long term as it leaves that variable with a bogus a potentially unsafe value.

NWilson commented 2 weeks ago

(Apologies for being argumentative...)

Agree that duplicating the 'char >= 8' is "unclean" but THAT is the bugfix. Resetting the ptr was done for a fallback so we can read the number as an octal, but obviously and early exit was needed if the number wasn't and that combination was fatal.

The bug is that the existing check for "char >= 8" wasn't being reached. We don't need to add another copy of that condition, just make it so that control flow reaches the existing logic.

The existing code (correctly) swallows errors from read_number(). It just shouldn't branch on its return value, in order to truly swallow the errors.

My fix is indeed much smaller and complete (as it also includes optionally fixing the strange erroroffset).

I've made the single change to add a line of code between the read_number() call and checks on s:

// existing call to if (read_number()) unchanged. change INT_MAX → MAX_GROUP_NUMBER cosmetic only
+     s = INT_MAX;
// existing if (s < 10 || ...) unchanged

I believe this is the smallest possible fix.

carenas commented 2 weeks ago

revert max_value change in read_number to make result consistent with further check for GROUP_MAX (when needed), the following is a valid test that would be broken otherwise:

You test passes on my branch.

Apologies, was testing a modified version of your branch which failed because of my own changes and the fact that s was getting the bogus INT_MAX value instead.

I've added your test case to the PR.

Perl compatible changes should go to testinput1.

NWilson commented 2 weeks ago

Thanks @carenas, I've rebased, squashed, and moved the tests to test{input/output}1.

@PhilipHazel We've spent (perhaps) too long on this simple bugfix. Can we merge both PRs - this one first, then Carlo's PR second with just bugfix for the read_number() ptr offset.

Thank you Philip, by the way, for approving and documenting all these PRs so quickly! I'll be working on PCRE2 for maybe 1-2 weeks, and it will all quieten down again.

carenas commented 2 weeks ago

We've spent (perhaps) too long on this simple bugfix. Can we merge both PRs - this one first, then Carlo's PR second with just bugfix for the read_number() ptr offset.

There is no cherry-pick of commits either and of course the other PR will have conflicts if this one gets merged, but I will rebase it on top and repurpose it as a "fixup" PR so Philip can apply it cleanly then, as needed.

PhilipHazel commented 2 weeks ago

Oh dear, there's been a misunderstanding that I didn't pick up. The tests that fail should be in test 2, not test 1, because the error messages given by PCRE and Perl are different. The maint/RunPerlTest script shows up the difference. I will fix it. Test 1 is for tests that produce the same output from pcre2test and perltest.sh.

carenas commented 2 weeks ago

Thanks and sorry for the misunderstandings and extra work

NWilson commented 2 weeks ago

Thank you @PhilipHazel and @carenas for the fixups! I'm doing my best to match the style & idioms as I observe them.

At least we're allowed C99 :)