PCRE2Project / pcre2

PCRE2 development is now based here.
Other
918 stars 191 forks source link

Error offset should be advanced by one character for "[\d-z]" invalid range error #548

Closed NWilson closed 2 weeks ago

NWilson commented 2 weeks ago

The code does a 1-char lookahead for a hyphen, but then doesn't advance the pointer to consume the hyphen when returning the error.

Perl's error message (with "use warnings") does advance to just after the hyphen, so PCRE2 should match.

Fixes #545

NWilson commented 2 weeks ago

In Perl, I get the following diagnostic for the regex /[z-\p{Lu}]/:

False [] range "z-\p{Lu}" in regex; marked by <-- HERE in m/[z-\p{Lu} <-- HERE ]/

In general, @PhilipHazel, what is the correct place to put an erroroffset returned by pcre2_compile()? Should be it just before, or just after, the error? In Perl, the idea is that inserting "<-- HERE" into the string (at the error offset) makes sense, so that suggests the error offset should be pointing to the next (unread) character after the error.

But in quite a few places in PCRE2, there seems to be a convention that the erroroffset actually points to the bad character, which suggests inserting "HERE ->" at the error offset, when displaying it to the user.

Is there a consistent answer?

I have some additional fixes as requested by Carlo, but I'll push them when I know where the correct position of the offset is.

PhilipHazel commented 2 weeks ago

Is there a consistent answer?

Not in the sense of my having any "policy" about this. I guess I just returned wherever the pointer happened to be.

NWilson commented 2 weeks ago

Thank you @PhilipHazel. I have a branch where I've hacked up perltest.sh a little bit, to gather error offsets from Perl's errors and warnings, and use them in RunPerlTest to check all our errors in testoutput{2,5} match the error offsets from Perl.

It's going to be somewhat ... epic in scope to pull out all those errors though. So I'm parking that investigation.

In the absence of any policy, I'll just adjust the errors offsets in this PR for [\d ... \p{...} ... [:posix:]] in character classes to match Perl. The error offset will make sense if you insert "<-- HERE" at the offset position.

In future we can maybe make that adjustment for some other error cases, and try and produce the same offsets as Perl in as many cases as possible.

zherczeg commented 2 weeks ago

I don't remember anybody ever complained about error offsets, so whatever change is good for me. When the patch is ready, I will land it.

NWilson commented 2 weeks ago

Thank you. I've pushed with an updated commit message for Carlo; no other changes.

I'm satisfied with this set of changes for now.