NWilson opened 1 day ago
What is the problem with `PCRE2_MATCH_INVALID_UTF` / `PCRE2_JIT_INVALID_UTF`?
`PCRE2_MATCH_INVALID_UTF` won't actually match any invalid UTF (i.e. the pattern won't read across it). The `valid_utf()` function is still used to check that the input is valid. If you pass in a 100-byte string with a bad byte at offset 50, all it does is "fix" the input to be only 50 bytes long (truncating at the bad byte). The match function has no support at all for scanning across bad bytes, because it uses macros like `GETCHAR` which assume the input is valid.
For people who actually want to do lenient processing of messed-up data with PCRE2, this is not especially useful.
For me... the №1 reason to do this change ("always validate UTF") is because I'm scared of the safety implications of the non-validating UTF modes. This is also the reason I opened the ticket "remove `lengthptr != NULL` pass" - performance is nice, but safety is better.
The original request came from grep, because it uses memory-mapped files and cannot validate them. We discussed the expected operation (very strict, but the concept is easy to understand), and that is what JIT does. I would be surprised if it stopped at the first invalid character and nobody had noticed so far. @PhilipHazel added the interpreter support later.
JIT tests: https://github.com/PCRE2Project/pcre2/blob/master/src/pcre2_jit_test.c#L1873
This one for example parses many invalid bytes: https://github.com/PCRE2Project/pcre2/blob/master/src/pcre2_jit_test.c#L1996
It's documented in pcre2unicode:

> In this mode [`PCRE2_MATCH_INVALID_UTF`], an invalid code unit sequence in the subject never matches any pattern item. It does not match dot, it does not match `\p{Any}`, it does not even match negative items such as `[^X]`. A lookbehind assertion fails if it encounters an invalid sequence while moving the current point backwards. In other words, an invalid UTF code unit sequence acts as a barrier which no match can cross. You can also think of this as the subject being split up into fragments of valid UTF, delimited internally by invalid code unit sequences. The pattern is matched fragment by fragment. The result of a successful match, however, is given as code unit offsets in the entire subject string in the usual way. There are a few points to consider: [...]
It really does not seem to be the case that you can write a pattern `^foo.*bar$` and have the `.*` skip over invalid UTF-8.
Yes, the `.*` will not cross it, but an unanchored pattern can match anywhere.
I ran the raw `!.+` pattern on `\xff#%#\xff!##\xff` (a C string, not pcre2test input), with invalid UTF enabled, and both the interpreter and JIT returned 5-8 as expected. Is this ok?
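For concreteness, here is a minimal sketch of that experiment using the public 8-bit PCRE2 C API (my own reconstruction, not the actual test code); it should report the 5-8 offsets quoted above:

```c
#define PCRE2_CODE_UNIT_WIDTH 8
#include <pcre2.h>
#include <stdio.h>

int main(void) {
  int errcode;
  PCRE2_SIZE erroffset;

  /* Compile "!.+" in UTF mode, but allow invalid UTF in the subject. */
  pcre2_code *re = pcre2_compile((PCRE2_SPTR)"!.+", PCRE2_ZERO_TERMINATED,
                                 PCRE2_UTF | PCRE2_MATCH_INVALID_UTF,
                                 &errcode, &erroffset, NULL);
  if (re == NULL) return 1;

  /* Subject bytes: FF '#' '%' '#' FF '!' '#' '#' FF  (9 code units) */
  PCRE2_SPTR subject = (PCRE2_SPTR)"\xff#%#\xff!##\xff";
  pcre2_match_data *md = pcre2_match_data_create_from_pattern(re, NULL);

  int rc = pcre2_match(re, subject, 9, 0, 0, md, NULL);
  if (rc > 0) {
    PCRE2_SIZE *ov = pcre2_get_ovector_pointer(md);
    /* Expect 5-8: the "!##" fragment, not crossing the surrounding FF bytes. */
    printf("matched at %zu-%zu\n", (size_t)ov[0], (size_t)ov[1]);
  }

  pcre2_match_data_free(md);
  pcre2_code_free(re);
  return 0;
}
```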
That seems to be what the documentation says - match the `!##` but not the `\xff`.
But I would have expected that what people want is for `.` to match the `\xff` as well, so that patterns can cross over bad characters if they appear in the middle of the line.
The introduction of NO-UTF-CHECK was prompted by applications that search strings many thousands of bytes long multiple times. UTF-checking very long strings repeatedly can use significant resources. Even a global search algorithm that carries on looking for more matches once it has found one doesn't need to check the UTF again. I think there will be complaints if UTF checking is enforced for every call of `pcre2_match()`.
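That usage pattern looks roughly like the following sketch (mine, not from the thread; `count_matches` is a made-up helper): the first `pcre2_match()` call does the one UTF check, and `PCRE2_NO_UTF_CHECK` is passed only on the follow-up calls over the already-validated subject.

```c
#define PCRE2_CODE_UNIT_WIDTH 8
#include <pcre2.h>
#include <stdint.h>

/* Count all matches of a global search, validating the UTF only once. */
static int count_matches(pcre2_code *re, PCRE2_SPTR subject, PCRE2_SIZE length) {
  pcre2_match_data *md = pcre2_match_data_create_from_pattern(re, NULL);
  PCRE2_SIZE start = 0;
  uint32_t options = 0;                  /* first call: full UTF check */
  int count = 0;

  while (pcre2_match(re, subject, length, start, options, md, NULL) > 0) {
    PCRE2_SIZE *ov = pcre2_get_ovector_pointer(md);
    count++;
    if (ov[1] == ov[0]) break;           /* empty match: real code must advance
                                            by one full character (see pcre2demo) */
    start = ov[1];
    options = PCRE2_NO_UTF_CHECK;        /* the subject was validated above */
  }

  pcre2_match_data_free(md);
  return count;
}
```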
There have been (as far as I know) no complaints about the behaviour of MATCH_INVALID_UTF since it was introduced. The requirement that was expressed at the time was for finding valid UTF fragments among random data. That is, there was no request for a single match to be able to cross over invalid sequences.
You need to add a lot of checks in that case. There are many ways a UTF sequence can be invalid, e.g. `\xc0\x80` is invalid because `\x00` can be encoded as a single-byte character. You can also check JIT speed with and without the flag, but don't forget that JIT always does a lot of extra optimizations, even for invalid UTF fragments, that the interpreter cannot do.
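As a small illustration of just that one error class (my own sketch, not PCRE2 code): any two-byte sequence with a 0xC0 or 0xC1 lead byte is overlong, because the value it encodes is below 0x80 and must be encoded as a single byte.

```c
#include <stdbool.h>
#include <stdint.h>

/* One of the many checks a validator needs: C0/C1 lead bytes are always
   overlong two-byte encodings, e.g. C0 80 would decode to U+0000. */
static bool is_overlong_2byte(uint8_t lead, uint8_t cont) {
  return (lead == 0xC0 || lead == 0xC1) && (cont & 0xC0) == 0x80;
}
```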
Like @zherczeg I am skeptical about the possibility of cheap on the fly testing. There are 21 different UTF-8 error codes in pcre2.h. But if you can figure out a cunning way to do it with very little cost ...
> The original request came from grep, because it uses memory-mapped files and cannot validate them.
GNU grep doesn't use mmap anymore, but it does indeed do its "own" UTF validation, which is apparently faster than ours (even though I think it breaks with the logic in a few places).
> But I would have expected that what people want is for `.` to match the `\xff`

They can always get that if UTF mode is NOT enabled.
The problem with invalid UTF is the same as operator precedence in character classes: it does not matter what you do, there will be people who want it differently. For example, is `\xc0\x8f` an invalid (overlong) form of `\x0f`, or two independent "characters"? What is a caseless `\x80`? Does it match `\P{any}`? How do (caseless) backreferences match invalid UTF garbage? Many questions, no "good" choices, only more problems.
I did make a reasonable suggestion in my first post: "Any unpaired or bad bytes are just silently treated as U+FFFD." This is a very popular and canonical treatment of bad bytes, and gives them full working Unicode semantics.
But I agree... there are many open questions, and it's unclear whether we'd even be able to do this task.
Maybe after 10.45 (or 10.46...) I'll revisit this.
Currently, there's something that mildly scares me in the code. We have macros for reading UTF-8 chars (which are all inlined into every callsite... hmm). But worse, the macros don't validate the chars and can do out-of-bounds memory access. If you have a string which ends in 0x80, then `GETCHAR()` will merrily read over the end of the string. This is "OK" because we validate upfront for both compile & match. That is... unless the clients tell us they are really, really sure that the input is valid! (If they lie to us, then undefined behaviour / memory badness is allowed to occur.)
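A simplified illustration of the hazard (my own sketch, not the real `GETCHAR` macro): a trusting decoder derives the sequence length from the lead byte and walks past the end of the buffer if the subject ends with an incomplete sequence.

```c
#include <stdint.h>

/* Non-validating UTF-8 read: trusts the lead byte, never checks the end. */
static uint32_t read_char_trusting(const uint8_t *p) {
  uint32_t c = p[0];
  if (c < 0x80) return c;                          /* ASCII */
  int extra = (c >= 0xf0) ? 3 : (c >= 0xe0) ? 2 : 1;
  c &= (uint32_t)0x3f >> extra;                    /* payload bits of the lead byte */
  for (int i = 1; i <= extra; i++)
    c = (c << 6) | (p[i] & 0x3f);                  /* may read past the buffer if the
                                                      sequence is truncated */
  return c;
}
```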
Wild idea: it may be quicker to just always validate! Running `valid_utf()` first, and then later doing a non-validating decode, must surely be slower than simply doing a validating decode later? (Or... is the concern that regexes will do backtracking, and decode the input string many times? And so hoisting the validation to the beginning is worthwhile anyway?) I still reckon that there's hardly anything to be gained by doing a non-validating UTF decode - it's practically the same cost to validate at the same time.

BUT! We can now add a new option, `PCRE2_PERMISSIVE_UTF`, which would allow PCRE2 to process invalid UTF input. Any unpaired or bad bytes are just silently treated as U+FFFD. This would require, as a prerequisite, having the match code be able to handle invalid UTF input.
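A rough sketch of what "treat bad bytes as U+FFFD" could look like at the decoder level (my own illustration of the proposal; `PCRE2_PERMISSIVE_UTF` is only a proposed name, and this is not PCRE2's actual decoding code): on any error, consume one byte and yield U+FFFD, so the match engine always sees a well-formed character stream.

```c
#include <stdint.h>
#include <stddef.h>

/* Decode one character from p (avail bytes available). Invalid input
   consumes a single byte and is reported as U+FFFD. */
static uint32_t decode_permissive(const uint8_t *p, size_t avail, size_t *used) {
  uint32_t c = p[0];
  if (c < 0x80) { *used = 1; return c; }           /* ASCII */

  int extra;
  uint32_t min;                                    /* smallest value for this length */
  if      (c >= 0xc2 && c <= 0xdf) { extra = 1; min = 0x80;    c &= 0x1f; }
  else if (c >= 0xe0 && c <= 0xef) { extra = 2; min = 0x800;   c &= 0x0f; }
  else if (c >= 0xf0 && c <= 0xf4) { extra = 3; min = 0x10000; c &= 0x07; }
  else { *used = 1; return 0xFFFD; }               /* bad lead byte (incl. C0/C1) */

  if ((size_t)extra >= avail) { *used = 1; return 0xFFFD; }   /* truncated sequence */

  for (int i = 1; i <= extra; i++) {
    if ((p[i] & 0xc0) != 0x80) { *used = 1; return 0xFFFD; }  /* bad continuation */
    c = (c << 6) | (p[i] & 0x3f);
  }

  if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) {
    *used = 1; return 0xFFFD;                      /* overlong, out of range, surrogate */
  }
  *used = (size_t)extra + 1;
  return c;
}
```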
Currently, if you have a log file with some messed-up lines in it... tough, PCRE2 won't process it (😞). So I think there would be clients that might well be interested in having a non-judgemental UTF mode. Bad UTF exists, we can't change the world... just handle it gracefully.