Closed zherczeg closed 3 weeks ago
Btw the original problem was that repeating character classes can break alignment rules unfortunately. It would not be bad if (PATTERN){n,m}
could be implemented without repeating the (PATTERN)
.
I tried various variations for this code, but the engine always have a feature which blocked them. In the end the character list data is stored before the byte code. The cost is an extra argument for xclass, and an extra member in the real code. Computing the byte code start is simplified.
Context: a fuzzer got a comparing uninitialized memory error, when a subpattern is repeated. The reason is that space for alignemnt was not initialized. Then it turned out that repeating character lists may break alignment, so they need to be moved out from the byte code and stored elsewhere, and only a reference to them is stored in the byte code, which can be repeated. So this patch fixes a complicated and serious error in the new code.
Interesting. I would have thought the simplest thing would simply be to avoid adding alignment padding entirely, and simply have READ8, READ16, READ32 macros to do the unaligned reads using bit-ops. Just let the uint32 values lie wherever they fall naturally in the bytecode, and read them in a safe way.
Compilers are also super-good at optimising this pattern: uint32_t value; memcpy(&value, unaligned_address, 4)
. It's a good alternative to #define READ32(addr) (addr[0] << 24) | (addr[1] << 16) | (addr[2] << 8) | addr[3]
.
It's extra complexity to move data out of the main sequence of bytecode. And it's extra complexity to introduce any weird requirements on the bytecode, that would prevent it being memmove'd around. The bytecode is a PCRE2_SPTR, so its contents should be read and used, simply assuming that it has the PCRE2_SPTR base alignment. Adding padding so that certain slices inside it have stricter alignment then means that the code can't be copied and moved, which is unfortunate.
The bytecode header structure always needed at least 32 bit alignment, so you cannot move the byte code to any location. Unaligned accesses might be costly.
I have updated the patch
This will probably make a merge conflict with my PR #523, so it's a race to see which is merged first!
@PhilipHazel will decide. I have no problem with doing the merge.
I changed the extra 16 bit setting to debug only, and added some asserts.
This patch moves the character lists to the end of the pattern, and only a reference is stored. This way repeating, repeat detection, and serializing works without modifications. Also, for repeating character sets inside brackets can save a lot of space. The disadvantage is that an extra variable is needed to be maintained, which has a very low overhead, but still an overhead. My first idea was storing pointers, but that is not serialization friendly, although not that hard to implement.