Move character lists data before the byte code in a pattern

PCRE2Project / pcre2

PCRE2 development is now based here.

Other

918 stars 191 forks source link

Move character lists data before the byte code in a pattern #540

Closed zherczeg closed 3 weeks ago

zherczeg commented 3 weeks ago

This patch moves the character lists to the end of the pattern, and only a reference is stored. This way repeating, repeat detection, and serializing works without modifications. Also, for repeating character sets inside brackets can save a lot of space. The disadvantage is that an extra variable is needed to be maintained, which has a very low overhead, but still an overhead. My first idea was storing pointers, but that is not serialization friendly, although not that hard to implement.

zherczeg commented 3 weeks ago

Btw the original problem was that repeating character classes can break alignment rules unfortunately. It would not be bad if (PATTERN){n,m} could be implemented without repeating the (PATTERN).

zherczeg commented 3 weeks ago

I tried various variations for this code, but the engine always have a feature which blocked them. In the end the character list data is stored before the byte code. The cost is an extra argument for xclass, and an extra member in the real code. Computing the byte code start is simplified.

zherczeg commented 3 weeks ago

Context: a fuzzer got a comparing uninitialized memory error, when a subpattern is repeated. The reason is that space for alignemnt was not initialized. Then it turned out that repeating character lists may break alignment, so they need to be moved out from the byte code and stored elsewhere, and only a reference to them is stored in the byte code, which can be repeated. So this patch fixes a complicated and serious error in the new code.

NWilson commented 3 weeks ago

Interesting. I would have thought the simplest thing would simply be to avoid adding alignment padding entirely, and simply have READ8, READ16, READ32 macros to do the unaligned reads using bit-ops. Just let the uint32 values lie wherever they fall naturally in the bytecode, and read them in a safe way.

Compilers are also super-good at optimising this pattern: uint32_t value; memcpy(&value, unaligned_address, 4). It's a good alternative to #define READ32(addr) (addr[0] << 24) | (addr[1] << 16) | (addr[2] << 8) | addr[3].

NWilson commented 3 weeks ago

It's extra complexity to move data out of the main sequence of bytecode. And it's extra complexity to introduce any weird requirements on the bytecode, that would prevent it being memmove'd around. The bytecode is a PCRE2_SPTR, so its contents should be read and used, simply assuming that it has the PCRE2_SPTR base alignment. Adding padding so that certain slices inside it have stricter alignment then means that the code can't be copied and moved, which is unfortunate.

zherczeg commented 3 weeks ago

The bytecode header structure always needed at least 32 bit alignment, so you cannot move the byte code to any location. Unaligned accesses might be costly.

zherczeg commented 3 weeks ago

I have updated the patch

NWilson commented 3 weeks ago

This will probably make a merge conflict with my PR #523, so it's a race to see which is merged first!

zherczeg commented 3 weeks ago

@PhilipHazel will decide. I have no problem with doing the merge.

zherczeg commented 3 weeks ago

I changed the extra 16 bit setting to debug only, and added some asserts.