PCRE2Project / pcre2

PCRE2 development is now based here.
Other
918 stars 191 forks source link

Don't write code during size computation #547

Closed zherczeg closed 2 weeks ago

zherczeg commented 2 weeks ago

This patch is in the top of #546

When I worked on the other patch, I was thinking that writing unknown amount of data when the length is computed can be an issue. It looks like fuzzers already caught this issue, so here is a patch. Can be combined with #546, but can be independent form it.

Adding a test is difficult because COMPILE_WORK_SIZE is (3000*LINK_SIZE) and a really huge test is needed. What shall we do? Add a really huge test? E.g. /[\x{200}&&\x{200}&&....\x{200}]/alt_extended_class

zherczeg commented 2 weeks ago

I think the first patch is closer to landing, since all requests were done. This one, however, is not necessary in its best form. Maybe we could do it in a nicer way, and testing is also a question. The new asserts do some testing, but is it enough?

NWilson commented 2 weeks ago

Ouch. During the lengthptr != NUL phase, is in necessary to completely avoid writing any *code++ = ... assignments?

It appeared that other places in the compiler were unconditionally writing data out, and I'd uneasily just accepted that. Does it allocate an up-front amount of space equal to the size of the parsed input, so that if you have 10 META codes, there'll be space for 10 OP codes, and you only need to make the OP write conditional for cases where the compiler inflates the output?

PCRE2 triggers tons of lint-rule violations in Microsoft's in-house linters, due to widespread use of writes to pointers that don't have attached length information (for example, not checking each and every write to code against a code_end terminator). All writable buffers do have a maximum legal write, so the linter is not happy to see any writes at all to code pointers that are passed in without a code_size or code_end limit. Many of these are, of course, false positives (grr linters), because of the system of checking the required length with lengthptr, then allocating a buffer of known sufficient size.

Adding a test is difficult because COMPILE_WORK_SIZE is (3000*LINK_SIZE) and a really huge test is needed. What shall we do? Add a really huge test? E.g. /[\x{200}&&\x{200}&&....\x{200}]/alt_extended_class

Sounds good to just write a big test. 3000 characters isn't going to break the test suite, I hope?

PhilipHazel commented 2 weeks ago

Ouch. During the lengthptr != NUL phase, is in necessary to completely avoid writing any *code++ = ... assignments?

Luckily, no. There is a block of storage available, but the the idea is that after each "item" is written and its length ascertained, the pointer is reset to the start of the block. I forget the details offhand, but it is supposed to be clever enough to stay within the allotted memory. (It's over a decade since I implemented this, probably nearer two, so the details are no longer in mind. I'd have to read the code to refresh....)

zherczeg commented 2 weeks ago

The idea is that the majority of opcodes have a fixed size. PCRE2 uses a stack allocated buffer, where you can write them during length computation. Then code-buffer_start is added to length, and code=buffer_start resets the buffer. So far only xclass was the only exception. It writes its beginning and end to the buffer, but there can be any number of internal tokens. These are not written to the buffer during length computation. And now eclass. I hope it explains how the system works, I think it is quite smart. The 3000*LINK_SIZE stack buffer is too much for me though. We could also add more asserts as well.

https://github.com/PCRE2Project/pcre2/blob/master/src/pcre2_compile.c#L10210

zherczeg commented 2 weeks ago

Patch is updated. I still don't like it. It has no test, and maybe LINK_SIZE checks needed as well. The PCRE2_ASSERT(lengthptr == NULL || code == *pcode); could be incorrect, maybe the original value of pcode should be saved in a local in debug mode.

NWilson commented 2 weeks ago

Thank you @zherczeg and @PhilipHazel for explaining. That makes good sense, and I see the need to fix it for ECLASS, which does indeed write an unbounded amount of data to its opcode.

zherczeg commented 2 weeks ago

I am thinking about a possible test. It would be a huge test with a repeated character sequence. Since the asserts check the reset mechanic, it would not add more to the existing tests. The compile_class_nested is recursive, so I cannot really simplify the checks there as well.

I have no better idea to fix this, but I would be happy if future code changes would make this code nicer as a side effect.

NWilson commented 2 weeks ago

Agreed. The fix looks good, but I'll try and reduce the complexity when I optimize away the recursive descent.

We should have tests of the form: [a--a--a--a--a ...] and [^[^a]--[^a]--...], and also of the form [^[^a--a][^a--a]...].

zherczeg commented 2 weeks ago

Test added. Without hte patch, it fails with: Failed: error 152 at offset 5486: internal error: overran compiling workspace

zherczeg commented 2 weeks ago

Is there any changes needed for this patch?

carenas commented 2 weeks ago

Is there any changes needed for this patch?

I think is ready and doubly "approved"