PCRE2Project / pcre2

PCRE2 development is now based here.

Implement pre-processed character range list caching #488

Closed zherczeg closed 1 week ago

zherczeg commented 1 week ago

The new method for processing ranges allows some optimizations in the code, e.g. the xclass processing can return early. Furthermore, the data set created for each range is cached, and the cache is reused during the second pass of bytecode generation instead of being rebuilt.
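To illustrate the caching idea, here is a minimal sketch in C. It is not the code from this patch: `range`, `range_list`, `range_cache`, `cache_store` and `cache_next` are hypothetical names, and the sketch assumes both passes visit the character classes in the same order, so the second pass can take the stored lists back sequentially.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical types for illustration; none of these names come from PCRE2. */
typedef struct { uint32_t lo, hi; } range;

typedef struct {
  range *ranges;   /* sorted, non-overlapping, non-adjacent */
  size_t count;
} range_list;

#define CACHE_CAP 16   /* fixed capacity keeps the sketch short */

typedef struct {
  range_list lists[CACHE_CAP];
  size_t stored;   /* lists produced during pass 1 */
  size_t next;     /* next list to hand out during pass 2 */
  size_t builds;   /* how many times the expensive step ran */
} range_cache;

static int cmp_range(const void *a, const void *b) {
  const range *x = a, *y = b;
  return (x->lo > y->lo) - (x->lo < y->lo);
}

/* The "expensive" step: sort the input and merge overlapping or
   adjacent ranges in place. */
static range_list preprocess(const range *in, size_t n) {
  range_list out = { NULL, 0 };
  if (n == 0) return out;
  out.ranges = malloc(n * sizeof *out.ranges);
  if (out.ranges == NULL) return out;
  memcpy(out.ranges, in, n * sizeof *in);
  qsort(out.ranges, n, sizeof *out.ranges, cmp_range);
  size_t w = 0;   /* index of the last merged range written so far */
  for (size_t r = 1; r < n; r++) {
    range *last = &out.ranges[w];
    range cur = out.ranges[r];
    if (cur.lo <= last->hi || cur.lo == last->hi + 1) {
      if (cur.hi > last->hi) last->hi = cur.hi;
    } else {
      out.ranges[++w] = cur;
    }
  }
  out.count = w + 1;
  return out;
}

/* Pass 1: build the list once and remember it. */
static const range_list *cache_store(range_cache *c, const range *in, size_t n) {
  assert(c->stored < CACHE_CAP);
  c->lists[c->stored] = preprocess(in, n);
  c->builds++;
  return &c->lists[c->stored++];
}

/* Pass 2: return the stored list instead of preprocessing again. */
static const range_list *cache_next(range_cache *c) {
  assert(c->next < c->stored);
  return &c->lists[c->next++];
}

static void cache_free(range_cache *c) {
  for (size_t i = 0; i < c->stored; i++) free(c->lists[i].ranges);
  c->stored = c->next = 0;
}
```

In this sketch the first pass calls `cache_store` once per class and the second pass calls `cache_next` in the same order, so the sort-and-merge step runs once per class rather than twice.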

zherczeg commented 1 week ago

Timing. Old:

  re> /[\x{100}-\x{400}],[\x{100}-\x{300}],[\x{200}-\x{600}]/i,utf
Compile time  70.0491 microseconds

New:

  re> /[\x{100}-\x{400}],[\x{100}-\x{300}],[\x{200}-\x{600}]/i,utf
Compile time  42.9079 microseconds

Actually, I wanted this caching from the beginning, but the original patch was already complex enough, so I decided to do it later.

zherczeg commented 1 week ago

The JIT changes are just a simplification of the compiler code. Their effect on performance is negligible, but the code looks better, so it is worth it.

zherczeg commented 1 week ago

I want to add logarithmic (binary) search over the range list for the JIT, but I am not sure it is worth it for the interpreter. I can do it, but is it worth the effort? Good question.
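For reference, a logarithmic lookup over a sorted, non-overlapping range list could look like the sketch below. It is only an illustration of the technique, not code from PCRE2 or its JIT; `char_range` and `in_ranges` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type: an inclusive code point range. */
typedef struct { uint32_t lo, hi; } char_range;

/* Binary search: r must be sorted by lo and contain no overlapping
   ranges. Takes O(log n) comparisons instead of a linear scan. */
static bool in_ranges(const char_range *r, size_t n, uint32_t c) {
  assert(r != NULL || n == 0);
  size_t left = 0, right = n;   /* search window is [left, right) */
  while (left < right) {
    size_t mid = left + (right - left) / 2;
    if (c < r[mid].lo)
      right = mid;              /* c lies below this range */
    else if (c > r[mid].hi)
      left = mid + 1;           /* c lies above this range */
    else
      return true;              /* lo <= c <= hi */
  }
  return false;
}
```

Whether this beats a linear scan depends on how many ranges a class contains. For the short lists that typical patterns produce the difference is probably small, which is the trade-off in question for the interpreter.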

zherczeg commented 1 week ago

Note: on EBCDIC systems, when UTF is disabled, we don't optimize ranges. I suspect EBCDIC (without UTF) is only used in 8-bit mode anyway, so it should not be a problem.

PhilipHazel commented 1 week ago

Yes, EBCDIC is purely an 8-bit encoding. It shouldn't ever involve XCLASS.

zherczeg commented 1 week ago

It seems we don't test EBCDIC. I am not even sure it is possible. Anyway, I have added a #error where the support needs to be added. Should be an easy task, but without testing it is not worth much.

carenas commented 1 week ago

Do we also need updates to HACKING? Ze'ev, on the mailing list, has access to an EBCDIC system and maintains the port for it, and might be interested, but AFAIK will need this merged and a snapshot to do so.

zherczeg commented 1 week ago

My plan is also to start a discussion about EBCDIC support after this work has landed. I have added some words to HACKING, but I am not a native speaker, so feel free to improve it.

zherczeg commented 1 week ago

This should be enough for one patch. Could you check it?