PCRE2Project / pcre2

PCRE2 development is now based here.

Implement pre-processed character range list caching #488

Closed zherczeg closed 2 months ago

zherczeg commented 2 months ago

The new method for processing ranges allows some optimizations in the code, e.g. the xclass processing can return early. Furthermore, the data set created for each range is cached, and the second pass of byte code generation reuses the cached data instead of computing it again.
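A minimal sketch of the caching idea, for readers who want something concrete. This is not the code from the patch: the type names, the fixed-size table, and the use of a class offset as the cache key are all assumptions made for illustration, and `build_ranges` is only a stand-in for the real pre-processing step.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define CACHE_CAPACITY 16  /* hypothetical fixed size, enough for a sketch */

typedef struct {
  size_t class_offset;  /* hypothetical key: where the class starts in the pattern */
  size_t count;         /* number of inclusive [start, end] pairs */
  uint32_t *pairs;      /* start0, end0, start1, end1, ... */
} cached_ranges;

typedef struct {
  cached_ranges entries[CACHE_CAPACITY];
  size_t used;
  size_t builds;        /* how many times the expensive step really ran */
} range_cache;

/* Stand-in for the expensive pre-processing; here it only copies the input. */
static uint32_t *build_ranges(const uint32_t *pairs, size_t count)
{
  uint32_t *copy = malloc(2 * count * sizeof(uint32_t));
  if (copy != NULL) memcpy(copy, pairs, 2 * count * sizeof(uint32_t));
  return copy;
}

/* First call for a class builds and stores the data; later calls reuse it. */
const cached_ranges *get_ranges(range_cache *cache, size_t class_offset,
                                const uint32_t *pairs, size_t count)
{
  size_t i;
  cached_ranges *entry;

  for (i = 0; i < cache->used; i++)
    if (cache->entries[i].class_offset == class_offset)
      return &cache->entries[i];  /* cache hit, e.g. during the second pass */

  assert(count > 0 && cache->used < CACHE_CAPACITY);
  entry = &cache->entries[cache->used];
  entry->pairs = build_ranges(pairs, count);
  if (entry->pairs == NULL) return NULL;
  entry->class_offset = class_offset;
  entry->count = count;
  cache->used++;
  cache->builds++;
  return entry;
}

/* Release everything once both passes are done. */
void free_range_cache(range_cache *cache)
{
  size_t i;
  for (i = 0; i < cache->used; i++) free(cache->entries[i].pairs);
  cache->used = 0;
}
```

Calling `get_ranges` twice with the same offset returns the same entry and leaves `builds` at 1, which is the saving the second pass would see under these assumptions.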

zherczeg commented 2 months ago

Timing. Old:

  re> /[\x{100}-\x{400}],[\x{100}-\x{300}],[\x{200}-\x{600}]/i,utf
Compile time  70.0491 microseconds

New:

  re> /[\x{100}-\x{400}],[\x{100}-\x{300}],[\x{200}-\x{600}]/i,utf
Compile time  42.9079 microseconds

Actually I wanted this caching from the beginning, but the original patch was complex enough already, so I decided to do it later.

zherczeg commented 2 months ago

The JIT changes are just a compile simplification. Their effect is negligible, but the code looks better, so it is worth it.

zherczeg commented 2 months ago

I want to add logarithmic search to the JIT, but I am not sure it is worth doing for the interpreter. I can do it, but is it worth it? Good question.
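For reference, a logarithmic search over a range list could look roughly like the sketch below. It assumes the list is stored as sorted, non-overlapping, inclusive `[start, end]` pairs; that layout and the name `in_ranges` are assumptions for illustration, not the format used by the JIT or the interpreter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* ranges holds count pairs: start0, end0, start1, end1, ...
   Pairs are assumed sorted ascending, non-overlapping, with inclusive ends. */
bool in_ranges(const uint32_t *ranges, size_t count, uint32_t c)
{
  size_t lo = 0, hi = count;  /* half-open interval [lo, hi) of pair indices */

  assert(ranges != NULL || count == 0);
  while (lo < hi)
    {
    size_t mid = lo + (hi - lo) / 2;
    if (c < ranges[2 * mid])
      hi = mid;               /* c is below this pair: search the left half */
    else if (c > ranges[2 * mid + 1])
      lo = mid + 1;           /* c is above this pair: search the right half */
    else
      return true;            /* start <= c <= end */
    }
  return false;
}
```

Each iteration halves the number of candidate pairs, so a class with n ranges needs about log2(n) comparisons instead of n. For short lists a linear scan may still be just as fast, which is the trade-off the question above is about.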

zherczeg commented 2 months ago

Note: on EBCDIC systems, when UTF is disabled, we don't optimize ranges. I suspect EBCDIC (without UTF) is only used in 8-bit mode anyway, so it should not be a problem.

PhilipHazel commented 2 months ago

Yes, EBCDIC is purely an 8-bit encoding. It shouldn't ever involve XCLASS.

zherczeg commented 2 months ago

It seems we don't test EBCDIC. I am not even sure it is possible. Anyway, I have added a #error where the support needs to be added. Should be an easy task, but without testing it is not worth much.

carenas commented 2 months ago

Do we also need updates to HACKING? Ze'ev on the mailing list has access to an EBCDIC system and maintains the port for it, so he might be interested, but AFAIK he will need this merged and a snapshot to do so.

zherczeg commented 2 months ago

My plan is also to start a discussion about EBCDIC support after this work has landed. I have added some words to HACKING, but I am not a native speaker, so feel free to improve it.

zherczeg commented 2 months ago

This should be enough for one patch. Could you check it?