Improve caseless character range processing when utf is enabled

zherczeg commented 2 months ago

Currently when utf caseless matching is requested, each character in a class range is checked one-by-one to find their other cases. I always thought this is inefficient, even if it is only done by the parser.

To speed things up, I have added a data structure, which contains ranges where characters have no other cases. The ranges have a minimum size. The size in uint32_t units with different minimum sizes (the number of ranges is half of that size):

minimum size 4: 124; /* 1111114 characters */
minimum size 8: 74; /* 1110983 characters */
minimum size 16: 60; /* 1110911 characters */

I choose 8 at the end, because only 74 values are needed, and these ranges cover 1110983 characters, which is 99.7% of all utf characters.

Performance improvement by the patch:

Original (-O3):

./pcre2test -t 100
PCRE2 version 10.45-DEV 2024-06-09 (8-bit)
  re> /[\x{100}-\x{100000}]/iB,utf
Compile time 8957.8600 microseconds
------------------------------------------------------------------
        Bra
        [KSks\xb5\xc5\xdf\xe5\xff\x{100}-\x{100000}]
        Ket
        End
------------------------------------------------------------------

New (-O3)

./pcre2test -t 100
PCRE2 version 10.45-DEV 2024-06-09 (8-bit)
  re> /[\x{100}-\x{100000}]/iB,utf
Compile time  83.0600 microseconds
------------------------------------------------------------------
        Bra
        [KSks\xb5\xc5\xdf\xe5\xff\x{100}-\x{100000}]
        Ket
        End
------------------------------------------------------------------

The new one is 107 times faster than the old, I think this is a nice speedup.

zherczeg commented 2 months ago

I have some ideas to further improve this patch.

zherczeg commented 2 months ago

I have updated the code. @PhilipHazel what do you think?

PhilipHazel commented 2 months ago

Very nice idea! I am happy with it. Is it ready for merging now?

zherczeg commented 2 months ago

It is ready. The code is fail safe, although it would be great if I understand the purpose of the extra increase.

PCRE2Project / pcre2

Improve caseless character range processing when utf is enabled #477