Improve character classes

PCRE2Project / pcre2

PCRE2 development is now based here.

Improve character classes #474

Closed zherczeg closed 2 months ago

zherczeg commented 2 months ago

This is the first patch in an effort to rework character classes. It does not do much yet, because it does not handle caseless matching.

When a class contains characters above 255 (the bitset is perfect for ASCII / EBCDIC), the patch sorts the ranges and merges them where possible. The current code is careful not to increase code size, but this will change later, and it will probably be the most challenging part in the future. My idea is that the meta code for classes will be stored elsewhere, and only a reference will be stored in the original pattern.
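To make the sort-and-merge step concrete, here is a minimal sketch in C. It is not the code from the patch: the `cp_range` type and the `merge_ranges` function are hypothetical names made up for illustration, and the sketch assumes ranges are stored as inclusive code point pairs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical type (not from PCRE2): an inclusive code point range. */
typedef struct {
  uint32_t start;
  uint32_t end;
} cp_range;

/* qsort comparator: order by start, then by end. */
static int cmp_range(const void *a, const void *b)
{
  const cp_range *x = (const cp_range *)a;
  const cp_range *y = (const cp_range *)b;
  if (x->start != y->start) return (x->start < y->start) ? -1 : 1;
  if (x->end != y->end) return (x->end < y->end) ? -1 : 1;
  return 0;
}

/* Sorts the ranges in place, merges overlapping or adjacent ones,
   and returns the number of ranges that remain. */
size_t merge_ranges(cp_range *ranges, size_t count)
{
  size_t out = 0;
  size_t i;

  assert(ranges != NULL || count == 0);
  if (count == 0) return 0;

  qsort(ranges, count, sizeof(cp_range), cmp_range);

  for (i = 1; i < count; i++) {
    /* Merge when the next range overlaps the current one or starts
       right after it. The UINT32_MAX check avoids overflow in end + 1. */
    if (ranges[out].end == UINT32_MAX ||
        ranges[i].start <= ranges[out].end + 1) {
      if (ranges[i].end > ranges[out].end) ranges[out].end = ranges[i].end;
    } else {
      ranges[++out] = ranges[i];
    }
  }
  return out + 1;
}
```

Under these assumptions, a class written as twelve separate characters from \x{200} to \x{20b} collapses into the single range \x{200}-\x{20b}, and overlapping ranges such as \x{200}-\x{400} and \x{300}-\x{350} become one entry.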

The purpose of this patch is to open a discussion about what we should do with classes, and whether optimizing them in any way is worth it.

zherczeg commented 2 months ago

Some statistics with -O3:

Binary size: old: 2020664, new: 2021096. That is 432 bytes bigger.

Compilation time is slower:

/[\x{200}-\x{400}\x{1000}-\x{1600}\x{10000}-\x{10800}]/utf
 Old: 5.9259 microseconds
 New: 6.3427 microseconds

/[\x{100}-\x{400}]+/i,utf
 Old: 22.0599 microseconds
 New: 31.3692 microseconds

Runtime is a bit better:

/a+[\x{100}-\x{400}]/i,utf
 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
 Old: 9.6301 microseconds
 New: 9.5551 microseconds

zherczeg commented 2 months ago

@PhilipHazel, probably only you can check this code. The gain at this point is small, but it is possible to extend the code with more features in the future.

zherczeg commented 2 months ago

The difference is bigger with a better test (I had forgotten about the auto-possessify optimization):

/.+[\x{200}\x{201}\x{202}\x{203}\x{204}\x{205}\x{206}\x{207}\x{208}\x{209}\x{20a}\x{20b}]/s,utf
 Old: 13.0699 microseconds
 New: 8.3277 microseconds

JIT is not really affected; the test is probably too simple. In any case, the new method should be better for corner cases. It is also possible to optimize the code further, but the patch is already large enough.
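One plausible reading of the interpreter result above is that a class listed as twelve single characters costs up to twelve checks per tested character, while the merged form costs one. The sketch below is a simplified model of that effect, not PCRE2's actual matcher: `class_item` and `in_class_linear` are hypothetical names, and the linear scan over the item list is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type (not from PCRE2): one class item, stored as an
   inclusive range. A single character has start == end. */
typedef struct {
  uint32_t start;
  uint32_t end;
} class_item;

/* Linear scan over the item list. Returns 1 if c is in the class,
   0 otherwise. *steps receives the number of items examined. */
int in_class_linear(const class_item *items, size_t count,
                    uint32_t c, size_t *steps)
{
  size_t i;

  assert(steps != NULL);
  *steps = 0;
  for (i = 0; i < count; i++) {
    (*steps)++;
    if (c >= items[i].start && c <= items[i].end) return 1;
  }
  return 0;
}
```

In this model, testing a non-member such as `a` against twelve single-character items examines all twelve, whereas the merged range \x{200}-\x{20b} is rejected after one check. A pattern that starts with a greedy `.+` retries the class at many positions while backtracking, so the per-test saving adds up.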

zherczeg commented 2 months ago

Conflicts resolved.