Closed zherczeg closed 2 months ago
Some statistics with -O3:
Binary size: old: 2020664, new: 2021096 (432 bytes bigger).
Pattern compilation time is slower:
/[\x{200}-\x{400}\x{1000}-\x{1600}\x{10000}-\x{10800}]/utf
Old: 5.9259 microseconds
New: 6.3427 microseconds
/[\x{100}-\x{400}]+/i,utf
Old: 22.0599 microseconds
New: 31.3692 microseconds
Runtime is a bit better:
/a+[\x{100}-\x{400}]/i,utf
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
Old: 9.6301 microseconds
New: 9.5551 microseconds
@PhilipHazel, you are probably the only one who can review this code. The gain is small at this point, but the code can be extended with more features in the future.
The difference is bigger with a better test (I forgot about the auto-possessify optimization):
/.+[\x{200}\x{201}\x{202}\x{203}\x{204}\x{205}\x{206}\x{207}\x{208}\x{209}\x{20a}\x{20b}]/s,utf
Old: 13.0699 microseconds
New: 8.3277 microseconds
JIT is not really affected; the test is probably too simple. In any case, the new method should be better for corner cases. The code could be optimized further, but the patch is already large enough.
Conflicts resolved.
This is the first patch that aims to rework character classes. It does not do much yet, because it does not handle caseless matching.
When a class contains characters above 255 (the bitset is perfect for ASCII / EBCDIC), the patch sorts the ranges and merges them where possible. The current code is careful not to increase the code size, but this will change later. This will probably be the most challenging part in the future. My idea is to store the meta code for classes elsewhere and keep only a reference to it in the original pattern.
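For readers who have not looked at the diff, the sort-and-merge step can be sketched roughly as follows. This is only an illustration of the general technique, not the code in the patch: the `cp_range` type and the `merge_ranges` function are hypothetical names, and the sketch assumes the ranges are inclusive pairs of code points held in a plain array.

```c
#include <stdint.h>
#include <stdlib.h>
#include <stddef.h>

/* Hypothetical type: an inclusive range of code points. */
typedef struct {
    uint32_t start;
    uint32_t end;
} cp_range;

/* Order ranges by start, then by end. */
static int cmp_range(const void *a, const void *b)
{
    const cp_range *ra = a;
    const cp_range *rb = b;

    if (ra->start != rb->start)
        return ra->start < rb->start ? -1 : 1;
    if (ra->end != rb->end)
        return ra->end < rb->end ? -1 : 1;
    return 0;
}

/* Sort the ranges in place, merge overlapping or adjacent ones,
   and return the number of ranges that remain. */
static size_t merge_ranges(cp_range *r, size_t n)
{
    size_t out = 0;
    size_t i;

    if (n == 0)
        return 0;

    qsort(r, n, sizeof(cp_range), cmp_range);

    for (i = 1; i < n; i++) {
        /* Overlapping, or directly adjacent (start == previous end + 1).
           The subtraction is only evaluated when start > end, so it
           cannot wrap around. */
        if (r[i].start <= r[out].end || r[i].start - r[out].end == 1) {
            if (r[i].end > r[out].end)
                r[out].end = r[i].end;
        } else {
            r[++out] = r[i];
        }
    }
    return out + 1;
}
```

With this approach, the twelve single characters `\x{200}` to `\x{20b}` from the test above would collapse into one range, so a match needs a single comparison pair instead of twelve.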
The purpose of this patch is to open a discussion about what we should do with classes, and whether optimizing them in any way is worth it.