zherczeg commented 2 months ago

Would be great to do something with caseless repetitions in xclass. Example:

  re> /[\pC\x{17b}\x{17c}]/i,utf,fullbincode
------------------------------------------------------------------
  0  21 Bra
  3     [\p{C}\x{17b}-\x{17c}\x{17b}-\x{17c}]
 21  21 Ket
 24     End
------------------------------------------------------------------

The question is where. During compilation, in the META phase, the buffer might be too small to have the proper ranges. In the byte code generation phase, computing it twice would be costly. Can we reallocate the buffer in the META phase, and do all caseless checks there? Maybe do all checks there, including the class to single character optimization.

carenas commented 2 months ago

Assume that by "META phase" you refer to parse_regex()?, eitherway, now that we are introducing a "rewrite phase" soon after that and that needs to reallocate the parsed pattern anyway, might make sense to do it there.

With that the patten provided will be "rewritten" as: "[\pC\x{17b}-\x{17c}]" before "compiling" into opcodes; alternatively pcre_study() could remove the duplication after just like it does the class to single character optimization.

Obviously to avoid duplicating work ALL case expansion for classes will need to be done at that phase and not when doing the "compiling" by treating the content of classes like a blob.

carenas commented 2 months ago

Also curious, where is this duplication problematic, and if is it worth doing any optimization to prevent it (specially if after compilation?), for example wouldn't the following also be technically more "efficient" by rewriting? (ex: [\p{Cn}\pC] => \pC)

PhilipHazel commented 2 months ago

I expect Zoltan is thinking that duplication is wasted effort at match time. All the handling of case-independence is done at compile time so that the match phase doesn't have to keep track of where and where not it applies. (An earlier PCRE did, but pulling it out to compile time simplified things and must have benefited performance.) Because wide characters (and ranges) and just listed in an XCLASS, the compiler adds the other case to the list, but as has been pointed out, there is no duplication checking.

Your suggestion above of a different kind of optimization is also legitimate. However, how far should we go in such things? Who is actually going to write [\p{Cn}\pC] ? If we are thinking of programs generating patterns, then it really should be the generating program that does its own optimization, IMHO.

zherczeg commented 2 months ago

I would like to optimize actual ranges only, no properties. The ranges should be sorted, so we can use logarithmic search, which is much faster for many ranges.

zherczeg commented 2 months ago

Since the character sizes are not fixed in utf, the whole binary search should be implemented as byte code. Structure could be something like this:

Range list start with a pointer to end: LINK_SIZE

Operation types: LESS_THAN: followed by a c character. If the current character is less than c, we have a match, and code pointer goes to end. A c set to 0 never matches because of less than. GREATER_OR_EQUAL_THAN: followed by c and a pointer LINK_SIZE. If the current character is >= 'c', 'c' subtracted from the character, and code pointer is moved ahead by LINK_SIZE.

Since we have only two operations, perhaps we could encode them without an operation byte in some way.

Probably LESS_THAN, GREATER_OR_EQUAL_THAN are in pairs, and can be encoded as c, c, LINK_SIZE. This way we have only one type of operation. If the first c is 0, no GREATER_OR_EQUAL_THAN part, and match fails.

zherczeg commented 2 months ago

Btw the 0..MAX_UTF range cannot be encoded in the structure above, since MAX_UTF+1 cannot be encoded in UTF16. But this is the same as ALL characters. If we have at least one range, 'c' is always <= MAX_UTF. Maybe for a few ranges (<=2) we could simply use the current method.

zherczeg commented 2 months ago

474

zherczeg commented 2 months ago

There is progress on this, the issue can be closed.

zherczeg commented 1 month ago

I was talking about the binary search problem of xclass ranges: should be relatively simple (or the binary search will not be efficient), but at the same time should not increase the code size too much. I have to following proposal for the problem:

There should be four range lists:

A: [0x0 - 0x7fff] using 2 byte items
B: [0x8000 - 0xffff] using 2 byte items
C: [0x10000 - 0x7fffffff] using 4 byte items
D: [0x80000000 - 0xffffffff] using 4 byte items

Each range list contains characters and ranges in increasing order. 16 bit character matching can only use A and B ranges, while unicode matching can only use A, B and C ranges. D range is only used in 32 bit mode, no utf.

The character values are shifted left by 1, and the lowest bit is cleared for range starts.

Example: the [456, 789-1245, 4236] character values are encoded as ((456 << 1) | 1), (789 << 1), ((1245 << 1) | 1), ((4236 << 1) | 1).

The binary search searches the lower_bound of a search_value = ((original_character_value << 1) | 1) in the list. The original_character_value is in the list, if search_value == lower_bound || (lower_bound & 0x1) == 0.

When a range intersects with multiple range lists, a range is created for them in each range list. The worst case is that a single range is present in A,B,C and D lists as well. Should be a rare case.

For B and D ranges the ((character_value << 1) | 1) shifts the highest bit out (it is always 1 anyway), but that is no problem in practice.

All ranges contains characters, which size is less than or equal than the size of the unicode representation of that character. For large number of characters/ranges, this should consume less amount of space than the current implementation. For a low amount of characters/ranges (<= 8), we could keep the current code.

We need some header for the range lists, I haven't decided it. Probably the first character represents the number of elements in the list (e.g. 16 bit is enough for 16 bit range lists, since 32K items is the maximum amount). The data for range lists must be naturally aligned.

Let me know if you have some suggestions.

NWilson commented 1 month ago

The binary search searches the lower_bound of a search_value = ((original_character_value << 1) | 1) in the list. The original_character_value is in the list, if search_value == lower_bound || (lower_bound & 0x1) == 0.

Genius! It's self-synchronising so you can bisect it, and recover whether you have a range or single character.

How important is it to use tricks like this to shave off bytes? It seems OK, but not super-necessary.

We can encode it like this:

#define XCL_NOT      0x01  /* Flag: this is a negative class */
#define XCL_MAP      0x02  /* Flag: a 32-byte map is present */
#define XCL_HASPROP  0x04  /* Flag: property checks are present. */
#define XCL_HASNOTPROP  0x08  /* Flag: not property checks are present. */
#define XCL_HASRANGE_A 0x10  /* Flag: Zoltan's "range A". */
#define XCL_HASRANGE_D 0x80  /* Flag: Zoltan's "range D". */

get rid of XCL_END and the others...

length	contents
1 SPTR	OP_XCLASS
LINK_SIZE	length of the whole opcode
1 SPTR	the XCL_ flags above
32 bytes	if the XCL_MAP set, then the map
...	a length-prefixed list of props, if the XCL_HASPROP is set
...	a length-prefixed list of props, if the XCL_HASNOTPROP is set
...	a length-prefixed list of ranges, if the XCL_HASRANGE_A is set
...	...

The sections don't need any tags to indicate what they are, we just include them always in the same order, based on whether the flag indicates to include the section.

PCRE2Project / pcre2

The xclass caseless repetition problem #469

474