zherczeg closed 1 month ago
I would like to add a test where a utf character <= 0xffff has an othercase pair > 0xffff, but there is no such character. This would have to be tested in 16-bit mode, with ucp, without utf.
By the way, the handling of horizontal and vertical spaces in the engine seems a bit chaotic to me.
The two lists are maintained by hand: https://github.com/PCRE2Project/pcre2/blob/master/src/pcre2_internal.h#L400
The characters mostly come from the Unicode White_Space property: https://github.com/PCRE2Project/pcre2/blob/master/maint/Unicode.tables/PropList.txt#L12
In ASCII mode: \v and \h represent the characters in the list, regardless of the ucp flag. Hence, in 16/32-bit mode, without utf, \h matches 0x2006.
In EBCDIC mode: \v and \h match one-byte characters only, even if utf is enabled. I don't see a reason why utf or the 16/32-bit libraries could not be supported in EBCDIC. However, \h and \v would not work in the same way as in ASCII.
I would not mind making this more consistent.
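To illustrate the ASCII-mode behaviour described above, here is a minimal sketch of a list-based lookup. It is not the actual PCRE2 code: the function names `is_hspace` and `is_vspace` and the array layout are hypothetical, and the code points are the ones the pcre2pattern documentation gives for \h and \v. The point is that the lookup never consults the ucp or utf flags, so any code unit wide enough to hold 0x2006 matches \h.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Code points documented for \h (horizontal space) in pcre2pattern. */
static const uint32_t hspace_list[] = {
  0x0009, 0x0020, 0x00a0, 0x1680, 0x180e,
  0x2000, 0x2001, 0x2002, 0x2003, 0x2004, 0x2005,
  0x2006, 0x2007, 0x2008, 0x2009, 0x200a,
  0x202f, 0x205f, 0x3000
};

/* Code points documented for \v (vertical space) in pcre2pattern. */
static const uint32_t vspace_list[] = {
  0x000a, 0x000b, 0x000c, 0x000d, 0x0085, 0x2028, 0x2029
};

/* Linear search; the lists are short, so this is good enough for a sketch. */
static bool in_list(const uint32_t *list, size_t count, uint32_t c)
{
  for (size_t i = 0; i < count; i++)
    if (list[i] == c) return true;
  return false;
}

/* Hypothetical helper: true if c is in the \h list.
   Note that no ucp or utf option is passed in or checked. */
static bool is_hspace(uint32_t c)
{
  return in_list(hspace_list, sizeof(hspace_list) / sizeof(hspace_list[0]), c);
}

/* Hypothetical helper: true if c is in the \v list. */
static bool is_vspace(uint32_t c)
{
  return in_list(vspace_list, sizeof(vspace_list) / sizeof(vspace_list[0]), c);
}
```

With these helpers, `is_hspace(0x2006)` is true for a 16-bit or 32-bit code unit regardless of any option, while in the 8-bit library without utf only 0x09, 0x20 and 0xa0 can ever be reached for \h because a single code unit cannot exceed 0xff.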
I have made a small fix to get rid of a shadow variable warning in pcre2_compile_class.c. With regard to EBCDIC, as far as I know Ze'ev is only interested in supporting EBCDIC code itself, which is an 8-bit code. Therefore, the 16/32 bit libraries are not of interest. But I may be wrong...
Thank you. So ... shall we change the non-ucp case or keep it? Or wait until somebody complains?
I forgot that horizontal/vertical spaces are stored in a separate list. This patch fixes it.