Handle spaces by the new range system

zherczeg commented 1 month ago

I forgot that horizontal/vertical spaces are stored in a separate list. This patch fixes it.

zherczeg commented 1 month ago

I would like to add a test where a UTF character <= 0xffff has an othercase pair > 0xffff, but there is no such character. This should be tested in 16-bit mode, with ucp, without utf.

zherczeg commented 1 month ago

By the way, the handling of horizontal and vertical spaces in the engine is a bit chaotic to me.

The two lists are maintained by hand: https://github.com/PCRE2Project/pcre2/blob/master/src/pcre2_internal.h#L400

The characters mostly come from the Unicode White_Space property: https://github.com/PCRE2Project/pcre2/blob/master/maint/Unicode.tables/PropList.txt#L12

In ASCII mode, \v and \h represent the characters in the lists, regardless of the ucp flag. Hence, in 16/32-bit mode without utf, \h matches 0x2006. In EBCDIC mode, \v and \h match single-byte characters only, even if utf is enabled. I don't see a reason why utf or the 16/32-bit libraries cannot be supported in EBCDIC. However, \h and \v will not work in the same way as in ASCII.

I would not mind making this more consistent.
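
To illustrate the ASCII-mode behaviour described above, here is a minimal sketch of a list-based membership test. The code point values are an assumption based on the hand-maintained lists and should be checked against pcre2_internal.h; the names `hspace_list`, `vspace_list`, `matches_hspace` and `matches_vspace` are hypothetical and are not part of PCRE2. The lookup takes no ucp or utf flag, which is why \h would match 0x2006 whenever the code unit width can represent it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed copy of the horizontal space list (not the real macro). */
static const uint32_t hspace_list[] = {
  0x0009, 0x0020, 0x00a0, 0x1680, 0x180e,
  0x2000, 0x2001, 0x2002, 0x2003, 0x2004, 0x2005,
  0x2006, 0x2007, 0x2008, 0x2009, 0x200a,
  0x202f, 0x205f, 0x3000
};

/* Assumed copy of the vertical space list (not the real macro). */
static const uint32_t vspace_list[] = {
  0x000a, 0x000b, 0x000c, 0x000d, 0x0085, 0x2028, 0x2029
};

/* Linear search; the lists are short enough that this is adequate here. */
static bool in_list(const uint32_t *list, size_t len, uint32_t c)
{
  for (size_t i = 0; i < len; i++)
    if (list[i] == c) return true;
  return false;
}

/* Hypothetical stand-in for \h: the result depends only on the list. */
bool matches_hspace(uint32_t c)
{
  return in_list(hspace_list, sizeof(hspace_list) / sizeof(hspace_list[0]), c);
}

/* Hypothetical stand-in for \v: the result depends only on the list. */
bool matches_vspace(uint32_t c)
{
  return in_list(vspace_list, sizeof(vspace_list) / sizeof(vspace_list[0]), c);
}
```

In an 8-bit library without utf, a subject character can never exceed 0xff, so only the low entries are reachable there. In the 16/32-bit libraries the higher entries become reachable even without utf or ucp.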

PhilipHazel commented 1 month ago

I have made a small fix to get rid of a shadow variable warning in pcre2_compile_class.c. With regard to EBCDIC, as far as I know Ze'ev is only interested in supporting EBCDIC code itself, which is an 8-bit code. Therefore, the 16/32 bit libraries are not of interest. But I may be wrong...

zherczeg commented 1 month ago

Thank you. So ... shall we change the non-ucp case or keep it? Or wait until somebody complains?

Handle spaces by the new range system #494