HaxeFoundation / hxcpp

Runtime files for c++ backend for haxe

Migrate to pcre2 #1037

Closed tobil4sk closed 1 year ago

tobil4sk commented 1 year ago

~Not ready yet, currently there are some build issues, see: https://github.com/PCRE2Project/pcre2/issues/218.~ Build issues have now been addressed.

See https://github.com/HaxeFoundation/haxe/issues/10491

Uzume commented 1 year ago

Not ready yet, currently there are some build issues, see: PCRE2Project/pcre2#218 [...]

[...]So for now we'll need to build both pcre2-8 and pcre2-16 separately as libraries and link both.

hxcpp uses both UTF-8 and UTF-16? Does it use them at the same time? If it does not use them at the same time you will never need PCRE2_CODE_UNIT_WIDTH=0 (and won't have to refer to API with the code unit size suffixes) but you will have to either build PCRE2 twice or depend on multiple system library builds.
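For reference, PCRE2's CMake build exposes one option per code unit width, so both libraries could come out of a single configure pass rather than two full builds. A minimal sketch (option names as found in PCRE2's CMakeLists; verify against whatever version ends up vendored):

```shell
# Build static 8-bit and 16-bit PCRE2 libraries in one pass
cmake -B build \
  -DPCRE2_BUILD_PCRE2_8=ON \
  -DPCRE2_BUILD_PCRE2_16=ON \
  -DPCRE2_BUILD_PCRE2_32=OFF \
  -DBUILD_SHARED_LIBS=OFF
cmake --build build
# Produces libpcre2-8 and libpcre2-16; link both into hxcpp
```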

A quick search of the code tree shows both a system library dependency on libpcre3-dev and a "vendor"ed copy of pcre-8.42. I don't believe PCRE1 ever had a single library that supported multiple code unit widths either. You will notice that Debian/Ubuntu libpcre3-dev includes header files and also depends on (among other things) libpcre3, libpcre16-3 and libpcre32-3 (so those individually built library packages will also get installed). Similarly, Debian/Ubuntu libpcre2-dev includes headers and depends on libpcre2-8-0, libpcre2-16-0 and libpcre2-32-0.

A prebuilt application that used PCRE could just depend on the libraries and not need the headers, but hxcpp generates C++ code and then compiles it, so we need both the headers (to access the C API) and the libraries (to link against).

As tools/haxe/build_linux.sh refers to libpcre3-dev, I am surprised tools/haxe/build_osx.sh does not refer to the brew pcre package (it gets away with this due to the opam update on HaxeFoundation/haxe and its dependency on pcre-conf, which implies brew pcre on macOS; this was just broken with HaxeFoundation/haxe#11032). However, sadly, it seems we are building the local "vendor"ed copy of project/thirdparty/pcre-8.42.

Oddly enough, in addition to PCRE having a copy, hxcpp seems to have its own local "vendor"ed copy of zherczeg/sljit at src/hx/cppia/sljit_src. Even though sljit is a "platform independent low-level JIT compiler", I believe its author created it to accelerate PCRE/PCRE2, so any references to "pcre" in src/hx/cppia/sljit_src can be safely ignored (though we might want to consider what it is being used for and whether to update it).

src/hx/libs/regexp/Build.xml seems to build both hxcpp_regexp and hxcpp_regexp16, compiling src/hx/libs/regexp/RegExp.cpp against pcre-8.42 in an 8-bit-only version and an 8-bit plus 16-bit code unit version (to support HX_SMART_STRINGS). That build format seems to be our own custom tool/language (provided in tools/build/Build.hx and documented in docs/build_xml). It does not build PCRE as separate libraries but directly compiles and statically links the object files. I can imagine what needs to be done, but I am not sure I understand "build_xml" well enough to know how to change it to do what is needed for PCRE2. For example, hxcpp_regexp and hxcpp_regexp16 seem to be referenced nowhere except in src/hx/libs/regexp/Build.xml (so how do these build artifacts get consumed once built?). Maybe I can figure it out if I study the docs.

In short, things are sort of messy. I highly question why we are carrying around a copy of and building PCRE when Haxe itself also depends on PCRE (now PCRE2) but does not have its own copy (a similar argument could be made for hashlink, which apparently has its own copy while nekovm does not).

tobil4sk commented 1 year ago

hxcpp uses both UTF-8 and UTF-16?

Yes.

As tools/haxe/build_linux.sh refers to libpcre3-dev

This file was only used for building haxe from source to run unit tests, but I don't think it is used anymore.

I highly question why we are carrying around a copy of and building PCRE when Haxe itself also depends on PCRE (now PCRE2) but does not have its own copy (a similar argument could be said for hashlink which apparently has its own copy while nekovm does not).

Hxcpp is different from the others here because it is used to compile user projects, so the sources need to be easily available. Haxe, Hashlink and Neko on the other hand only need the sources when they are being built. Neko's CMake script is set up to manually download the pcre sources, so it doesn't need to keep a copy. Hashlink uses make and Visual Studio primarily so it is not so easy, but we could link it dynamically (or even statically) to the system copy on Linux/Mac.

Simn commented 1 year ago

Oh hey it's green, good job! Is there anything left to discuss/address or should we merge this?

tobil4sk commented 1 year ago

Oh hey it's green, good job! Is there anything left to discuss/address or should we merge this?

Actually yes, I was checking whether https://github.com/HaxeFoundation/haxe/issues/9500 is fixed, and it turns out there is now a problem with the string length passed into pcre2_compile... I think perhaps it requires the length in bytes rather than the number of characters.

Uzume commented 1 year ago

Actually yes, I was checking whether HaxeFoundation/haxe#9500 is fixed, and it turns out there is now a problem with the string length passed into pcre2_compile... I think perhaps it requires the length in bytes rather than the number of characters.

@tobil4sk: Yes, it can also take a special sentinel PCRE2_ZERO_TERMINATED which in the code seems to turn that into strlen so I am pretty sure that is character buffer size/byte length and not code unit length.

tobil4sk commented 1 year ago

I think it should be fine now. The original cause for https://github.com/HaxeFoundation/haxe/issues/9500 is the 0 (null pointers) passed in here. The function should take pointers to write error information to, and it seems to not like the null pointers even if there is no error.

https://github.com/HaxeFoundation/hxcpp/blob/3f6de84d4decb0a7aa1131ebd527f623b1e2d2b1/src/hx/libs/regexp/RegExp.cpp#L88

So I've added a test for this issue here. I think it makes more sense here than in the Haxe repo, since this is to do with a specific hxcpp_smart_strings behaviour. So I think once this is merged we can close https://github.com/HaxeFoundation/haxe/issues/9500.

With the changes here, we now pass the length into the compile function, because strings aren't necessarily null terminated (see https://github.com/HaxeFoundation/haxe/issues/10592). When converting utf-16 to utf-8, we need to make sure the correct length is still passed in, which should be the case now.

Uzume commented 1 year ago

With hxcpp_smart_strings, if a string only contains characters that can be represented by ASCII, it will use 8 bit encoding to save space. If it uses any non-ASCII characters then it uses utf-16. So in practice, we are not really converting utf-8 to utf-16, but ASCII to utf-16, so the length is going to remain consistent. Anyway, in the latest commits I fixed it so it will use the updated utf-16 length just in case, but it won't make a difference really.

@tobil4sk: Ah, that makes more sense. So "smart strings" in hxcpp is like PEP 393 – Flexible String Representation in Python 3.3+ (which treat all strings as Unicode encoded in either Latin1, UCS2 or UCS4/UTF32).

Let me see if I have the gist of it. You are saying hxcpp string objects are only considered to support Unicode if "smart strings" is enabled, and their internal representation is either 8-bit ASCII or UTF-16.

A normalized "smart string" representation should only be UTF-16 encoded if it contains a Unicode character code point larger than what can be encoded in 7-bit ASCII.

The tricky part with "smart strings" is when performing operations on multiple strings at once, e.g., binary string compares, regular expression matches/substitutions (where PCRE comes in), etc. In those cases if all the smart strings are not in the same encoded representation, those in ASCII encoded representations must first be converted to a (non-normalized) UTF-16 encoded representation before the operation and the result may need to be normalized back down to an ASCII encoded representation before being returned.

Is that about right?

tobil4sk commented 1 year ago

A normalized "smart string" representation should only be UTF-16 encoded if it contains a Unicode character code point larger than what can be encoded in 7-bit ASCII. [...] Is that about right?

Yes, that sounds right. So we convert only if matching an ASCII string against a utf-16 pattern, or a utf-16 string against an ASCII pattern. But now the conversions and the length values should be correct in these cases. In regular cases where conversions are not necessary, the .length property returns the correct value in code units so we can pass it into pcre2_compile_x without issues. If that all makes sense, we should be able to resolve all the string length threads.

Uzume commented 1 year ago

Yes, that sounds right. So we convert only if matching an ASCII string against a utf-16 pattern, or a utf-16 string against an ASCII pattern. But now the conversions and the length values should be correct in these cases. In regular cases where conversions are not necessary, the .length property returns the correct value in code units so we can pass it into pcre2_compile_x without issues. If that all makes sense, we should be able to resolve all the string length threads.

@tobil4sk: And that is why you (correctly) assert that when dealing with ASCII encoded representations (not multi-byte UTF-8) the byte string length is the same as the character code point length.

I am not sure this is relevant, but it should be noted that when dealing with UTF-16 encoded representations, strictly speaking the byte string length is not equal to two times the character code point length (due to surrogates, since UTF-16 is a multi-unit encoding whose 16-bit code units can combine to represent code points that do not fit into two bytes).

I am surprised hxcpp "smart strings" doesn't use Latin1 as the base, and that UTF-16 was chosen given that a multi-encoding scheme is already supported (why not just add 32-bit UCS4 when dealing with code points beyond the 16-bit BMP?). That would make its representation basically the same as PEP 393.

I am also surprised non-"smart strings" mode is still supported. The "hit" is very minor when dealing with all-ASCII/Latin1 8-bit strings (and when strings aren't all 8-bit there is no substitute, unless we want to support a plain UTF-8 or UTF-16 encoding).

tobil4sk commented 1 year ago

when dealing with UTF-16 encoded representations, strictly speaking the byte string length is not equal to two times character code point length

The length given by the .length member is in code units, not code points. Taking the example from before:

trace("🙂".length); // 2

I am also surprised non-"smart strings" is still supported

hxcpp-smart-strings is always on by default. It is only ever disabled if -D disable-unicode-strings is enabled.

Uzume commented 1 year ago

hxcpp-smart-strings is always on by default. It is only ever disabled if -D disable-unicode-strings is enabled.

Exactly! And disable-unicode-strings is still supported, as evidenced by the fact that you are maintaining code for such code paths.