Closed carstimon closed 2 years ago
The get_lstring API didn't age well: Z3 used to only support ASCII (8 bits per character). It now supports up to 32 bits and defaults to a Unicode range as defined by SMTLIB string definition. A proper API that returns unescaped characters would have to return an array of unsigned numbers. Internally, we currently use get_lstring only in the Python API where it is wrapped with decoding. To fix this, I would add a different API function altogether and leave it to consumers to convert unsigned into wchar or char or other encoding.
I updated the C++ API by removing the "escaped" getter because they are both escaped (and there was another bug in one of the escape conversions to fix). There is a new function get_wstring() that returns a vector of unsigned. I am not sure if there is any more useful API (maybe one of the C++ versions for building Unicode strings works, but wchar definitely doesn't have enough bits with only 16).
Thanks for the quick turnaround, this is great!
I am only half familiar with this, but C++ has
In my (again limit) experience most situations have std::string with an encoded utf8 rather then a std::u32string.
The conversion between a utf8-encoded std::string and a std::u32string of codepoints in the stl is shown here: https://stackoverflow.com/a/43302460/9517687 This is relatively small amount of code so I think it is reasonable to not do it in Z3, applying the principle of "making the fewest choices in a library" since the data is already basically in the std::u32string format.
If you think it makes sense, I can make the changes to implement something like
std::u32string expr::get_unescaped_string() // Maybe instead of std::vector<unsigned> version?
expr context::unescaped_utf8_string_val(const std::u32string& str);
The change between std::u32string and std::vector
The new version for creating exprs avoids some questions about how to properly call the string_val
method with a potentially unicode string, I think? If I understand correctly the current methods force it to ASCII, and I'm not sure how that interacts with other UTF8 strings. (E.g. if i made an ascii constant and compared it to a free utf8 variable what happens).
I have iterated on this with a few updates. While I haven't adapted the more descriptive naming convention with "unescaped_utf8_string_val" I believe the functionality now addresses accessing characters properly.
I am closing this for now. If there is something that needs fixing add a comment, but otherwise I have to assume it is handled.
For https://github.com/Z3Prover/z3/issues/2286 the function Z3_get_lstring was introduced, and its documentation says it returns an unescaped string. We use this in the C++ API via "get_string" which was introduced in a second commit. Our team uses this function to get the string in a C++ program, and we need to get the string to be exactly the string produced by z3 (e.g. if we are asserting the length is 3 we expect the length of the produced string to be 3). Furthermore, the result is now ambiguous: "\u{a}" could mean that Z3 found the one-character string "\n" or the 5 character string "\u{a}".
As of this commit and this commit the function now does escaping. In particular
c.string_val("\n", 1).get_string() == "\n"
is no longer true. Is there a version to preserve the round-trip?I don't have the context of the first commit that escaped non-ascii bytes, but I would expect that that round trip works for even unicode if that is now supported.