Z3_get_lstring now adds escaping of some bytes.

carstimon commented 3 years ago

For https://github.com/Z3Prover/z3/issues/2286 the function Z3_get_lstring was introduced, and its documentation says it returns an unescaped string. We use this in the C++ API via "get_string" which was introduced in a second commit. Our team uses this function to get the string in a C++ program, and we need to get the string to be exactly the string produced by z3 (e.g. if we are asserting the length is 3 we expect the length of the produced string to be 3). Furthermore, the result is now ambiguous: "\u{a}" could mean that Z3 found the one-character string "\n" or the 5 character string "\u{a}".

As of this commit and this commit the function now does escaping. In particular c.string_val("\n", 1).get_string() == "\n" is no longer true. Is there a version to preserve the round-trip?

I don't have the context of the first commit that escaped non-ascii bytes, but I would expect that that round trip works for even unicode if that is now supported.

NikolajBjorner commented 3 years ago

The get_lstring API didn't age well: Z3 used to only support ASCII (8 bits per character). It now supports up to 32 bits and defaults to a Unicode range as defined by SMTLIB string definition. A proper API that returns unescaped characters would have to return an array of unsigned numbers. Internally, we currently use get_lstring only in the Python API where it is wrapped with decoding. To fix this, I would add a different API function altogether and leave it to consumers to convert unsigned into wchar or char or other encoding.

NikolajBjorner commented 3 years ago

I updated the C++ API by removing the "escaped" getter because they are both escaped (and there was another bug in one of the escape conversions to fix). There is a new function get_wstring() that returns a vector of unsigned. I am not sure if there is any more useful API (maybe one of the C++ versions for building Unicode strings works, but wchar definitely doesn't have enough bits with only 16).

carstimon commented 3 years ago

Thanks for the quick turnaround, this is great!

I am only half familiar with this, but C++ has

std::string (sequence of char): Array of bytes
std::wstring (sequence of wchar_t): Array of an unpromised number of bits (depend on platform, ugh)
std::u16string (sequence of char16_t): Array of 16 bits (thankfully independent of platform)
std::u32string (sequence of char32_t): Array of 32 bits (thankfully independent of platform)

In my (again limit) experience most situations have std::string with an encoded utf8 rather then a std::u32string.

The conversion between a utf8-encoded std::string and a std::u32string of codepoints in the stl is shown here: https://stackoverflow.com/a/43302460/9517687 This is relatively small amount of code so I think it is reasonable to not do it in Z3, applying the principle of "making the fewest choices in a library" since the data is already basically in the std::u32string format.

If you think it makes sense, I can make the changes to implement something like

  std::u32string expr::get_unescaped_string() // Maybe instead of std::vector<unsigned> version?
  expr context::unescaped_utf8_string_val(const std::u32string& str);

The change between std::u32string and std::vector versions is not a big difference, just saves an extra conversion step if you want to use the conversion code I linked above.

The new version for creating exprs avoids some questions about how to properly call the string_val method with a potentially unicode string, I think? If I understand correctly the current methods force it to ASCII, and I'm not sure how that interacts with other UTF8 strings. (E.g. if i made an ascii constant and compared it to a free utf8 variable what happens).

NikolajBjorner commented 3 years ago

I have iterated on this with a few updates. While I haven't adapted the more descriptive naming convention with "unescaped_utf8_string_val" I believe the functionality now addresses accessing characters properly.

NikolajBjorner commented 2 years ago

I am closing this for now. If there is something that needs fixing add a comment, but otherwise I have to assume it is handled.

Z3Prover / z3

Z3_get_lstring now adds escaping of some bytes. #5615