Closed j0hnnybash closed 1 year ago
Can't we just remove u8 from those literals? We represent them with binary literals if we have to. The impression I have is reinterpret_cast is inappropriate or even UB here because char8_t is always unsigned. So it's either static_cast or UB. We should have a proper matrix here testing all compilers and standards with at least one factor setting -funsigned
for GCC to make sure it works. What's ironic is I have implemented the tools just for that (https://github.com/alandefreitas/cpp-actions) but haven't had the time to implement it. Here's (https://github.com/boostorg/url/blob/develop/.github/workflows/ci.yml) a repository where I did integrate such a matrix.
Yes you are correct, i missed that just using non u8 literals works because it is save to assume the execution and source encoding will be UTF-8 (at least the latter was already assumed by using the character itself and not a unicode codepoint escape sequence).
Side note: as per https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1423r2.html the reinterpret cast isn't UB because char
(unsigned or not) can always be used to access any other type (https://eel.is/c++draft/expr#basic.lval-11).
Side note2: these u8 literals are really unintuitive...
Side note: as per https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1423r2.html the reinterpret cast isn't UB because char (unsigned or not) can always be used to access any other type (https://eel.is/c++draft/expr#basic.lval-11).
Yes. You're right. It's not UB because char and unsigned char could be representing any byte sequence. I think it's still "improper" (UB is not the correct term) to reinterpret_cast though. If there's a divergence between the types being used, which is happen in most cases as char8_t is always unsigned and char is typically signed, then the reinterpreted output might represent the wrong unicode codepoint.
Side note2: these u8 literals are really unintuitive
Yes. It sucks. I had problems with it in other libraries too. The signed/unsigned relationship is problematic. And char without char8_t is also problematic because char is typically signed while unicode assumes chars are unsigned.
FYI: I updated the commit to simply remove the u8
prefix on the literals, instead of using reinterpret_casts.
Yes. That works. Thanks!
Beginning with C++20 u8"" literals use char8_t as underlying types making them incompatible with char based std::string. Adding reinterpret_casts to char is among the remediation approaches suggested by 1. There are only 4 uses of u8 literals in the codebase, also this is the simplest backwards compatible solution to this issue (the alternative would be to use ASCII escape codes in regular string literals). Long term it could make sense to replace uses of std::string with std::u8string.
This fixes #287.