
[clang-cl] incorrectly encodes ordinary string literals containing universal-character-names in UTF-8 #40506

Open Quuxplusone opened 5 years ago

Quuxplusone commented 5 years ago
Bugzilla Link PR41536
Status REOPENED
Importance P enhancement
Reported by Casey Carter (Casey@Carter.net)
Reported on 2019-04-18 19:35:27 -0700
Last modified on 2019-04-19 18:59:49 -0700
Version 8.0
Hardware PC Windows NT
CC blitzrakete@gmail.com, dgregor@apple.com, efriedma@quicinc.com, erik.pilkington@gmail.com, llvm-bugs@lists.llvm.org, richard-llvm@metafoo.co.uk, rnk@google.com
Compiling this program:

  extern const char str[] = "\u0020\u00f4\u00e2";

  int main() {
    static_assert(sizeof(str) == 4, "BOOM");
    static_assert(sizeof(str) != 6, "BOOM");
  }

with "cl /FA /c" (presumably any version; notably the 19.20 release on Compiler
Explorar) succeeds, and produces assembly output containing the line:

  ?str@@3QBDB DB    ' ', 0f4H, 0e2H, 00H            ; str

Note that the three universal-character-names (UCNs) have been replaced with
the corresponding WINDOWS-1252 encodings, which happen to use the same code
unit values as Unicode, as required by [lex.ccon]/9.

Compiling the same program with "clang-cl /FA /c" does not succeed: both
static_asserts fire. With them commented out, the produced assembly contains
the line:

  .asciz    " \303\264\303\242"

which, despite using superior AT&T syntax, is substantially different. The UCNs
are now encoded in UTF-8, which will produce mojibake when output to a console
expecting WINDOWS-1252. \u0080 is another good example: cl rejects it, since
U+0080 has no representation in WINDOWS-1252, whereas clang-cl encodes U+0080
in UTF-8 without complaint.
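For concreteness, a minimal sketch that dumps the bytes of the literal
(assuming any C++11 compiler; the expected byte values are just the standard
Windows-1252 and UTF-8 encodings of U+00F4 and U+00E2):

  #include <cstdio>

  // "\u00f4" (U+00F4) is the single byte 0xF4 in Windows-1252 but the
  // two-byte sequence 0xC3 0xB4 in UTF-8; likewise for "\u00e2".
  extern const char str[] = "\u0020\u00f4\u00e2";

  int main() {
    // Prints 4 under cl's default execution charset (Windows-1252)
    // and 6 under clang-cl, which always emits UTF-8.
    std::printf("sizeof(str) = %zu\n", sizeof(str));
    for (char c : str)  // includes the trailing NUL
      std::printf("%02x ", static_cast<unsigned char>(c));
    std::printf("\n");
  }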
Quuxplusone commented 5 years ago
I think clang is working as intended here. I looked at [lex.charset] in the C++
standard, and it specifically says that these \u characters are characters from
ISO/IEC 10646 (the Universal Character Set):

"""
The character designated by the universal-character-name \UNNNNNNNN is that
character whose character
short name in ISO/IEC 10646 is NNNNNNNN; the character designated by the
universal-character-name \uNNNN
is that character whose character short name in ISO/IEC 10646 is 0000NNNN. I
"""

It's arguable that we should strive for bug-for-bug compatibility with MSVC in
this case, but I personally don't think we should.

Regarding the very real concern of emitting Unicode in a Windows command
prompt, my advice is, unfortunately, to always stick to the wide APIs. LLVM
itself goes to the trouble of calling WriteConsoleW directly:
https://github.com/llvm/llvm-project/blob/2946cd701067404b99c39fb29dc9c74bd7193eb3/llvm/lib/Support/raw_ostream.cpp#L652
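As a rough sketch of that approach (assuming a Windows target; the message
contents are just this bug's example):

  #include <windows.h>

  int main() {
    // Wide literals sidestep the execution-charset question entirely:
    // L"..." is always UTF-16 on Windows.
    const wchar_t msg[] = L" \u00f4\u00e2\n";
    DWORD written = 0;
    // WriteConsoleW takes UTF-16 code units directly, so the console
    // renders the text regardless of its current code page. (It fails
    // when stdout is redirected to a file, which is why the linked
    // raw_ostream code checks for a real console first.)
    WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), msg,
                  sizeof(msg) / sizeof(msg[0]) - 1, &written, nullptr);
  }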
Quuxplusone commented 5 years ago

There's a real issue here, I think. Yes, "\U" escapes specify a Unicode character, but the standard doesn't specify how Unicode characters are encoded (outside of u/U/u8 string literals).

Specifically, the issue here is that clang-cl has a different default from cl for /execution-charset.

clang currently does not support anything equivalent to the MSVC /execution-charset flag. It assumes the source and execution charsets are both UTF-8 (as if the MSVC "/utf-8" flag were passed). We mostly get away with this at the moment because most source code is ASCII, and we have a hack to pass through the raw bytes of string literals even if they aren't valid UTF-8.

It's not clear we would actually want to change the defaults here, but it seems like a legitimate request to provide the option to specify /execution-charset and /source-charset.

It would be a substantial project to implement /execution-charset and /source-charset, probably. There isn't anything fundamentally tricky; for any ASCII-compatible encoding, it's basically just a matter of translating string literals and identifiers correctly. (We generally don't need to translate comments, and non-ASCII characters aren't legal anywhere else.) But LLVM currently doesn't have any support for translating from Unicode to non-Unicode charsets, so it's likely to spark a complicated debate over how to perform that translation.
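(For a single-byte, ASCII-compatible target like Windows-1252, the
per-code-point half of that translation could be as small as the following
hypothetical sketch; the table fragment and function name are illustrative,
not Clang code:)

  #include <cstdint>
  #include <optional>

  // Hypothetical: map a Unicode code point to a Windows-1252 byte.
  // ASCII and U+00A0..U+00FF pass through unchanged; the 0x80..0x9F
  // byte range holds unrelated characters (0x80 is U+20AC, the euro
  // sign), so code points like U+0080 have no encoding at all and
  // must be diagnosed, which is exactly what cl does.
  std::optional<std::uint8_t> toWindows1252(char32_t cp) {
    if (cp < 0x80 || (cp >= 0xA0 && cp <= 0xFF))
      return static_cast<std::uint8_t>(cp);
    switch (cp) {
    case U'\u20AC': return 0x80; // euro sign
    case U'\u2018': return 0x91; // left single quotation mark
    case U'\u2019': return 0x92; // right single quotation mark
    // ... remaining 0x80..0x9F entries elided ...
    default: return std::nullopt; // unrepresentable: emit an error
    }
  }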

See also bug 39864.

Quuxplusone commented 5 years ago
> It would be a substantial project to implement /execution-charset and
> /source-charset, probably.

I think these two features should be considered separately; while they both
require a mechanism to translate between encodings, their impacts on the
frontend are quite different in magnitude and invasiveness. (If we're prepared
to link to ICU, execution charset support is probably pretty straightforward; I
don't think the same is likely to be true for source charset support.)
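(A minimal sketch of what the ICU route might look like, assuming ICU4C's
ucnv.h; "windows-1252" is a real ICU converter alias, but the error handling
a compiler frontend would need is elided:)

  #include <unicode/ucnv.h>
  #include <cstdio>

  int main() {
    UErrorCode status = U_ZERO_ERROR;
    UConverter *conv = ucnv_open("windows-1252", &status);
    const UChar src[] = u" \u00f4\u00e2"; // UTF-16 in, per ICU convention
    char dest[16];
    // By default, unmappable code points are substituted rather than
    // rejected; a frontend would install a stop callback so that e.g.
    // U+0080 produces a diagnostic instead.
    int32_t len = ucnv_fromUChars(conv, dest, sizeof dest, src, -1, &status);
    if (U_SUCCESS(status))
      std::printf("%d bytes\n", len); // 3 bytes: 0x20 0xF4 0xE2
    ucnv_close(conv);
  }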
Quuxplusone commented 5 years ago
(In reply to Eli Friedman from comment #2)
> There's a real issue here, I think.  Yes, "\U" escapes specify a Unicode
> character, but the standard doesn't specify how Unicode characters are
> encoded (outside of u/U/u8 string literals).
>
> Specifically, the issue here is that clang-cl has a different default from
> cl for /execution-charset.

I probably should have called this out in my report, but I wasn't certain if
the issue was a different execution character set or if only UCNs were being
encoded as UTF-8, and I didn't have time to investigate further. Thank you for
the clarification.
Quuxplusone commented 5 years ago
(In reply to Richard Smith from comment #3)
> > It would be a substantial project to implement /execution-charset and
> > /source-charset, probably.
>
> I think these two features should be considered separately; while they both
> require a mechanism to translate between encodings, their impacts on the
> frontend are quite different in magnitude and invasiveness.

For any charset that's an ASCII superset, we only need to be aware of the
source character set in the same places we handle UCNs; it should be roughly
the same complexity. Yes, it would be trickier to handle other character sets,
like UTF-16, or Shift JIS, or EBCDIC.

We technically don't need to consider the source and execution charsets
together, but /execution-charset on its own is probably not that useful for
compatibility with existing code.

(In reply to Casey Carter from comment #4)
> I wasn't certain if the issue was a different execution character set or
> if only UCNs were being encoded as UTF-8

Given ASCII source code, and an execution character set that is an ASCII
superset, I don't think there's any other way to distinguish the execution
character set.
Quuxplusone commented 5 years ago
(In reply to Eli Friedman from comment #5)
> For any charset that's an ASCII superset, we only need to be aware of the
> source character set in the same places we handle UCNs; it should be roughly
> the same complexity. Yes, it would be trickier to handle other character
> sets, like UTF-16, or Shift JIS, or EBCDIC.

Those three are exactly the three encodings I've seen requests for Clang to
support, though :)

> We technically don't need to consider the source and execution charsets
> together, but /execution-charset on its own is probably not that useful for
> compatibility with existing code.

Well, this bug is an example of a case where it would be useful :)

More generally, we can ask people to re-encode their source code as UTF-8 if
they want to use Clang. People might disagree on how reasonable that request
is, but I think it's clearly less reasonable for us to ask people to change
their execution character set. So I think there is a case to be made for us
supporting changing the execution character set but not the source character
set.