Open Quuxplusone opened 5 years ago
Bugzilla Link | PR41536 |
Status | REOPENED |
Importance | P enhancement |
Reported by | Casey Carter (Casey@Carter.net) |
Reported on | 2019-04-18 19:35:27 -0700 |
Last modified on | 2019-04-19 18:59:49 -0700 |
Version | 8.0 |
Hardware | PC Windows NT |
CC | blitzrakete@gmail.com, dgregor@apple.com, efriedma@quicinc.com, erik.pilkington@gmail.com, llvm-bugs@lists.llvm.org, richard-llvm@metafoo.co.uk, rnk@google.com |
I think clang is working as intended here. I looked at [lex.charset] in the C++
standard, and it specifically says that \u and \U escapes designate characters
from ISO/IEC 10646 (the UCS):
"""
The character designated by the universal-character-name \UNNNNNNNN is that
character whose character short name in ISO/IEC 10646 is NNNNNNNN; the
character designated by the universal-character-name \uNNNN is that character
whose character short name in ISO/IEC 10646 is 0000NNNN.
"""
It's arguable that we should strive for bug-for-bug compatibility with MSVC in
this case, but I personally don't think we should.
Regarding the very real concern of emitting Unicode in a Windows command
prompt, my advice is to always stick to the wide APIs, unfortunately. LLVM
itself goes to the trouble of calling WriteConsoleW directly:
https://github.com/llvm/llvm-project/blob/2946cd701067404b99c39fb29dc9c74bd7193eb3/llvm/lib/Support/raw_ostream.cpp#L652
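For illustration, here is a rough sketch of that approach: convert UTF-8 to UTF-16, then hand the result to WriteConsoleW when stdout is a real console. This is not the code from raw_ostream.cpp; the helper names `utf8_to_utf16` and `write_utf8_to_console` are made up for this example, and the decoder is deliberately simplified (it substitutes U+FFFD for malformed bytes and does not reject overlong forms).

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: decode UTF-8 and re-encode it as UTF-16.
// Simplified: malformed sequences become U+FFFD, overlong forms are not rejected.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t len = 0;
        if (lead < 0x80)               { cp = lead;        len = 1; }
        else if ((lead >> 5) == 0x06)  { cp = lead & 0x1F; len = 2; }
        else if ((lead >> 4) == 0x0E)  { cp = lead & 0x0F; len = 3; }
        else if ((lead >> 3) == 0x1E)  { cp = lead & 0x07; len = 4; }
        else { out.push_back(static_cast<char16_t>(0xFFFD)); ++i; continue; }

        if (i + len > in.size()) {  // truncated sequence at end of input
            out.push_back(static_cast<char16_t>(0xFFFD));
            break;
        }
        bool ok = true;
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) { ok = false; break; }
            cp = (cp << 6) | (c & 0x3F);
        }
        if (!ok || cp > 0x10FFFF) {
            out.push_back(static_cast<char16_t>(0xFFFD));
            ++i;
            continue;
        }
        i += len;
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {  // encode as a surrogate pair
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}

// Hypothetical helper: write UTF-8 text to stdout. On Windows, if stdout is a
// console, go through WriteConsoleW so the output does not depend on the
// console code page; otherwise write the raw bytes.
bool write_utf8_to_console(const std::string& utf8) {
#ifdef _WIN32
    HANDLE h = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD mode = 0;
    if (h != INVALID_HANDLE_VALUE && GetConsoleMode(h, &mode)) {
        std::u16string wide = utf8_to_utf16(utf8);
        DWORD written = 0;
        return WriteConsoleW(h, wide.data(), static_cast<DWORD>(wide.size()),
                             &written, nullptr) != 0;
    }
#endif
    return std::fwrite(utf8.data(), 1, utf8.size(), stdout) == utf8.size();
}
```

The GetConsoleMode check is there because WriteConsoleW only works on console handles; when output is redirected to a file or pipe, the sketch falls back to writing the bytes unchanged.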
There's a real issue here, I think. Yes, "\U" escapes specify a Unicode character, but the standard doesn't specify how Unicode characters are encoded (outside of u/U/u8 string literals).
Specifically, the issue here is that clang-cl has a different default from cl for /execution-charset.
Clang currently does not support anything equivalent to the MSVC /execution-charset flag. It assumes the source and execution charsets are both UTF-8 (as if the MSVC "/utf-8" flag were passed). We mostly get away with this at the moment because most source code is ASCII, and we have a hack to pass through the raw bytes of string literals even if they aren't valid UTF-8.
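A small sketch of how the difference can be observed: dump the bytes of a narrow literal that contains a UCN. The names `hex_bytes` and `narrow_e_acute` are invented for this example. With a UTF-8 execution charset the literal should come out as the two bytes C3 A9; with cl.exe the result depends on the execution character set in effect (for instance, a single byte E9 if that is Windows-1252).

```cpp
#include <string>

// Hypothetical helper: render the bytes of a narrow string as uppercase hex,
// separated by single spaces.
std::string hex_bytes(const std::string& s) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : s) {
        if (!out.empty()) out += ' ';
        out += digits[c >> 4];
        out += digits[c & 0x0F];
    }
    return out;
}

// U+00E9 written as a UCN in an ordinary narrow literal. The compiler encodes
// it in the execution character set, so the bytes differ between compilers
// and flags.
const char narrow_e_acute[] = "\u00e9";

// The same character spelled out as explicit UTF-8 bytes, for comparison.
// Hex escapes are not translated, so this is C3 A9 everywhere.
const char utf8_e_acute[] = "\xC3\xA9";
```

Comparing `hex_bytes(narrow_e_acute)` against `hex_bytes(utf8_e_acute)` shows whether the compiler encoded the UCN as UTF-8.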
It's not clear we would actually want to change the defaults here, but it seems like a legitimate request to provide the option to specify /execution-charset and /source-charset.
It would be a substantial project to implement /execution-charset and /source-charset, probably. There isn't anything fundamentally tricky; for any ASCII-compatible encoding, it's basically just a matter of translating string literals and identifiers correctly. (We generally don't need to translate comments, and non-ASCII characters aren't legal anywhere else.) But LLVM currently doesn't have any support for translating from Unicode to non-Unicode charsets, so it's likely to spark a complicated debate over how to perform that translation.
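To make "translating string literals" concrete, here is a minimal sketch for one ASCII-compatible target, ISO-8859-1, chosen only because its mapping from Unicode is the identity for U+0000 through U+00FF. This is not how Clang does or would do it; `encode_latin1` is a hypothetical name, and a real implementation would need mapping tables (or a library) for other charsets plus a diagnostic for unrepresentable characters.

```cpp
#include <string>

// Hypothetical helper: translate the decoded contents of a string literal
// (Unicode code points) into ISO-8859-1 bytes.
// Returns false and leaves `out` untouched if any code point has no
// ISO-8859-1 representation; a compiler would emit a diagnostic there.
bool encode_latin1(const std::u32string& code_points, std::string& out) {
    std::string result;
    for (char32_t cp : code_points) {
        if (cp > 0xFF) {
            return false;  // not representable in the target charset
        }
        result.push_back(static_cast<char>(static_cast<unsigned char>(cp)));
    }
    out = result;
    return true;
}
```

The open question raised above is what to do in the failing case and where the mapping tables come from, not the loop itself.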
See also bug 39864.
> It would be a substantial project to implement /execution-charset and
> /source-charset, probably.
I think these two features should be considered separately; while they both
require a mechanism to translate between encodings, their impacts on the
frontend are quite different in magnitude and invasiveness. (If we're prepared
to link to ICU, execution charset support is probably pretty straightforward; I
don't think the same is likely to be true for source charset support.)
(In reply to Eli Friedman from comment #2)
> There's a real issue here, I think. Yes, "\U" escapes specify a Unicode
> character, but the standard doesn't specify how Unicode characters are
> encoded (outside of u/U/u8 string literals).
>
> Specifically, the issue here is that clang-cl has a different default from
> cl for /execution-charset.
I probably should have called this out in my report, but I wasn't certain if
the issue was a different execution character set or if only UCNs were being
encoded as UTF-8, and I didn't have time to investigate further. Thank you for
the clarification.
(In reply to Richard Smith from comment #3)
> > It would be a substantial project to implement /execution-charset and
> > /source-charset, probably.
>
> I think these two features should be considered separately; while they both
> require a mechanism to translate between encodings, their impacts on the
> frontend are quite different in magnitude and invasiveness.
For any charset that's an ASCII superset, we only need to be aware of the
source character set in the same places we handle UCNs; it should be roughly
the same complexity. Yes, it would be trickier to handle other character sets,
like UTF-16, or Shift JIS, or EBCDIC.
We technically don't need to consider the source and execution charsets
together, but /execution-charset on its own is probably not that useful for
compatibility with existing code.
(In reply to Casey Carter from comment #4)
> I wasn't certain if the issue was a different execution character set or
> if only UCNs were being encoded as UTF-8
Given ASCII source code, and an execution character set that is an ASCII
superset, I don't think there's any other way to distinguish the execution
character set.
(In reply to Eli Friedman from comment #5)
> For any charset that's an ASCII superset, we only need to be aware of the
> source character set in the same places we handle UCNs; it should be roughly
> the same complexity. Yes, it would be trickier to handle other character
> sets, like UTF-16, or Shift JIS, or EBCDIC.
Those three are exactly the three encodings I've seen requests for Clang to
support, though :)
> We technically don't need to consider the source and execution charsets
> together, but /execution-charset on its own is probably not that useful for
> compatibility with existing code.
Well, this bug is an example of a case where it would be useful :)
More generally, we can ask people to re-encode their source code as UTF-8 if
they want to use Clang. People might disagree on how reasonable that request
is, but I think it's clearly less reasonable for us to ask people to change
their execution character set. So I think there is a case to be made for us
supporting changing the execution character set but not the source character
set.