| | |
|---|---|
| Bugzilla Link | PR20481 |
| Status | NEW |
| Importance | P normal |
| Reported by | Fred (frederic.metrich@free.fr) |
| Reported on | 2014-07-29 07:31:01 -0700 |
| Last modified on | 2017-04-26 07:14:27 -0700 |
| Version | 4.0 |
| Hardware | PC Linux |
| CC | eric@efcs.ca, llvm-bugs@lists.llvm.org, mclow.lists@gmail.com, tstellar@redhat.com, zilla@kayari.org |
| Fixed by commit(s) | |
| Attachments | |
| Blocks | |
| Blocked by | |
| See also | |
It seems that libc++ treats "consume_header" as "optionally discard a BOM that matches my configured endianness" rather than "read a BOM, if present, to determine the endianness". That is, you have to create the facet with the correct endianness in the first place for the BOM to be "consumed" at all.
The standard says:

> If (Mode & consume_header), the facet shall consume an initial header sequence, if present, when reading a multibyte sequence to determine the endianness of the subsequent multibyte sequence to be read.

i.e. the facet must be able to switch endianness at run time based on the BOM, not have it fixed by its template arguments.
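For illustration, here is a minimal sketch of the run-time dispatch that wording requires (the helper and its names are hypothetical, not libc++ internals):

```cpp
#include <cstddef>

// Hypothetical helper: inspect an initial BOM, if present, and pick the
// endianness at run time; otherwise fall back to the facet's configured one.
enum class utf16_endian { big, little };

inline utf16_endian detect_utf16_endianness(const unsigned char* p,
                                            std::size_t n,
                                            utf16_endian fallback)
{
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF) return utf16_endian::big;
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE) return utf16_endian::little;
    return fallback; // no BOM present
}
```

The following self-contained program exercises all four combinations of input endianness and facet mode: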
```cpp
#include <codecvt>
#include <iostream>
#include <locale>

int main()
{
    {
        // UTF-16BE input, facet created with default (big-endian) mode:
        // the BOM is consumed correctly.
        const char src[] = "\xFE\xFF\xAB\xCD";
        std::wstring_convert<std::codecvt_utf16<char16_t, 0x10FFFF,
            std::consume_header>, char16_t> conv;
        auto dst = conv.from_bytes(src, src + 4);
        if (dst[0] != 0xabcd)
            std::cout << "Failed to process BOM from UTF-16BE string when"
                         " expecting BE\nRead "
                      << std::hex << (unsigned)dst[0] << '\n';
    }
    {
        // UTF-16LE input, facet created with default (big-endian) mode:
        // the BOM should switch the facet to little-endian.
        const char src[] = "\xFF\xFE\xAB\xCD";
        std::wstring_convert<std::codecvt_utf16<char16_t, 0x10FFFF,
            std::consume_header>, char16_t> conv;
        auto dst = conv.from_bytes(src, src + 4);
        if (dst[0] != 0xcdab)
            std::cout << "Failed to process BOM from UTF-16LE string when"
                         " expecting BE\nRead "
                      << std::hex << (unsigned)dst[0] << '\n';
    }
    {
        // UTF-16BE input, facet created with little_endian mode:
        // the BOM should switch the facet to big-endian.
        const char src[] = "\xFE\xFF\xAB\xCD";
        std::wstring_convert<std::codecvt_utf16<char16_t, 0x10FFFF,
            std::codecvt_mode(std::consume_header | std::little_endian)>,
            char16_t> conv;
        auto dst = conv.from_bytes(src, src + 4);
        if (dst[0] != 0xabcd)
            std::cout << "Failed to process BOM from UTF-16BE string when"
                         " expecting LE\nRead "
                      << std::hex << (unsigned)dst[0] << '\n';
    }
    {
        // UTF-16LE input, facet created with little_endian mode:
        // the BOM is consumed correctly.
        const char src[] = "\xFF\xFE\xAB\xCD";
        std::wstring_convert<std::codecvt_utf16<char16_t, 0x10FFFF,
            std::codecvt_mode(std::consume_header | std::little_endian)>,
            char16_t> conv;
        auto dst = conv.from_bytes(src, src + 4);
        if (dst[0] != 0xcdab)
            std::cout << "Failed to process BOM from UTF-16LE string when"
                         " expecting LE\nRead "
                      << std::hex << (unsigned)dst[0] << '\n';
    }
}
```
This prints:

```
Failed to process BOM from UTF-16LE string when expecting BE
Read fffe
Failed to process BOM from UTF-16BE string when expecting LE
Read fffe
```
i.e. it fails "to determine the endianness of the subsequent multibyte sequence to be read"; instead it unconditionally assumes the endianness fixed by the codecvt_mode template argument.
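Until this is fixed, a user-side workaround seems possible: detect the BOM manually and construct a facet whose compile-time endianness already matches, since the matching-endianness cases above do consume the BOM correctly. A sketch (the function name is invented for illustration):

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Workaround sketch: choose the facet's endianness from the BOM ourselves,
// so consume_header only ever sees a BOM that matches its endianness.
std::u16string utf16_bytes_to_string(const char* first, const char* last)
{
    using BE = std::codecvt_utf16<char16_t, 0x10FFFF, std::consume_header>;
    using LE = std::codecvt_utf16<char16_t, 0x10FFFF,
        std::codecvt_mode(std::consume_header | std::little_endian)>;
    if (last - first >= 2
        && (unsigned char)first[0] == 0xFF
        && (unsigned char)first[1] == 0xFE)
        return std::wstring_convert<LE, char16_t>().from_bytes(first, last);
    return std::wstring_convert<BE, char16_t>().from_bytes(first, last);
}
```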
I can reproduce this; it will almost certainly take a change in the dylib to fix.