| | |
|---|---|
| Bugzilla Link | PR20481 |
| Status | NEW |
| Importance | P normal |
| Reported by | Fred (frederic.metrich@free.fr) |
| Reported on | 2014-07-29 07:31:01 -0700 |
| Last modified on | 2017-04-26 07:14:27 -0700 |
| Version | 4.0 |
| Hardware | PC Linux |
| CC | eric@efcs.ca, llvm-bugs@lists.llvm.org, mclow.lists@gmail.com, tstellar@redhat.com, zilla@kayari.org |
| Fixed by commit(s) | |
| Attachments | |
| Blocks | |
| Blocked by | |
| See also | |
It seems that libc++ treats "consume_header" as "optionally discard a BOM that matches my configured endianness" rather than "read a BOM, if present, to determine the endianness". That is, you have to create the facet with the correct endianness in the first place for the BOM to be "consumed" at all.
The standard says:

> If (Mode & consume_header), the facet shall consume an initial header sequence, if present, when reading a multibyte sequence to determine the endianness of the subsequent multibyte sequence to be read.

i.e. the facet must be able to switch endianness at run time based on the BOM, not have it fixed by its template arguments.
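For illustration, here is a minimal sketch of the run-time dispatch that wording requires (the helper and its names are hypothetical, not libc++ internals):

```cpp
#include <cstddef>

// Hypothetical helper: inspect an initial BOM, if present, and pick the
// endianness at run time; otherwise fall back to the facet's configured one.
enum class utf16_endian { big, little };

inline utf16_endian detect_utf16_endianness(const unsigned char* p,
                                            std::size_t n,
                                            utf16_endian fallback)
{
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF) return utf16_endian::big;
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE) return utf16_endian::little;
    return fallback; // no BOM present
}
```

The following self-contained program exercises all four combinations of input endianness and facet mode: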
```cpp
#include <codecvt>
#include <iostream>
#include <locale>

int main()
{
    {
        // UTF-16BE input, facet created with default (big-endian) mode:
        // the BOM is consumed correctly.
        const char src[] = "\xFE\xFF\xAB\xCD";
        std::wstring_convert<std::codecvt_utf16<char16_t, 0x10FFFF,
            std::consume_header>, char16_t> conv;
        auto dst = conv.from_bytes(src, src + 4);
        if (dst[0] != 0xabcd)
            std::cout << "Failed to process BOM from UTF-16BE string when"
                         " expecting BE\nRead "
                      << std::hex << (unsigned)dst[0] << '\n';
    }
    {
        // UTF-16LE input, facet created with default (big-endian) mode:
        // the BOM should switch the facet to little-endian.
        const char src[] = "\xFF\xFE\xAB\xCD";
        std::wstring_convert<std::codecvt_utf16<char16_t, 0x10FFFF,
            std::consume_header>, char16_t> conv;
        auto dst = conv.from_bytes(src, src + 4);
        if (dst[0] != 0xcdab)
            std::cout << "Failed to process BOM from UTF-16LE string when"
                         " expecting BE\nRead "
                      << std::hex << (unsigned)dst[0] << '\n';
    }
    {
        // UTF-16BE input, facet created with little_endian mode:
        // the BOM should switch the facet to big-endian.
        const char src[] = "\xFE\xFF\xAB\xCD";
        std::wstring_convert<std::codecvt_utf16<char16_t, 0x10FFFF,
            std::codecvt_mode(std::consume_header | std::little_endian)>,
            char16_t> conv;
        auto dst = conv.from_bytes(src, src + 4);
        if (dst[0] != 0xabcd)
            std::cout << "Failed to process BOM from UTF-16BE string when"
                         " expecting LE\nRead "
                      << std::hex << (unsigned)dst[0] << '\n';
    }
    {
        // UTF-16LE input, facet created with little_endian mode:
        // the BOM is consumed correctly.
        const char src[] = "\xFF\xFE\xAB\xCD";
        std::wstring_convert<std::codecvt_utf16<char16_t, 0x10FFFF,
            std::codecvt_mode(std::consume_header | std::little_endian)>,
            char16_t> conv;
        auto dst = conv.from_bytes(src, src + 4);
        if (dst[0] != 0xcdab)
            std::cout << "Failed to process BOM from UTF-16LE string when"
                         " expecting LE\nRead "
                      << std::hex << (unsigned)dst[0] << '\n';
    }
}
```
This prints:

```
Failed to process BOM from UTF-16LE string when expecting BE
Read fffe
Failed to process BOM from UTF-16BE string when expecting LE
Read fffe
```
i.e. it fails "to determine the endianness of the subsequent multibyte sequence to be read"; instead it unconditionally assumes the endianness fixed by the codecvt_mode template argument.
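Until this is fixed, a user-side workaround seems possible: detect the BOM manually and construct a facet whose compile-time endianness already matches, since the matching-endianness cases above do consume the BOM correctly. A sketch (the function name is invented for illustration):

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Workaround sketch: choose the facet's endianness from the BOM ourselves,
// so consume_header only ever sees a BOM that matches its endianness.
std::u16string utf16_bytes_to_string(const char* first, const char* last)
{
    using BE = std::codecvt_utf16<char16_t, 0x10FFFF, std::consume_header>;
    using LE = std::codecvt_utf16<char16_t, 0x10FFFF,
        std::codecvt_mode(std::consume_header | std::little_endian)>;
    if (last - first >= 2
        && (unsigned char)first[0] == 0xFF
        && (unsigned char)first[1] == 0xFE)
        return std::wstring_convert<LE, char16_t>().from_bytes(first, last);
    return std::wstring_convert<BE, char16_t>().from_bytes(first, last);
}
```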
I can reproduce this; it will almost certainly take a change in the dylib to fix.