Quuxplusone / LLVMBugzillaTest

codecvt_utf16 facet with utf-16LE-bom file... #20480

Open Quuxplusone opened 10 years ago

Quuxplusone commented 10 years ago
Bugzilla Link PR20481
Status NEW
Importance P normal
Reported by Fred (frederic.metrich@free.fr)
Reported on 2014-07-29 07:31:01 -0700
Last modified on 2017-04-26 07:14:27 -0700
Version 4.0
Hardware PC Linux
CC eric@efcs.ca, llvm-bugs@lists.llvm.org, mclow.lists@gmail.com, tstellar@redhat.com, zilla@kayari.org
Hi,
It seems that the codecvt_utf16 facet is buggy.

I have a file that is UTF-16LE encoded with a BOM.
I define two facets like this:
  std::locale utf16_locale( std::locale(), new std::codecvt_utf16< wchar_t, 0x10ffff, std::consume_header > );
  std::locale utf16_localeLEBOM( std::locale(), new std::codecvt_utf16< wchar_t, 0x10ffff, static_cast< std::codecvt_mode >( 5 ) /*std::little_endian | std::consume_header*/ > );

I then define two streams like this:
std::wifstream xmlfile1;
std::wifstream xmlfile2;

And read them like this:
{
  std::istreambuf_iterator< wchar_t > eos;
  xmlfile1.imbue( utf16_locale );
  xmlfile1.open( szFilename, std::ios::in | std::ios::binary );
  std::wstring strXML( std::istreambuf_iterator< wchar_t >( xmlfile1 ), eos );
}
{
  std::istreambuf_iterator< wchar_t > eos;
  xmlfile2.imbue( utf16_localeLEBOM );
  xmlfile2.open( szFilename, std::ios::in | std::ios::binary );
  std::wstring strXML( std::istreambuf_iterator< wchar_t >( xmlfile2 ), eos );
}

With the first facet, the BOM isn't recognized or consumed, and I end up with a
strXML containing the "raw" content of the file, including the BOM!
With the second one, I end up with the correct string, as I can read it in a
text editor.

If needed, I can provide a complete package with a test file.
Thanks,
Fred
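
The behaviour the reporter expects from consume_header can be sketched without
codecvt at all. Below is a minimal, hypothetical decoder (decode_utf16_with_bom
is not a standard function) that reads the BOM to pick the byte order, which is
what the facet is supposed to do:

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of any library): decode UTF-16 bytes to
// code units, choosing the byte order from the BOM if one is present and
// defaulting to big-endian otherwise, as codecvt_utf16 does without
// std::little_endian. Surrogate pairs are not combined; this is only a sketch.
std::u32string decode_utf16_with_bom(const unsigned char* p, std::size_t n)
{
    bool little = false; // codecvt_utf16 defaults to big-endian
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE) { little = true; p += 2; n -= 2; }
    else if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF) { p += 2; n -= 2; }
    if (n % 2 != 0)
        throw std::runtime_error("truncated UTF-16 sequence");
    std::u32string out;
    for (std::size_t i = 0; i < n; i += 2) {
        std::uint16_t u = little
            ? static_cast<std::uint16_t>(p[i] | (p[i + 1] << 8))
            : static_cast<std::uint16_t>((p[i] << 8) | p[i + 1]);
        out.push_back(u);
    }
    return out;
}
```

With this rule, a BE-BOM input and an LE-BOM input decode to the same text
regardless of the decoder's default endianness, which is the behaviour the
streams above never see from libc++.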
Quuxplusone commented 7 years ago
It seems that libc++ treats "consume_header" as "optionally discard a BOM that
matches my current endianness" instead of "optionally read a BOM to determine
the endianness".

i.e. you have to create the facet with the correct endianness in the first
place for the BOM to be "consumed".

The standard says:

— If (Mode & consume_header), the facet shall consume an initial header
sequence,
  if present, when reading a multibyte sequence to determine the endianness of
  the subsequent multibyte sequence to be read.

i.e. the facet must be able to switch endianness based on the BOM, not fix it
at compile-time.
Quuxplusone commented 7 years ago
#include <locale>
#include <codecvt>
#include <iostream>

int main()
{
    {
        const char src[] = "\xFE\xFF\xAB\xCD";
        std::wstring_convert<std::codecvt_utf16<char16_t, 0x10FFFF, std::consume_header>, char16_t> conv;
        auto dst = conv.from_bytes(src, src+4);
        if (dst[0] != 0xabcd)
            std::cout << "Failed to process BOM from UTF-16BE string when"
                         " expecting BE\nRead "
                      << std::hex << (unsigned)dst[0] << '\n';
    }
    {
        const char src[] = "\xFF\xFE\xAB\xCD\0";
        std::wstring_convert<std::codecvt_utf16<char16_t, 0x10FFFF, std::consume_header>, char16_t> conv;
        auto dst = conv.from_bytes(src, src+4);
        if (dst[0] != 0xcdab)
            std::cout << "Failed to process BOM from UTF-16LE string when"
                         " expecting BE\nRead "
                      << std::hex << (unsigned)dst[0] << '\n';
    }
    {
        const char src[] = "\xFE\xFF\xAB\xCD\0";
        std::wstring_convert<std::codecvt_utf16<char16_t, 0x10FFFF, std::codecvt_mode(std::consume_header|std::little_endian)>, char16_t> conv;
        auto dst = conv.from_bytes(src, src+4);
        if (dst[0] != 0xabcd)
            std::cout << "Failed to process BOM from UTF-16BE string when"
                         " expecting LE\nRead "
                      << std::hex << (unsigned)dst[0] << '\n';
    }
    {
        const char src[] = "\xFF\xFE\xAB\xCD\0";
        std::wstring_convert<std::codecvt_utf16<char16_t, 0x10FFFF, std::codecvt_mode(std::consume_header|std::little_endian)>, char16_t> conv;
        auto dst = conv.from_bytes(src, src+4);
        if (dst[0] != 0xcdab)
            std::cout << "Failed to process BOM from UTF-16LE string when"
                         " expecting LE\nRead "
                      << std::hex << (unsigned)dst[0] << '\n';
    }
}

This prints:

Failed to process BOM from UTF-16LE string when expecting BE
Read fffe
Failed to process BOM from UTF-16BE string when expecting LE
Read fffe

i.e. it fails "to determine the endianness of the subsequent multibyte sequence
to be read"; instead, it unconditionally assumes the endianness matches the
codecvt_mode template argument.
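
Until the facet is fixed, one workaround consistent with the observation above
(create the facet with the correct endianness in the first place) is to probe
the BOM yourself and choose the facet's endianness at run time. A minimal
sketch, where the name read_utf16_file is hypothetical:

```cpp
#include <codecvt>
#include <fstream>
#include <iterator>
#include <locale>
#include <string>

// Hypothetical workaround: peek at the first two bytes to detect an LE BOM,
// then imbue a codecvt_utf16 facet whose compile-time endianness matches,
// so consume_header only has to strip a BOM it already agrees with.
std::wstring read_utf16_file(const char* path)
{
    bool little = false;
    {
        std::ifstream probe(path, std::ios::binary);
        int b0 = probe.get();
        int b1 = probe.get();
        little = (b0 == 0xFF && b1 == 0xFE);
    }
    std::wifstream in;
    if (little)
        in.imbue(std::locale(in.getloc(),
            new std::codecvt_utf16<wchar_t, 0x10FFFF,
                std::codecvt_mode(std::consume_header | std::little_endian)>));
    else
        in.imbue(std::locale(in.getloc(),
            new std::codecvt_utf16<wchar_t, 0x10FFFF, std::consume_header>));
    in.open(path, std::ios::in | std::ios::binary);
    return std::wstring(std::istreambuf_iterator<wchar_t>(in),
                        std::istreambuf_iterator<wchar_t>());
}
```

Because the facet's template endianness always matches the file's BOM here,
this reads both byte orders correctly even with libc++'s current behaviour.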
Quuxplusone commented 7 years ago

I can reproduce this; it will almost certainly take a change in the dylib to fix.