CodeCvt in locales don't work on Windows with multi-byte wide-chars

Flamefire commented 2 years ago

I see a failing test for test_ok<Char>("\xf0\xa0\x82\x8a",g("en_US.UTF-8")); // U+2008A which I traced down to Boost.Locale ultimately using std::codecvt<wchar_t, char, mbstate_t> which on VS 2019 uses _Mbrtowc under the hood which can only decode to a single wchar_t but the above would decode to a surrogate pair. Hence the decoding of that UTF-8 char fails which makes the read fail and the string read from the so-imbued file fails

Flamefire commented 2 years ago

@artyom-beilis Any comment on how to handle this? I.e. simply declare only UCS2 (i.e. UTF codepoints which can be represented in 2 bytes) as supported?

Background: _Mbrtowc is a wrapper around a call to MultiByteToWideChar which relies on the anachronistic assumption that any encoded character can be converted to a single UCS-2 code unit. This is clearly wrong and has been for many years [...]

artyom-beilis commented 2 years ago

Its strange. Because it used to work. The mbstate_t is just a place to store the state. It worked once for sure.

It should support UTF-16

artyom-beilis commented 2 years ago

It should not use std::codecvt<wchar_t, char, mbstate_t> but rather local's codecvt faced derived from it.

Something is wrong. Unofrtunatelly I don't have VC-2019 or even partially working windows dev environment. :-(

artyom-beilis commented 2 years ago

I recall several failures with this test_ok and usually they were related to iostream libraries. IIRC there were some clang issues. It was long time ago

Flamefire commented 2 years ago

It should not use std::codecvt<wchar_t, char, mbstate_t> but rather local's codecvt faced derived from it.

Yes and that is created at https://github.com/boostorg/locale/blob/9259845c6fd95ce6b8bf11e8ec5d8bdf38836c4e/src/std/codecvt.cpp#L37

which for the MS STL yields a small effectively the std::codecvt above (actually a derived class with some insignificant meta data for this issue). So we get to this class: https://github.com/microsoft/STL/blob/0349ce1540c21952733d0f2bf1dfe85f4b9cd749/stl/inc/xlocale#L1909 This uses _Mbrtowc at https://github.com/microsoft/STL/blob/0349ce1540c21952733d0f2bf1dfe85f4b9cd749/stl/inc/xlocale#L1984 And that fails for UTF-16 surrogate pairs: https://github.com/microsoft/STL/blob/20288f94e4b1319bfcf2874ba3a6157b5cc6172c/stl/src/xmbtowc.cpp#L110-L115

The issue is that they simply ignore the state. So the MS std::codecvt<wchar_t, char, mbstate_t> doesn't use UTF-16 but their UCS-2 which is basically unencoded (up to) 2-byte unicode. So codepoints above U+FFFF won't work with the std backend on MS STL

artyom-beilis commented 2 years ago

I don't understand.

I override do_in with code that handles utf16 correctly

artyom-beilis commented 2 years ago

Is this code is called: https://github.com/boostorg/locale/blob/develop/include/boost/locale/generic_codecvt.hpp#L239

Flamefire commented 2 years ago

No it is not. What happens:

g("en_US.UTF-8") in test_ok
install a codepage_facet
create_codecvt
return codecvt_bychar<wchar_t>

The custom codecvt is not used because utf_mode_ = utf8_native_with_wide as std::locale("en_US.UTF-8") works, see https://github.com/boostorg/locale/blob/9259845c6fd95ce6b8bf11e8ec5d8bdf38836c4e/src/std/std_backend.cpp#L114

Maybe the loadable(lid) used to NOT work?

artyom-beilis commented 2 years ago

Maybe the loadable(lid) used to NOT work?

Absolutely!

Under MSVC you couldn't create std::locale("en_US.UTF-8") it looks something new indeed.

Flamefire commented 2 years ago

So what would you like to do? Check to use the Windows codepage name first?

artyom-beilis commented 2 years ago

So what would you like to do? Check to use the Windows codepage name first?

I would do this, in any case of MSVC even if codepage is loadable I'd use utf8_from_wide or at least for covecvt only (in case MSVC really implemented utf-8 locales) since MSVC's UTF-8/Wide codecvt is broken so needs to be replaced.

I'm really-really grateful for an amazing job you do. You really saving this project since I just can't keep-up with all latest bells and whistles from MSVC team.

Flamefire commented 2 years ago

I'm really-really grateful for an amazing job you do. You really saving this project since I just can't keep-up with all latest bells and whistles from MSVC team.

That's why CI is so very important. It also allows comparison between versions of the code and/or compiler to see when things broke. Right now the testsuite fails pretty much everywhere and I barely know enough about the intentions of the code to guess a fix. See e.g. #69 or the runs at #76

Hence my efforts to get at least a minimum of that working. And make sure it at least compiles on all (common) platforms.

Flamefire commented 2 years ago

I double-checked everything and it seems the test is broken and expects to much. See the MS comment at https://github.com/microsoft/STL/blob/20288f94e4b1319bfcf2874ba3a6157b5cc6172c/stl/src/xmbtowc.cpp#L110-L115

// this would result in a UTF-16 surrogate pair, which we can't emit in our // singular output wchar_t, so fail. see N4750 [locale.codecvt.virtuals]/3

And the referenced part of the standard reads:

A codecvt facet that is used by basic_filebuf shall have the property that if do_out(state, from, from_end, from_next, to, to_end, to_next) would return ok, where from != from_end, then do_out(state, from, from + 1, from_next, to, to_end, to_next) shall also return ok, and that if do_in(state, from, from_end, from_next, to, to_end, to_next) would return ok, where to != to_end, then do_in(state, from, from_end, from_next, to, to + 1, to_next) shall also return ok

And a footnote reads:

Informally, this means that basic_filebuf assumes that the mappings from internal to external characters is 1 to N: a codecvt facet that is used by basic_filebuf must be able to translate characters one internal character at a time.

And for surrogates, i.e. the tests at https://github.com/boostorg/locale/blob/d1fc12426ffcc411a8ccddcf9ef6f33a2d5d0609/test/test_codepage.cpp#L127-L137, that isn't possible.

However there is a possible solution:

Note: As a result of operations on state, it can return ok or partial and set from_next == from and to_next != to.

And the char16_t codecvt from MS indeed does make use of that.

So at least we should have a testcase, checking that the above properties hold true for the Boost.Locale codecvts, e.g. https://github.com/boostorg/locale/blob/11f50e9e3c4e068c9a2cc71e78f0dd4a7c28ecd1/include/boost/locale/generic_codecvt.hpp#L239

If that is the case, we can use our codecvt unconditionally for utf-8 wchar.

Flamefire commented 2 years ago

Ok it seems you are not allowed to use a codecvt that "supports" multiple external chars In a std::filebuf according to the standard, I.e. Those multibyte utf-16 codepoints. See the comment from MS at https://github.com/microsoft/STL/issues/2788

We can do "works usually but is actually UB" and use our utf-8 codecvt or remove the tests that test codepoints larger than what fits into a single utf-x char.

@artyom-beilis your library, your decision

artyom-beilis commented 2 years ago

I think work with our codecvt - as long as it works. Many things there done according to the practice rather to the book of standard.

Assuming that the same code handles filebuf (which is reasonable to assume) and if it handles char16_t correctly no reason why wouldn't it work on wchar_t.

Unfortunately standard is horribly broken by design and it is one of the reasons Boost.Locale exists.

So I think better to install our codecvt and give the user power.

Flamefire commented 2 years ago

Assuming that the same code handles filebuf (which is reasonable to assume) and if it handles char16_t correctly no reason why wouldn't it work on wchar_t.

That is the question: Does filebuf handle char16_t correctly? I'd say no:

This is an N:M conversion facet, and cannot be used with std::basic_filebuf

Unfortunately standard is horribly broken by design and it is one of the reasons Boost.Locale exists.

It seems to make working with it hard at least.

Additionally there is this:

std::codecvt<wchar_t, char, std::mbstate_t> | conversion between the system's native wide and the single-byte narrow character sets

And "the system's native wide character set" on Windows is basically UCS-2 which excludes non-BMP characters (i.e. > U+FFFF), isn't it? Well partially at least as the WinAPI functions seem to be expecting UTF-16 (which is a superset of UCS-2) nowadays.

I'm not sure what a user may expect here and what could break in filebuf when we use our UTF-16 codecvt instead of a UCS-2 one.

Also asked on Stackoverflow and in Slack just to make sure we are not missing anything.

boostorg / locale

CodeCvt in locales don't work on Windows with multi-byte wide-chars #71