boostorg / locale

Boost.Locale
Boost Software License 1.0

[Q] Meaning of utf8_from_wide/utf8_native_with_wide #171

Closed Flamefire closed 1 year ago

Flamefire commented 1 year ago

Due to a recent issue related to the different handling of facet creation depending on utf8_native_with_wide etc. (now the enum class utf8_support), I wanted to ask for clarification:

utf8_native seems to be unused, it is checked for but never set, so it can be removed, can't it?

What is the intended difference between utf8_native_with_wide and utf8_from_wide?

The current logic seems to be: on Windows utf8_from_wide is used, otherwise utf8_native_with_wide is used; and when the requested encoding isn't UTF-8, utf8_none is used on all platforms.
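A minimal sketch of the selection rule as described above, for illustration only. The enumerator spellings, the `is_utf8_encoding` helper and the `on_windows` parameter are assumptions of this sketch and are not taken from the Boost.Locale sources; the real code would presumably decide the platform at compile time.

```cpp
#include <cctype>
#include <string>

// Hypothetical mirror of the enum discussed in this issue.
enum class utf8_support { none, native, native_with_wide, from_wide };

// Assumption: an encoding counts as UTF-8 if it reads "utf8" once
// punctuation is dropped and letters are lowercased ("UTF-8", "utf8", ...).
inline bool is_utf8_encoding(const std::string& encoding) {
    std::string norm;
    for (unsigned char c : encoding) {
        if (std::isalnum(c)) norm += static_cast<char>(std::tolower(c));
    }
    return norm == "utf8";
}

// The rule as stated in the question: non-UTF-8 gives none everywhere,
// UTF-8 gives from_wide on Windows and native_with_wide elsewhere.
// utf8_support::native is never returned, matching the "never set" remark.
inline utf8_support select_utf8_support(const std::string& encoding, bool on_windows) {
    if (!is_utf8_encoding(encoding)) return utf8_support::none;
    return on_windows ? utf8_support::from_wide : utf8_support::native_with_wide;
}
```

For example, `select_utf8_support("UTF-8", true)` yields `from_wide`, while `select_utf8_support("ISO-8859-1", false)` yields `none`.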

However, I'm confused that in numeric.cpp the utf8_*_from_wide classes are used (except for time_put), while the collator uses the *from_wide variant only on Windows.

To me it looks like either the utf8_*_from_wide classes should always be used (which might be a performance issue due to the required 2 conversions) or the standard classes are enough already.

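So questions (assuming a UTF-8 locale is requested):

Any reason not to always use the utf8_codecvt given that it should work for UTF-16 and UTF-32 wchar_ts?

Same for utf8_collator_from_wide

In which case would time_put_from_base/std::time_put_byname fail, that utf8_time_put_from_wide avoids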

And (generally): Can time_put_from_base be replaced by std::time_put_byname?

The only possible reasoning I can see is:
std::locale("foo.UTF-8") fails but std::locale("foo") or std::locale("Windows-name-of-foo") works, i.e. the standard library does not support the UTF-8 encoding and that has to be emulated.

Is this correct? In that case we would need utf8_support::none (non-UTF-8 locale requested), plus utf8_support::native and utf8_support::from_wide for when std::locale("foo.UTF-8") works and when it doesn't, respectively.
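The three-value scheme proposed here could be probed at runtime roughly as follows. This is only a sketch of the idea in the question: the names `utf8_support_proposed`, `std_locale_exists` and `probe_utf8_support` are hypothetical, and the only library behavior relied on is that the `std::locale` constructor throws `std::runtime_error` for a name it does not support.

```cpp
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical enum for the proposal; not the enum in the library.
enum class utf8_support_proposed { none, native, from_wide };

// True if the standard library can construct a locale with this name.
inline bool std_locale_exists(const std::string& name) {
    try {
        std::locale probe(name.c_str());
        return true;
    } catch (const std::runtime_error&) {
        return false;
    }
}

// none:      a non-UTF-8 locale was requested
// native:    std::locale("<base>.UTF-8") works
// from_wide: it does not, so UTF-8 has to be emulated via wide facets
inline utf8_support_proposed probe_utf8_support(const std::string& base, bool utf8_requested) {
    if (!utf8_requested) return utf8_support_proposed::none;
    if (std_locale_exists(base + ".UTF-8")) return utf8_support_proposed::native;
    return utf8_support_proposed::from_wide;
}
```

Which branch a real name such as "en_US" takes depends on the platform and the installed locales, so no particular result is assumed for it here.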

artyom-beilis commented 1 year ago

Let's start.

The selection between wide and ordinary (narrow) characters is currently done as follows:

I must admit that virtually half of the std backend is designed to handle various issues and incompatibilities between implementations, but that is what Boost.Locale is designed for.

utf8_native seems to be unused, it is checked for but never set, so it can be removed, can't it?

utf8_native is not in use because all standard libraries that support locales support wide characters as well at this point.

Technically you can remove it, but it may cover some problematic cases where wide characters are not properly supported by std::locale on some systems in the future.

So you can either mark it "for future use" or remove it. The question is what would happen if it is needed some day.

Any reason not to always use the utf8_codecvt given that it should work for UTF-16 and UTF-32 wchar_ts?

It is indeed an option: the native codecvt should provide the same functionality when it is supported, but so does utf8_codecvt.
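To illustrate why a single UTF-8 converter can serve both UTF-16 and UTF-32 wide characters, here is a small hand-written sketch. It is not the Boost.Locale `utf8_codecvt` implementation; `decode_utf8` and `utf8_to_wide` are hypothetical helpers, and validation is deliberately minimal (overlong forms and encoded surrogates are not rejected). The only difference between the two targets is whether a code point above U+FFFF is split into a surrogate pair.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Decode one code point from UTF-8 starting at s[i]; advances i past it.
inline char32_t decode_utf8(const std::string& s, std::size_t& i) {
    unsigned char lead = static_cast<unsigned char>(s[i++]);
    if (lead < 0x80) return lead;
    int extra = 0;
    char32_t cp = 0;
    if ((lead & 0xE0) == 0xC0)      { extra = 1; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
    else throw std::runtime_error("invalid UTF-8 lead byte");
    for (; extra > 0; --extra) {
        if (i >= s.size()) throw std::runtime_error("truncated UTF-8 sequence");
        unsigned char trail = static_cast<unsigned char>(s[i++]);
        if ((trail & 0xC0) != 0x80) throw std::runtime_error("invalid UTF-8 trail byte");
        cp = (cp << 6) | (trail & 0x3F);
    }
    return cp;
}

// Char is a 2-byte type (UTF-16 output) or a 4-byte type (UTF-32 output).
template <typename Char>
std::basic_string<Char> utf8_to_wide(const std::string& s) {
    std::basic_string<Char> out;
    for (std::size_t i = 0; i < s.size();) {
        char32_t cp = decode_utf8(s, i);
        if (sizeof(Char) == 2 && cp > 0xFFFF) {
            // UTF-16: encode as a high/low surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<Char>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<Char>(0xDC00 + (cp & 0x3FF)));
        } else {
            out.push_back(static_cast<Char>(cp));
        }
    }
    return out;
}
```

Instantiating with `wchar_t` picks the right branch on each platform, since `sizeof(wchar_t)` is 2 on Windows and typically 4 elsewhere.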

Same for utf8_collator_from_wide

The collator from wide is used only on Windows with MSVC, when no UTF-8 locale is supported. A native narrow-character UTF-8 locale is much more efficient and usually works well.

In which case would time_put_from_base/std::time_put_byname fail, that utf8_time_put_from_wide avoids

Basically, if the std char locale supports time_put, it is much more efficient than the from-wide variant, which is why char is preferred. From what I recall there were no failures, because time_put is much better designed than numpunct/moneypunct.
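For reference, using the narrow `std::time_put<char>` facet of a locale directly looks roughly like this. The helper `month_name` is hypothetical and only sketches the "native char" path: no conversion to or from wide strings is involved.

```cpp
#include <ctime>
#include <iterator>
#include <locale>
#include <sstream>
#include <string>

// Format the full month name (strftime-style 'B') with the narrow
// time_put facet of loc. month_index is 0-based, as in std::tm::tm_mon.
inline std::string month_name(const std::locale& loc, int month_index) {
    std::tm t{};
    t.tm_mon = month_index;
    t.tm_mday = 1;
    std::ostringstream os;
    os.imbue(loc);  // some implementations read locale data from the stream
    const std::time_put<char>& tp = std::use_facet<std::time_put<char>>(loc);
    tp.put(std::ostreambuf_iterator<char>(os), os, ' ', &t, 'B');
    return os.str();
}
```

With the classic "C" locale this returns the English month names; with a named UTF-8 locale the result depends on what the platform provides.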

Separators in the numeric facets are defined as a single char rather than a string, while time formatting naturally uses strings for month and weekday names, so for time formatting the native facets are preferred.
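Because `std::numpunct<char>` returns each separator as a single `char`, a separator that needs more than one byte in UTF-8 (for example NBSP, U+00A0, encoded as 0xC2 0xA0) cannot be expressed in the narrow facet. A rough sketch of deriving the narrow separators from wide ones follows. The class name `numpunct_from_wide_sketch` is hypothetical, and the fallback for other non-ASCII separators (disabling grouping) is an assumption of this sketch, not necessarily what the library does.

```cpp
#include <locale>
#include <sstream>
#include <string>
#include <utility>

class numpunct_from_wide_sketch : public std::numpunct<char> {
public:
    numpunct_from_wide_sketch(wchar_t wide_sep, wchar_t wide_point, std::string grouping)
        : sep_(','), point_('.'), grouping_(std::move(grouping))
    {
        if (wide_sep == wchar_t(0xA0)) {
            sep_ = ' ';                     // NBSP: substitute a plain space
        } else if (is_ascii(wide_sep)) {
            sep_ = static_cast<char>(wide_sep);
        } else {
            grouping_.clear();              // assumption: drop grouping entirely
        }
        if (is_ascii(wide_point)) point_ = static_cast<char>(wide_point);
    }

protected:
    char do_thousands_sep() const override { return sep_; }
    char do_decimal_point() const override { return point_; }
    std::string do_grouping() const override { return grouping_; }

private:
    static bool is_ascii(wchar_t c) { return static_cast<unsigned long>(c) < 0x80; }
    char sep_;
    char point_;
    std::string grouping_;
};

// Format value with groups of three digits, using the given wide separator.
inline std::string format_grouped(long value, wchar_t wide_sep) {
    std::ostringstream os;
    os.imbue(std::locale(std::locale::classic(),
                         new numpunct_from_wide_sketch(wide_sep, L'.', "\3")));
    os << value;
    return os.str();
}
```

In a real backend the wide separators would come from `std::numpunct<wchar_t>` of the underlying locale; here they are passed in directly so the sketch does not depend on installed locales.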

An example where num_put from wide is better is the NBSP character, which I can identify and substitute with a space,