Open salvoilmiosi opened 3 years ago
Because std::numpunct
is broken by design.
I explain. It shouldn't ever return character but a string. For example in many locales thousands separator is NBSP character and it can't be represented as char
in UTF-8. There are some other cases.
Additionally std::locale
never modifies default locale and keeps it C locale since it can easily break some application (writing to CSV for example) and if you want to format a localised number you tell it explicitly
Also the posix backend does implement a subclass of std::numpunct and ignores all outputs that have a size>1, reverting decimal_point to dot and thousand sep to comma. Why can't it be implemented the same way in icu? I have quickly written a mockup and it works:
template<typename CharType>
struct icu_numpunct : public std::numpunct<CharType> {
typedef std::basic_string<CharType> string_type;
public:
icu_numpunct(icu::Locale const &loc)
{
boost::locale::impl_icu::icu_std_converter<CharType> cnv("UTF-8");
UErrorCode err;
icu::DecimalFormatSymbols syms(loc, err);
decimal_point_ = cnv.std(syms.getSymbol(icu::DecimalFormatSymbols::kDecimalSeparatorSymbol));
if (decimal_point_.size() > 1) {
decimal_point_ = CharType('.');
}
thousands_sep_ = cnv.std(syms.getSymbol(icu::DecimalFormatSymbols::kGroupingSeparatorSymbol));
}
protected:
virtual CharType do_decimal_point() const
{
return *decimal_point_.c_str();
}
virtual CharType do_thousands_sep() const
{
return *thousands_sep_.c_str();
}
private:
string_type decimal_point_;
string_type thousands_sep_;
};
Because in POSIX backend I tried to do the best I can given the tools (sine I didn't format the number on my own). If you want to use some ICU specific API just use ICU directly.
@artyom-beilis I think it is still a good idea to implement this:
in POSIX backend I tried to do the best I can given the tools
--> The above mockup shows that "given the tools" (ICU here) one can implement that with minimal effort in a "good enough" way.
IMO simply returning ", ." is wrong, if that doesn't match what is actually used at all. Returning wrong results is often problematic, so I'd expect an exception or similar here instead if the API can't be fulfilled. Implementing the ICU numpunct with best effort to improve that does sound right to me and I'd be willing to review and ultimately accept a PR for that.
IMO simply returning ", ." is wrong
It is right and I explain why.
Lets look on the code:
std::cout << 1235.63 << " " << boost::locale::as::number << 1235.63 << std::endl
Will print lets say in ru_RU.UTF-8 locale following:
1234.67 1.245,67
Note: numpunct still refer to standard facet that is used for output. If you want to format localized number you use boost::locale::as::number
that uses difference facet and it does not use std::numpunct
The problem is that some external libraries rely on std::numpunct
for number formatting and parsing.
The problem is that some external libraries rely on
std::numpunct
for number formatting and parsing.
Then they do plain wrong since std localization is broken, at point that it actually generated wrong UTF-8 because shown only 1st byte of multi-byte sequence.
I have no reason to fix something broken by design. If you want to format number it is also wrong to use facet because number formatting is much more than just separator for example ١٢ is very valid and good number... Are you supporting it as well?
Why is std::numpunct implemented in all localization backends except icu then? I get it you're trying to use whatever tools the language gives you, but this is inconsistent design. Why don't we just add a boost::locale::numpunct that returns std::string?
This is what I have to do to make decimal_for_cpp compatible with boost.locale, is this "broken by design"?
#include <boost/locale.hpp>
namespace bl = boost::locale;
int main(int argc, char**argv) {
const char *lang = "";
if (argc > 1) {
lang = argv[1];
}
std::locale loc = bl::generator{}(lang);
std::cout.imbue(loc);
std::cout << bl::as::number << 1000.5 << '\n';
auto custom_lbm = bl::localization_backend_manager::global();
#ifdef _WIN32
custom_lbm.select("winapi");
#else
custom_lbm.select("posix");
#endif
auto numloc = bl::generator{custom_lbm}(lang);
auto &numfacet = std::use_facet<std::numpunct<char>>(numloc);
std::cout << "Thousand: " << numfacet.thousands_sep() << '\n';
std::cout << "Decimal: " << numfacet.decimal_point() << '\n';
return 0;
}
It is right and I explain why.
I have to agree to @artyom-beilis here. Check this code:
#include <boost/locale.hpp>
#include <iostream>
int main() {
boost::locale::generator gen;
auto loc = gen("it_IT.UTF-8");
{
auto &facet = std::use_facet<std::numpunct<char>>(loc);
std::cout << facet.thousands_sep() << ' ' << facet.decimal_point() << '\n';
std::cout.imbue(loc);
std::cout << 1234.56 << '\n';
//, .
//1234.56
}
loc = std::locale("it_IT.UTF-8");
{
auto &facet = std::use_facet<std::numpunct<char>>(loc);
std::cout << facet.thousands_sep() << ' ' << facet.decimal_point() << '\n';
std::cout.imbue(loc);
std::cout << 1234.56 << '\n';
//. ,
//1.234,56
}
So from what I can tell this is fully correct. the sep and decimal_point are shown as used.
bl::as::number
may use something different but it doesn't claim to use std::numpunct
, and if it did, then it would be a documentation bug.
@artyom-beilis Is there a std::numpunct
equivalent for bl::as::number
?
Why is std::numpunct implemented in all localization backends except icu then? I get it you're trying to use whatever tools the language gives you, but this is inconsistent design.
This is good point.
The issue that I shouldn't modify std::numpunct
at all and this is I agree is inconsistency it should be C backend regardless of the locale by default.
https://github.com/boostorg/locale/blob/develop/src/util/numeric.hpp#L119
The special case should be all others and not posix
. In any case just to show you an example of what happens if you try to use std::locale
#include <iostream>
#include <locale>
int main()
{
std::cout.imbue(std::locale(""));
std::cout << 1456.34567 << std::endl;
}
The result is:
LANG=ru_RU.UTF-8 ./a.out
1�456,35
Woops, yes: Check this:
auto custom_lbm = boost::locale::localization_backend_manager::global();
custom_lbm.select("posix");
boost::locale::generator gen(custom_lbm);
auto loc = gen("it_IT.UTF-8");
{
auto &facet = std::use_facet<std::numpunct<char>>(loc);
std::cout << facet.thousands_sep() << ' ' << facet.decimal_point() << '\n';
std::cout.imbue(loc);
std::cout << 1234.56 << '\n';
}
It returns
. ,
1234.56
And that IS wrong, isn't it? --> With ICU backend it is correct, with POSIX it is broken
std::cout << 1234.56 << '\n';
you need to use boost::locale::as::number for that kind of formatting.
you need to use boost::locale::as::number for that kind of formatting.
Not if I specifically want to use the std formatting. I.e. pretend to be some component which does not know about Boost.Locale
Actually even for C locale thounthands separator is "," but there is no grouping..
The bug/inconsistency is that I can for example give decimal_separator = "," with de_DE locale on POSIX/WIN/STD backends while it should be "."
Just to warn the participants... I have no idea when I'll be able to fix it. Since it isn't that trivial.
So I suggest to look for a workaround before and not wait to fix :-(
Since it isn't that trivial.
While working at this I now understand why. And the inconsistencies bug me, not sure how to resolve them:
numpunct
, see #174as::number
or as::posix
to switch between localized and classic formatting/parsing, but nothing exists for true/false localizationMaking the std::numpunct
facet return the "classic" values resolves 1. But now there is no good way to return localized values.
One solution is to have a fully separate boost::locale::numpunct
returning localized values as strings instead of chars avoiding the encoding issue and having a classic/posix-std::numpunct
. This matches the use case described in https://github.com/boostorg/locale/issues/174#issuecomment-1564761741 / the documentation
However it breaks existing "workarounds" such as the one described in https://github.com/boostorg/locale/issues/64#issuecomment-873566976 and makes it impossible to use a localized input/output in third-party libs.
Moreover there is no existing solution when/how localized boolalpha input/output should be used.
And in more general: There is currently no way to have third-party libraries (even Boost.LexicalCast) use localized input/output
So regarding the doku example (writing a CSV file) and related: A library implicitly expecting classic-locale but not actually setting it may as well be considered broken.
So idea:
as::posix
/as::number
: as::localized
as::posix
or as::localized
when nothing is specified for a streamas::number
will format bools as numbers (likely breaking change), while as::localized
will use the current translated name and as::posix
will use "true"/"false"boost::locale::numpunct
facet using stringsThis is related to #235: std::numpunct
is not defined for the ICU backend so using std::num_put/num_get
with those locales will use the C locale. This happens when the ICU num_put/num_get
implementations uses the fallback which happens for:
uint64_t
output with numbers larger than INT64_MAX
The 2nd case can be ignored for this.
So while we do definitely want a C/Posix-locale numpunct
for the ICU backend in the first case, we actually would want a custom numpunct
for the fallback 3 & 4
There could be a problem with parsing a number written before: The posix locale doesn't use a grouping, hence no thousand separator. If we try to parse a locale formatted number with the fallback it might fail if the locale used grouping.
So only case 4 can be problematic: If we write using a grouping but have an error while parsing using the fallback C locale parser will fail the parsing.
Case 3 is basically what happens in #235: Such numbers get formatted using the C locale although it might be possible to use locale specific formatting when the separators are single-byte. Note that the approach of using a replacement char doesn't work for the ICU backend as e.g. an uint64_t
written with a replacement char cannot be parsed back by the ICU parser class which expects the locale specific separators.
This code always outputs ", ." regardless of locale. Right now the only way to extract std::numpunct information is to format an arbitrary number to a string and extract its, say, second and sixth character, and that's not a very efficient implementation. I would suggest creating a std::numpunct subclass in icu/numeric.cpp that extracts that information from the
icu::DecimalFormatSymbols
class usinggetSymbol