Open lgritz opened 7 years ago
I updated it slightly with a couple spots where I realized I needed to transfer locale settings from one stream to another.
By the way, it's perfectly acceptable for you to reject this and take the stance that tinyformat doesn't want this complication or any possible performance penalty, and so is just always going to use the global native locale, so be it, and that apps who are concerned about that ought to just set the locale globally. But that makes it very difficult for people writing libraries (who want to use tinyformat internal to the library) who don't think that the library internals have any business either assuming or changing anything about the global state of any apps that use them.
I haven't used locale in production code before today, either! Meaning, I always naively assumed tinyformat would print the same for everybody as it did for me, "123.45". And hence the problem -- a Swedish OSL user had his system locale set to "sv_SE.UTF-8" and it was introducing lots of subtle problems related to numeric literals in his OSL source code. (His immediate bug was related to parsing, but the tinyformat side would have messed up writing the .oso file and many other things.)
I don't have strong feelings about the naming conventions. If you want format()
to always obey the global locale, and have a separate cformat()
(format_c
? something else?) to be hard-coded for classic, that's ok by me. I don't directly expose tinyformat, it's all wrapped by my functions to it won't directly affect my users. This patch is the way I did it in the tinyformat that's embedded in oiio, but whatever you decide for the baseline, I will re-sync with you and adhere to your convention for the tinyformat calls themselves.
BTW, I agree that for the string case, there are potentially two use cases: force C locale, or use the native/global one. I really struggled with this issue myself, but in the end decided (for my wrapper functions) that most people are ignorant about locales and expect the "C" convention and will be surprised to get anything different if they inadvertently use the default calls. The only people who want locale-dependent formatting are the ones intentionally using internationalization, and they can be trusted with the burden of using extra arguments and specially-named functions. That was my rationale, anyway.
most people are ignorant about locales and expect the "C" convention and will be surprised to get anything different if they inadvertently use the default calls
I do buy this reasoning, it seems completely sensible. The thing that makes me pause is that naive use of the stream versions of format
should also behave the same as the string version. Eg, it's pretty common to open a file stream and use tfm::format
to output directly to it.
I suppose we could change all versions of format
to use the C locale, and just provide a format(locale, stream, ...)
overload for people who want to do otherwise. (Or an lformat()
to use the already imbued stream locale, which might be more efficient. I'm not sure what efficiency concerns there are here.)
I'd be for the "all existing versions use C locale, new funcs for explicit", I think that's consistent and probably in line with what 99% of users think is really happening (though they are wrong).
But then I'm not 100% sure how to handle the logistics of the "C" versions that write to existing streams. Does each of those functions, then, need to store the stream's original locale, change it to C, output, then restore? For that matter, is ostream::imbue() thread safe? And will it cause even more chaos if a stream that is everywhere else "localized" suddenly has individual writes (that used tinyformat) adhering to different conventions?
It all hurts my head. I'm fine with us all sleeping on this for a few days to let it percolate, there is not really any urgency to decide.
The good news is that MOST technical users will never see a change one way or the other, since they already have their global locale set to the "C" equivalent -- so none of these choices is likely to change things for anybody who isn't already living on one of those edge cases that's currently broken in some way.
Yes - that's all I was thinking - all stream based format
versions would need to save and restore the original locale. It's certainly pretty ugly internally, though maybe not so different from how I deal with format flags which have the same kind of state change and restore performed on the stream.
Which thread safety issues are you particularly thinking about? The locale itself would be const (I assume), and the use of the stream/streambuf would already be non-threadsafe as per usual. So at a guess it seems fine to me. Could easily be wrong though!
I'm certainly happy to let this percolate for a few days before a final decision.
I didn't have any thread safety issues that I knew about for sure... just musing about unknowns, but I guess you're right, we're already accepting that tinyformat is changing (and hopefully restoring) persistent state about the stream.
I'm coming around to seeing the wisdom of declaring that all the "default" functions should be consistent with each other and use C locale -- whether making a string or outputting to a stream -- and have an additional version that takes an explicit locale if you want anything else (including passing stream.getloc() to cause the output to conform to the stream's previous locale). That has the nice property of ensuring that
format (out, "%g", floatvar);
and
out << format("%g", floatvar);
are 100% identical, regardless of any prior state of the stream. If you want something else, we'll have a
format (out, loc, "%g", floatvar);
???
As a contrarian sort of argument, I think that tinyformat should continue defaulting to the global or stream-specified locale. That's the normal and well established behaviour, rightly or wrongly.
So I see the problem more in terms of a convenient way to use "C" locale for string formatting specifically. Which brings me back to the cformat
previously mentioned.
I'd be inclined to explicitly set the global locale for all CLI tools via std::locale::global. This applies to printf
, std::ostream
as well as tfm::format
.
That leaves a more complicated reality for GUI applications that likely depend on not having the global locale forced to "C" classic. In light of that -- mass-migrating code from cformat
seems justifiable, and at least more explicit about the contract.
Unfortunately I've never had cause to use a locale in production code, so if any of the watchers of this repo have experience and would like to chime in that would be useful.
So let's say you had a library that was concerned with generating valid GLSL source code, being linked into a pointy-clicky internationalised GUI application that happened to have some widgets using GLSL for their shaders. It's not desirable for the GLSL formatting of numbers to be locale-dependent. But the UI portions of the application are required to be localised into Japanese, or whatever.
In this situation the solution was to actually eliminate string formatting (printf and std::iostreams) from the GLSL translating library, along with the locale dependency. Removing the format string was considered a security enhancement - at the time it was also being used as part of the web browser Java stack. No more printf, no more varargs, no more worries. (pre C++11)
So what's the verdict, @c42f ? I think our choices are:
Abandon this PR, you don't want to deal with it. PRO: easy for you. CON: leaves the problems.
Accept the PR as-is --- string formats "C" locale, stream output honors the stream's locale, both have versions that take an explicit locale parameter. PRO: work is done. CON: still admittedly a little weird.
Default varieties of format() always output C locale (regardless of global or stream settings), but also have versions that take an explicit locale parameter. PRO: probably what most users want, hard to screw up for persistent data. CON: Inconsistent with printf and stream I/O.
format() to stream always honors stream locale, format() to string always honors global locale, and a new cformat() is added that always uses C locale. In this case, maybe none need to take an explicit locale parameter? (Would you ever want a locale that is neither the native global one, nor "C"?) PRO: consistent with printf/sprintf. CON: most users probably want C locale always, so why make them jump through hoops, make the minority who are formatting for localized UI be the ones to do extra work and use special functions?
Chris Kulla @fpsunflower has been on vacation all week, but I'd also like to get his opinion before we make any final decisions. He's usually good at spotting any sloppiness in the decisions I'm making.
@lgritz Out of curiosity, are you primarily in a CLI context, or a GUI context?
Both!
My use of tinyformat is in two software packages, and in both cases their heart is a library that's used from inside other applications (used by wide range of commercial, proprietary, and other open source projects), but both packages also contain CLI tools that are extensively used in their own right.
Setting "C" locale is trivial for the CLI tools, but does nothing for the library internals.
Interestingly, one of my projects where I use tinyformat is a programming language, so in that case, it's really crucial that the parsing and generating of output (text representation of floating point numbers, in particular) be canonical rather than locale-dependent. But the language JITs from the library, inside other apps that have extensive UI that may be localized, so I can't count on the global locale being "C".
Right thanks for the ideas, I think the design tradeoffs are becoming clearer. To me it seems that:
With (1) in mind, I think it would be most useful to have two sets of functions
std::locale::global()
std::local::classic()
The difficult decision is then which of these should be called format()
. (3) and (4) argue that format
should use the classic locale, with another function using the locale from the stream state. Some naming ideas - lformat()
formatl()
format_loc()
formatloc()
for "localized format" or "locale format" or "format with localization"?
I'm currently of the opinion that the unadorned format() should always use "C" locale. It seems nearly foolproof, your points 2-3 are very compelling.
I agree that the main use cases are "always C" and "use the global" and both of those should be simple, without having to pass an explicit locale. But it's so easy to also have the version that takes an explicit locale, I'd be inclined to keep that flexibility available in case somebody needs it. It does not compromise the design or simplicity of either of the other two varieties and is truly just a few lines of code.
Of your choices, I like format_loc()
best, but I don't feel very strongly about it (I wrap your functions in other names anyway). It's a little clunky, but perhaps in a good way -- we want people to be very deliberate about when they use it.
Wow, that world map of decimal separator use is a bit of a horror show. I didn't realize the countries were split close to half and half on this decision. I suppose the cost of certain internationalization bugs might be quite severe - for an obvious example, the difference between $100,000
and $100.000
- at least in Australia the ,
is still commonly used as a thousands separator. Clearly everyone should just give up and adopt the Arabic decimal separator ;-)
But it's so easy to also have the version that takes an explicit locale ...
True, though I'm kind of more inclined to add this later if someone actually needs it. Personally I think I'd be much more likely to use the global locale if I wanted locale specific formatting.
At this time, I have no objection to just having the two common options (C and global), and we can add the other if we want it later.
Uh oh. I've got quite bad news about performance. Before we go further I hacked up the speed tests - just on master - so that we save and restore the locale in printf
. Before the changes:
$ make speed_test
g++ -Wall -Werror -std=c++11 -DTINYFORMAT_USE_VARIADIC_TEMPLATES -O3 -DNDEBUG tinyformat_speed_test.cpp -o tinyformat_speed_test
running speed tests...
printf timings:
real 1.23
user 1.22
sys 0.00
iostreams timings:
real 1.75
user 1.75
sys 0.00
tinyformat timings:
real 2.10
user 2.10
sys 0.00
boost timings:
real 8.77
user 8.76
sys 0.00
Applying the following patch
$ git diff
diff --git a/Makefile b/Makefile
index 6cef5f5..09417ea 100644
--- a/Makefile
+++ b/Makefile
@@ -38,7 +38,7 @@ tinyformat.html: README.rst
rst2html.py README.rst > tinyformat.html
tinyformat_speed_test: tinyformat.h tinyformat_speed_test.cpp Makefile
- $(CXX) $(CXXFLAGS) -O3 -DNDEBUG tinyformat_speed_test.cpp -o tinyformat_speed_test
+ $(CXX) $(CXXFLAGS) $(CXX11FLAGS) -DTINYFORMAT_USE_VARIADIC_TEMPLATES -O3 -DNDEBUG tinyformat_speed_test.cpp -o tinyformat_speed_test
bloat_test:
@for opt in '' '-O3 -DNDEBUG' ; do \
diff --git a/tinyformat.h b/tinyformat.h
index 85a22c1..b5e1c40 100644
--- a/tinyformat.h
+++ b/tinyformat.h
@@ -975,7 +975,9 @@ std::string format(const char* fmt, const Args&... args)
template<typename... Args>
void printf(const char* fmt, const Args&... args)
{
+ std::locale loc = std::cout.imbue(std::locale::classic());
format(std::cout, fmt, args...);
+ std::cout.imbue(loc);
}
template<typename... Args>
gives the following
$ make speed_test
g++ -Wall -Werror -std=c++11 -DTINYFORMAT_USE_VARIADIC_TEMPLATES -O3 -DNDEBUG tinyformat_speed_test.cpp -o tinyformat_speed_test
running speed tests...
printf timings:
real 1.22
user 1.22
sys 0.00
iostreams timings:
real 1.76
user 1.76
sys 0.00
tinyformat timings:
real 3.79
user 3.36
sys 0.42
boost timings:
real 8.53
user 8.53
sys 0.00
It's a bit of a performance disaster.
I was puzzled by the system time. Running strace
reveals that the stream is flushed every time imbue
is called. I think this is the real culprit, as inserting a simple std::cout::flush
into printf results in similar performance degradation in the speed test.
Dammit, there's no winning on this.
What about the string versions? Is there any performance difference?
So, potential strategies to mitigate perf issues:
Default behavior is always to follow the global/stream locale, with a format_c()
that forces C locale and/or format_loc()
that takes an explicit locale.
Default C locale for strings, stream locale for stream output, and overrides for both.
Yeah, it's pretty awful. Taking this into consideration, the best option seems like it might be a new family of cformat()
/ format_c()
functions which force C locale for both string and streams. Or maybe just a single new function cformat()
for the string version as the stream versions already provide locale control (albeit not perhaps what people naively expect).
For the string version, the cost might be somewhat mitigated by the other stuff it's doing, namely allocating string buffers, etc, and flushing doesn't involve a syscall to writev()
... actually so this may be ok. Needs testing (later).
It's all surprisingly messy, and does make me wonder whether locale support in the standard library was a bit of an afterthought.
If using C locale makes the string formatting slow, then I think the only choice is to leave the original format() to use global (for strings) and stream locales, and add format_c/format_loc when people have a specific preference and are willing to pay the cost. (My choice 1 a couple messages ago.)
If the string formatting is the same speed either way (i.e., if the actual stream flush on a real file or console stream is the expensive part), then my choice 2 from earlier is still a possibility, if you want the string format() to do what most users expect and have the special call for locale-intentioned user.
I am cursing the authors of the standard library for this. It's so obvious that locales should have been opt-in on each call, with the default calls being locale-independent.
You want to make a final call on this, Chris? I don't have an especially strong preference, but I'd like to get it settled one way or the other so I can figure out what to do in my downstream projects and not have to change them a second time.
I'm happy to revise my submission to match whatever decision you make.
On the OIIO side, we've made the decision not to expose the location-explicit API calls, but instead to simply declare that since OIIO is all about file I/O at its core, it must be consistent across all users and platforms, therefore all of its numeric input and output are "C" locale.
I think that this is probably not the right decision for tinyformat. My current recommendation is that you make the default (format()
) be fixed to C locale, and add a second format_loc()
that outputs using the native locale, and I think that will cover you. (Alternately, format could use native, and format_c could use C locale.) Let me know how you want to go, for any of these plans, and I'll revise this PR accordingly.
@c42f Poke... let me know how you'd like me to revise this.
Sorry Larry, I've been dithering on making a decision on this because there's no obvious nice choice. But it's still worth making a choice. I'd like to do a little more benchmarking, but my current feeling is that the std library forces us into leaving the family of format
functions as they are for performance and consistency. This means introducing a utility version format_c
which always uses the classic C locale. format_c
should have a string version, and could also have versions which format to a stream for convenience (though one should arguably set the locale on the stream in those cases instead).
As a side note - it's always been my intention to reproduce printf
as closely as possible so that using tinyformat is just a function rename. So letting a stream based setting control the locale is at least consistent there.
Coincidentally I've just had a nice PR in #45 which implements posix argument reordering, again for localization related reasons. @jrohel - since you're apparently in the process of dealing with localization, I'd be interested if you can share your experience as it relates to this PR, if you have some.
Agreed. No performance implications and opt-in for forcing the locale. Sounds good to me.
I've updated the PR with a much less ambitious version. All existing functions are unchanged, but I added a format_c, printf_c, printfln_c that force C locale, even if a little more expensive.
@c42f OK, my experiences: I use tinyformat in a localized application. That means translated texts and localized format of numbers, etc. My experience with tinyformat is good. I only use this my extension https://github.com/c42f/tinyformat/pull/45. Nothing more. Please, do not change default behaviour.
If I set global C++ locale (std::locale::global) then I assume tfm::format(const char* fmt, const Args&... args) will use it.
If I set std::cout locale (std::cout.imbue) then I assume tfm::printf and tfm::printfln will use it.
If I want to produce machine readable text output then I set appropriate format parameters (imbue, ...) on the output stream. In another cases I need more languages at the same time (eg. text translator). Than I set output stream with specific parameters for each language. Machine readable text is only one of the languages. For example in binary world we have similar problems. Instead of lacale we must be aware eg. of endianness.
If you want you can add another function "format_c()" or more general function "format_loc(locale, ...)" - first parameter will be locale. "format_c()" then will be only a special case of "format_loc()". But once again. Please, do not change default behaviour. I think that it behaves logicaly now.
Without all this, you will get situations where format("%g",123.45) might return "123,45" if it happens to be running on a machine with locale "fr_FR", for example. If that is written to an output file that is later read on a computer using a locale with '.' as the decimal mark, it could mis-parse as "123.0", leading to hilarious and tragic results.
It seems like the safe thing is to force the string-returning varieties of tinyformat to use classic locale ('.' decimal), with an optional means for the odd users to purposely specify a different (or native) locale. It also seems that the best policy for the stream-appending varieties of format() is to use the local already associated with those streams (C++ lets each stream have its own locale).
format() to an existing stream (including cout) is now documented to honor the stream's existing locale (as it always did).
format() that returns a string now will always format the string using the classic "C" locale, in particular meaning that floating point numbers will use '.' as the decimal mark, regardless of the "native" locale set by the OS or process that may have an unexpected choice of decimal point.
Add a new variety of format() that returns a string, but which takes an explicit locale, for cases where you don't want the "C" locale (for example, when generating strings that will appear in a UI that you want localized properly). You can make this variety use the current locale by passing std::locale() as the first argument (the default constructor of std::locale just grabs the global process locale).