Closed NN--- closed 3 years ago
Your implementation of to_hex
is not equivalent because it always produces unnecessary leading zeros. That said fmt's hex formatter is indeed quite suboptimal but it's easy to optimize. It's just a matter of adding a fast path in https://github.com/fmtlib/fmt/blob/17fba753c45b75848f0281c64be89a908275e7f8/include/fmt/compile.h#L537-L540 and using format string compilation. PRs are welcome.
Fortunate, easy to compare, 'cause it is 10 times slower than sprintf: https://godbolt.org/z/cPqMz3
@BuzaL Interestingly, if I use FMT_COMPILE
and use -std=c++17 -O3 -flto
, then fmt::format
is faster than snprintf
by roughly 3 time.
As I see, its an issue with 7.x.x. The 6.2.1 still faster then reserve + append, but 7.0 and trunk are slow with ANY formatting. That means something went wrong the on-the-fly format compiler.
Looks like the regression is specific to Compiler Explorer, most likely the installed version is compiled with optimizations disabled.
Confirmed in https://github.com/fmtlib/fmt/issues/2153#issuecomment-786302075. Don't use Compiler Explorer for benchmarking =). Also don't use system clock.
I've landed a few formatting optimizations that improve hex formatting perf among other things.
With these changes, format string compilation and adding size computation to the manual method for consistency
#include <string>
#include <iostream>
#include <chrono>
#include <fmt/compile.h>
constexpr char hexchar[] = "0123456789ABCDEF";
template <typename T>
std::string to_hex(T value)
{
constexpr size_t sz = sizeof(value) * 2;
std::string buffer(sz, L'\0');
int size = 0;
for (int i = 0; i < 2 * sizeof(value); i++) {
int digit = (value >> 4 * (sz - 1 - i)) & 0xf;
buffer[i] = hexchar[digit];
if (digit != 0) size = digit;
}
buffer.resize(size + 1);
return buffer;
}
template <typename T>
std::string fmt_to_hex(T value)
{
return fmt::format(FMT_COMPILE("{:X}"), value);
}
int main()
{
int sum = 0;
auto before = std::chrono::steady_clock::now();
for (int i = 0; i < 100000000; i++)
{
auto a = to_hex(i);
sum += a.length();
}
auto manual = std::chrono::steady_clock::now() - before;
before = std::chrono::steady_clock::now();
for (int i = 0; i < 100000000; i++)
{
auto a = fmt_to_hex(i);
sum += a.length();
}
auto fmt = std::chrono::steady_clock::now() - before;
std::cout << "Manual: " << std::chrono::duration_cast<std::chrono::milliseconds>(manual).count() << std::endl;
std::cout << "Fmt: " << std::chrono::duration_cast<std::chrono::milliseconds>(fmt).count() << std::endl;
return sum;
}
we can get comparable performance:
$ g++ test.cc -DNDEBUG -O3 -I include src/format.cc -std=c++17 -flto
$ ./a.out
Manual: 1099
Fmt: 867
Note that the results are not very reliable, I recommend using a benchmarking framework instead.
I prepared a bit different still fitting benchmark of hex formatting.
It doesn't check fmt::format()
or to_hex
from the top message that returns std::string
, as well as std::stringstream
and std::to_string
, because all these approaches introduce additional allocations that may influence on actual formatting. But it does check formatting to the existing buffer, so there are several versions of fmt::format_to
, std::to_chars
, and custom code (it copied from {fmt} but doesn't introduce any complexity that format_to
does).
There are tests named FMTCustomFormatter*
which have a slightly modified version of fmt/compile.h
that provides a custom int_formatter
class for formatting integer types. It's the initial test of my idea of formatter being specific for passed specs, it was previously described here - https://github.com/fmtlib/fmt/issues/2078.
Let me paste some of the results here (full results available in README.md, also you can check it on your PC):
Benchmark | Time |
---|---|
CustomCode | 2.93 ns |
FMTCompileMaster | 12.5 ns |
FMTCompileSeparateFormatterNOINLINERemovedFromFill | 3.66 ns |
ToChars | 2.57 ns |
Thanks for the benchmark. I wonder why the results for FMTCompileMaster
and FMTCompileMasterNOINLINERemovedFromFill
are slightly different in https://github.com/alexezeder/fmt_bnchmrk if fill is not called? Maybe a measurement error or some difference in codegen?
I wonder why the results for FMTCompileMaster and FMTCompileMasterNOINLINERemovedFromFill are slightly different in https://github.com/alexezeder/fmt_bnchmrk_legacy if fill is not called? Maybe a measurement error or some difference in codegen?
Oops... I made a mistake at the last state of preparing this repo and FMTCompileMasterNOINLINERemovedFromFill
actually has the same code as FMTCompileMaster
. I updated the benchmark repo, but they are still really close, almost equal.
There is a difference in codegen of course, but looks like as long as fill()
function is not used this difference does not really affect performance. So it's probably a measurement error, something like ±0.3 ns which is not bad on my opinion.
EDIT: repo moved from https://github.com/alexezeder/fmt_bnchmrk to https://github.com/alexezeder/fmt_bnchmrk_legacy
Regarding FMTCompileSeparateFormatterNOINLINERemovedFromFill
, it is a nice speedup but I wonder if we can avoid replicating the format specifier parsing logic. In principle the compiler should be able to optimize basic_format_specs
handling since the specs are constexpr. The LTO results seem to confirm that but the question is how to make it work without LTO.
In principle the compiler should be able to optimize
basic_format_specs
handling since the specs are constexpr.
My bet here is that the code is just too complex for the compiler to optimize. We can play with attributes of function to make them always inlined, but that would not probably solve problems with losing knowledge of constness of value in some cases. I mean, I'm not sure that even the formatter object that is created at compile-time can be used inside format
method of spec_field
at runtime in the most efficient way, i.e. inlined somehow.
The LTO results seem to confirm that but the question is how to make it work without LTO.
I'm not sure how LTO should help in this case, at least for the results in the benchmark. It uses header-only versions of {fmt}, so looks like it could help only with reducing the size of the executable by deleting some of non-inlined functions from different versions. In fact, GCC with -flto did it for my precious benchmark, so I had to build it several times to get the results of different versions. But here are results for LTO and non-LTO builds: |
Master | SeparatedFormatter | |
---|---|---|---|
0 | 11.5 ns (w/o LTO), 10.7 ns (with LTO) | 2.44 ns (w/o LTO), 2.20 ns (with LTO) | |
42 | 12.0 ns (w/o LTO), 11.0 ns (with LTO) | 3.06 ns (w/o LTO), 3.25 ns (with LTO) | |
273123 | 13.2 ns (w/o LTO), 12.7 ns (with LTO) | 6.11 ns (w/o LTO), 7.10 ns (with LTO) | |
9223372036854775807 | 14.3 ns (w/o LTO), 13.5 ns (with LTO) | 7.89 ns (w/o LTO), 10.1 ns (with LTO) |
And another set of benchmarks. They are also available in my benchmark repository.
Clang was added, it shows even better results for SeparatedFormatter tests:
Benchmark, value == 273123 |
GCC | Clang |
---|---|---|
CustomCode | 6.15 ns | 5.03 ns |
FMTCompileMaster | 13.3 ns | 14.7 ns |
FMTCompileMasterNOINLINERemovedFromFill | 13.2 ns | 14.8 ns |
FMTCompileSeparateFormatter | 14.2 ns | 13.4 ns |
FMTCompileSeparateFormatterNOINLINERemovedFromFill | 7.49 ns | 5.19 ns |
ToChars | 3.76 ns | 3.44 ns |
For Clang there is no overhead of using code from FMTCompileSeparateFormatterNOINLINERemovedFromFill
compare to custom code.
To test what I wrote earlier:
We can play with attributes of function to make them always inlined, but that would not probably solve problems with losing knowledge of constness of value in some cases.
I modified all functions of {fmt} (which are invoked by the test code) to add to them always_inline
attribute. This forces compiler to make inlining even in case if it decides that the function is too complex. Of course, fill
function left untouched for FMTCompileMaster
and FMTCompileSeparateFormatter
versions of {fmt}. Here are the results (FI stands for force inline version):
Benchmark, value == 42 |
GCC* | GCC* FI | Clang | Clang FI |
---|---|---|---|---|
CustomCode | 2.79 ns | 2.79 ns | 2.22 ns | 2.24 ns |
FMTCompileMaster | 12.0 ns | 11.8 ns | 13.3 ns | 12.8 ns |
FMTCompileMasterNOINLINERemovedFromFill | 12.0 ns | 11.8 ns | 13.4 ns | 12.7 ns |
FMTCompileSeparateFormatter | 11.5 ns | 4.83 ns | 10.6 ns | 10.6 ns |
FMTCompileSeparateFormatterNOINLINERemovedFromFill | 3.53 ns | 3.31 ns | 2.22 ns | 2.26 ns |
ToChars | 2.49 ns | 2.01 ns | 2.21 ns | 2.23 ns |
Benchmark, value == 9223372036854775807 |
GCC* | GCC* FI | Clang | Clang FI |
---|---|---|---|---|
CustomCode | 6.82 ns | 6.85 ns | 6.47 ns | 6.58 ns |
FMTCompileMaster | 14.5 ns | 14.5 ns | 16.0 ns | 15.5 ns |
FMTCompileMasterNOINLINERemovedFromFill | 14.4 ns | 14.5 ns | 16.0 ns | 15.5 ns |
FMTCompileSeparateFormatter | 16.6 ns | 11.6 ns | 15.4 ns | 15.6 ns |
FMTCompileSeparateFormatterNOINLINERemovedFromFill | 10.5 ns | 9.80 ns | 6.49 ns | 6.55 ns |
ToChars | 5.45 ns | 4.54 ns | 4.92 ns | 5.00 ns |
Looks like this trick can only slice some time for the FMTCompileSeparateFormatter
test on GCC, but for the current approach in {fmt} it does nothing.
Since this issue started with std::string
based test, I decided to make a proper benchmark for this case. I took code from the first message, and since this code fills the result with leading zeros (up to 8 digits), the format string for {fmt}-based test is {:0>8x}
. Also, I replaced ToChars
test with StringStream
test, just to be sure that stringstream
still sucks.
Benchmark, value == 42 |
GCC | Clang |
---|---|---|
CustomCode | 3.51 ns | 3.24 ns |
FMTCompileMaster | 36.4 ns | 28.1 ns |
FMTCompileMasterNOINLINERemovedFromFill | 36.5 ns | 27.9 ns |
FMTCompileSeparateFormatter | 19.8 ns | 18.5 ns |
FMTCompileSeparateFormatterNOINLINERemovedFromFill | 14.9 ns | 18.8 ns |
StringStream | 311 ns | 322 ns |
Considering these results looks like {fmt} has a big overhead, but that's quite false. Code from the first message is actually closer to the usage of fmt::format_to
because the buffer
size is available at compile-time. Just by changing this code like that:
std::string test_custom_code(unsigned value) {
constexpr unsigned width = 2 * sizeof(value);
std::string buffer;
for (int i = 0; i < width; i++)
buffer.append(1, hexchar[(value >> 4 * (width - 1 - i)) & 0xf]);
return buffer;
}
the results from GCC (Clang still optimizes this code) become comparable: | Benchmark, value == 273123 |
Time |
---|---|---|
CustomCode | 15.6 ns | |
FMTCompileMaster | 35.4 ns | |
FMTCompileMasterNOINLINERemovedFromFill | 35.4 ns | |
FMTCompileSeparateFormatter | 19.4 ns | |
FMTCompileSeparateFormatterNOINLINERemovedFromFill | 14.7 ns | |
StringStream | 321 ns |
AFAICS the part that prevents optimization for some reason is dynamic width/precision handling: https://github.com/fmtlib/fmt/blob/0f85a4683a137fa76ca7573ffbf4f7b118040b80/include/fmt/format.h#L3540-L3544
Temporarily commenting it out makes gcc do the expected optimization:
-----------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------------------------------------------------------------------
CustomCode/0 1.40 ns 1.40 ns 499241121
CustomCode/42 1.99 ns 1.99 ns 350273792
CustomCode/273123 4.93 ns 4.93 ns 143643778
CustomCode/9223372036854775807 5.48 ns 5.48 ns 125456338
FMTCompileMaster/0 1.20 ns 1.20 ns 582823513
FMTCompileMaster/42 1.59 ns 1.59 ns 437414846
FMTCompileMaster/273123 2.89 ns 2.89 ns 238470638
FMTCompileMaster/9223372036854775807 3.89 ns 3.89 ns 180152587
FMTCompileMasterNOINLINERemovedFromFill/0 9.17 ns 9.17 ns 75675294
FMTCompileMasterNOINLINERemovedFromFill/42 9.57 ns 9.57 ns 72413567
FMTCompileMasterNOINLINERemovedFromFill/273123 10.6 ns 10.6 ns 65808614
FMTCompileMasterNOINLINERemovedFromFill/9223372036854775807 11.6 ns 11.6 ns 60052123
FMTCompileSeparateFormatter/0 7.71 ns 7.71 ns 89553288
FMTCompileSeparateFormatter/42 8.52 ns 8.52 ns 81365037
FMTCompileSeparateFormatter/273123 10.7 ns 10.7 ns 64955329
FMTCompileSeparateFormatter/9223372036854775807 12.0 ns 12.0 ns 58023609
FMTCompileSeparateFormatterNOINLINERemovedFromFill/0 1.91 ns 1.91 ns 368577999
FMTCompileSeparateFormatterNOINLINERemovedFromFill/42 2.63 ns 2.63 ns 265312761
FMTCompileSeparateFormatterNOINLINERemovedFromFill/273123 5.57 ns 5.57 ns 126824185
FMTCompileSeparateFormatterNOINLINERemovedFromFill/9223372036854775807 8.26 ns 8.26 ns 84841458
ToChars/0 1.00 ns 1.00 ns 667463669
ToChars/42 1.99 ns 1.99 ns 350416544
ToChars/273123 2.99 ns 2.99 ns 233728063
ToChars/9223372036854775807 4.32 ns 4.32 ns 161760882
Need to figure out what specifically gcc doesn't like about the above code snippet.
Temporarily commenting it out makes gcc do the expected optimization:
Wow, these results are great! I didn't expect that compilers can optimize code like that.
This change makes FMTCompileMaster
timings even better than timings for std::to_chars
, despite {fmt} is way more generic. But that's true for GCC, not for Clang. On Clang, FMTCompileMaster
timings become better with this change, but it's still slower than FMTCompileSeparateFormatterNOINLINERemovedFromFill
. Here are combined results (DSH - dynamic specs handling):
Benchmark, value == 0 |
GCC w/o DSH | GCC with DSH | Clang w/o DSH | Clang with DSH |
---|---|---|---|---|
CustomCode | 1.76 ns | 1.75 ns | 1.59 ns | 1.55 ns |
FMTCompileMaster | 1.50 ns | 11.5 ns | 6.82 ns | 12.9 ns |
FMTCompileSeparateFormatterNOINLINERemovedFromFill | 2.54 ns | 2.52 ns | 1.59 ns | 1.56 ns |
ToChars | 1.24 ns | 1.25 ns | 1.25 ns | 1.89 ns |
Benchmark, value == 42 |
GCC w/o DSH | GCC with DSH | Clang w/o DSH | Clang with DSH |
---|---|---|---|---|
CustomCode | 2.55 ns | 2.79 ns | 2.26 ns | 2.22 ns |
FMTCompileMaster | 2.00 ns | 12.0 ns | 7.49 ns | 13.3 ns |
FMTCompileSeparateFormatterNOINLINERemovedFromFill | 3.80 ns | 3.53 ns | 2.26 ns | 2.22 ns |
ToChars | 2.90 ns | 2.49 ns | 2.25 ns | 2.21 ns |
Benchmark, value == 273123 |
GCC w/o DSH | GCC with DSH | Clang w/o DSH | Clang with DSH |
---|---|---|---|---|
CustomCode | 5.81 ns | 6.15 ns | 5.19 ns | 5.03 ns |
FMTCompileMaster | 3.34 ns | 13.3 ns | 8.95 ns | 14.7 ns |
FMTCompileSeparateFormatterNOINLINERemovedFromFill | 8.26 ns | 7.49 ns | 5.51 ns | 5.19 ns |
ToChars | 4.50 ns | 3.76 ns | 3.50 ns | 3.44 ns |
Benchmark, value == 9223372036854775807 |
GCC w/o DSH | GCC with DSH | Clang w/o DSH | Clang with DSH |
---|---|---|---|---|
CustomCode | 6.95 ns | 6.82 ns | 6.72 ns | 6.47 ns |
FMTCompileMaster | 4.86 ns | 14.5 ns | 10.6 ns | 16.0 ns |
FMTCompileSeparateFormatterNOINLINERemovedFromFill | 11.3 ns | 10.5 ns | 6.67 ns | 6.49 ns |
ToChars | 5.63 ns | 5.45 ns | 5.02 ns | 4.92 ns |
For some reason FMTCompileSeparateFormatterNOINLINERemovedFromFill
is slower than FMTCompileMaster
on GCC. I mean, they share the same {fmt} code, and int_formatter
is way simpler than the current one 😐. But okay, I would investigate what's going on there.
In addition, I did a quick check for a bit more complex format string 0>8x
. Here are also combined results:
Benchmark, value == 0 |
GCC w/o DSH | GCC with DSH | Clang w/o DSH | Clang with DSH |
---|---|---|---|---|
CustomCode | 4.07 ns | 4.48 ns | 5.27 ns | 5.23 ns |
FMTCompileMaster | 21.5 ns | 27.8 ns | 14.2 ns | 19.2 ns |
FMTCompileMasterNOINLINERemovedFromFill | 4.12 ns | 27.7 ns | 14.2 ns | 19.2 ns |
FMTCompileSeparateFormatter | 11.6 ns | 11.0 ns | 9.72 ns | 9.97 ns |
FMTCompileSeparateFormatterNOINLINERemovedFromFill | 5.60 ns | 5.97 ns | 4.95 ns | 5.48 ns |
Benchmark, value == 42 |
GCC w/o DSH | GCC with DSH | Clang w/o DSH | Clang with DSH |
---|---|---|---|---|
CustomCode | 4.32 ns | 5.61 ns | 6.15 ns | 6.08 ns |
FMTCompileMaster | 21.9 ns | 28.8 ns | 14.8 ns | 19.6 ns |
FMTCompileMasterNOINLINERemovedFromFill | 4.84 ns | 28.6 ns | 15.0 ns | 19.6 ns |
FMTCompileSeparateFormatter | 11.8 ns | 11.8 ns | 10.7 ns | 10.7 ns |
FMTCompileSeparateFormatterNOINLINERemovedFromFill | 5.96 ns | 6.21 ns | 6.29 ns | 5.99 ns |
Benchmark, value == 273123 |
GCC w/o DSH | GCC with DSH | Clang w/o DSH | Clang with DSH |
---|---|---|---|---|
CustomCode | 6.91 ns | 8.07 ns | 8.51 ns | 8.44 ns |
FMTCompileMaster | 23.2 ns | 29.7 ns | 16.3 ns | 20.9 ns |
FMTCompileMasterNOINLINERemovedFromFill | 5.72 ns | 29.9 ns | 16.2 ns | 21.0 ns |
FMTCompileSeparateFormatter | 14.7 ns | 15.1 ns | 12.6 ns | 13.5 ns |
FMTCompileSeparateFormatterNOINLINERemovedFromFill | 7.98 ns | 8.06 ns | 9.37 ns | 8.30 ns |
Benchmark, value == 9223372036854775807 |
GCC w/o DSH | GCC with DSH | Clang w/o DSH | Clang with DSH |
---|---|---|---|---|
CustomCode | 7.53 ns | 9.28 ns | 8.38 ns | 8.28 ns |
FMTCompileMaster | 20.5 ns | 27.2 ns | 14.0 ns | 19.3 ns |
FMTCompileMasterNOINLINERemovedFromFill | 5.07 ns | 27.5 ns | 13.9 ns | 19.0 ns |
FMTCompileSeparateFormatter | 16.9 ns | 16.7 ns | 14.9 ns | 15.7 ns |
FMTCompileSeparateFormatterNOINLINERemovedFromFill | 9.38 ns | 9.33 ns | 11.7 ns | 10.6 ns |
Ok, I realized why FMTCompileSeparateFormatter*
are so slow compared to FMTCompileMaster*
- it was probably a bad idea to take different revisions of {fmt} without rebasing separate formatter test code onto the master
version. Especially when format.h
was updated in master
(integral formatting refactoring).
Results for {:x} format string: |
Benchmark, value == 42 |
GCC | Clang |
---|---|---|---|
CustomCode | 2.51 ns | 2.27 ns | |
FMTCompileMaster | 1.99 ns | 7.41 ns | |
FMTCompileMasterNOINLINERemovedFromFill | 2.00 ns | 7.17 ns | |
FMTCompileSeparateFormatter | 2.75 ns | 2.09 ns | |
FMTCompileSeparateFormatterNOINLINERemovedFromFill | 2.76 ns | 2.08 ns | |
ToChars | 2.10 ns | 2.27 ns |
Results for {:0>8x} format string: |
Benchmark, value == 42 |
GCC | Clang |
---|---|---|---|
CustomCode | 4.27 ns | 6.19 ns | |
FMTCompileMaster | 22.3 ns | 14.8 ns | |
FMTCompileMasterNOINLINERemovedFromFill | 4.82 ns | 15.0 ns | |
FMTCompileSeparateFormatter | 8.72 ns | 6.85 ns | |
FMTCompileSeparateFormatterNOINLINERemovedFromFill | 5.95 ns | 5.32 ns |
Still, FMTCompileMaster
is a bit faster than FMTCompileSeparateFormatter
on GCC, but it's either poor optimization for a non-type template parameter or some optimization magic which cannot be solved.
Anyway, even if those handle_dynamic_spec()
calls cannot be optimized out sanely, then I think we still can have that static only specs optimization in {fmt}. This solution requires a new formatter that handles only basic_
not dynamic_
format specs. Also, there is a problem of detecting dynamic format specs on the string parsing step, because we cannot just look for {
and }
in specs to determine it is dynamic or static. But we can create the default formatter
and look at the context argument identifier after calling parse()
method of formatter
.
Probably the final results for hexadecimal formatting.
I ran the same benchmarks for {:x}
and {:0>8x}
format strings, but this time they were compiled using MSVC (version 19.28.29913), since this compiler is also popular. Again, it's C++20 standard, it's a release configuration, everything remains the same. I also added results from GCC (Cygwin) on Windows just to demonstrate that GCC results on Linux and on Windows are almost identical.
Results for {:x}
format string:
Benchmark, value == 42 |
MSVC | GCC (Cygwin) | GCC | Clang |
---|---|---|---|---|
CustomCode | 4.14 ns | 2.78 ns | 2.51 ns | 2.27 ns |
FMTCompileMaster | 38.5 ns | 2.52 ns | 1.99 ns | 7.41 ns |
FMTCompileMasterNOINLINERemovedFromFill | 38.5 ns | 2.02 ns | 2.00 ns | 7.17 ns |
FMTCompileSeparateFormatter | 7.87 ns | 2.52 ns | 2.75 ns | 2.09 ns |
FMTCompileSeparateFormatterNOINLINERemovedFromFill | 7.89 ns | 2.85 ns | 2.76 ns | 2.08 ns |
ToChars | 8.07 ns | 2.02 ns | 2.10 ns | 2.27 ns |
Results for {:0>8x}
format string:
Benchmark, value == 42 |
MSVC | GCC (Cygwin) | GCC | Clang |
---|---|---|---|---|
CustomCode | 6.85 ns | 4.78 ns | 4.27 ns | 6.19 ns |
FMTCompileMaster | 45.8 ns | 27.4 ns | 22.3 ns | 14.8 ns |
FMTCompileMasterNOINLINERemovedFromFill | 45.9 ns | 5.37 ns | 4.82 ns | 15.0 ns |
FMTCompileSeparateFormatter | 16.7 ns | 15.4 ns | 8.72 ns | 6.85 ns |
FMTCompileSeparateFormatterNOINLINERemovedFromFill | 16.7 ns | 6.64 ns | 5.95 ns | 5.32 ns |
Need to figure out what specifically gcc doesn't like about the above code snippet.
I dig a little into the handle_dynamic_spec()
, and based on what I saw - every meaningful line (in this function or in functions that are invoked inside) introduces some overhead (mostly it's ~1-2 ns). But the biggest impact comes from basic_format_args
and it's get()
method (~5 ns). Note that these observations are not very precise, because they actually just cannot be precise because we are talking about compiler optimizations here with all that magic under the hood.
Thanks for such a comprehensive testing!
I didn't expect that compilers can optimize code like that.
In principle this is mostly inlining and constant propagation. I'm more surprised that clang does a poor job here. I'll try to dig into the generated assembly to see what's going on.
I think we still can have that static only specs optimization in {fmt}.
Right.
I dig a little into the handle_dynamic_spec(), and based on what I saw - every meaningful line (in this function or in functions that are invoked inside) introduces some overhead (mostly it's ~1-2 ns).
True but it should be optimized away completely if dynamic specs are not used.
@alexezeder That's a nice benchmarks indeed.
I hope you'll add a std::format
to this contest, when it's ready.
After a recent series of optimizations the perf is on par with the handwritten code. Here are the results on clang and the current master
(https://github.com/fmtlib/fmt/commit/d1aebdbde0fc59c7aa047af19ac58784f8a99e1d):
------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------------------------------------------
CustomCode/0 1.86 ns 1.86 ns 351709307
CustomCode/42 3.03 ns 3.02 ns 244754389
CustomCode/273123 5.45 ns 5.43 ns 130468007
CustomCode/9223372036854775807 6.79 ns 6.77 ns 102097372
FMTCompileMaster/0 1.99 ns 1.99 ns 360523686
FMTCompileMaster/42 2.33 ns 2.33 ns 279865664
FMTCompileMaster/273123 3.72 ns 3.71 ns 190230315
FMTCompileMaster/9223372036854775807 5.28 ns 5.26 ns 130711631
ToChars/0 4.42 ns 4.41 ns 160196630
ToChars/42 5.00 ns 4.98 ns 140735201
ToChars/273123 7.26 ns 7.24 ns 95784130
ToChars/9223372036854775807 8.77 ns 8.75 ns 75872534
For comparison here's are results on an old version (7.1.3):
Running ./fmt_test
Run on (8 X 2800 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x4)
L1 Instruction 32 KiB (x4)
L2 Unified 256 KiB (x4)
L3 Unified 8192 KiB (x1)
Load Average: 22.02, 12.70, 8.66
------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------------------------------------------
CustomCode/0 1.81 ns 1.81 ns 369289856
CustomCode/42 2.66 ns 2.63 ns 277249683
CustomCode/273123 5.30 ns 5.27 ns 131162285
CustomCode/9223372036854775807 6.92 ns 6.86 ns 101821144
FMTCompileOld/0 15.5 ns 15.5 ns 43302898
FMTCompileOld/42. 16.6 ns 16.6 ns 43278267
FMTCompileOld/273123 18.7 ns 18.6 ns 37035861
FMTCompileOld/9223372036854775807 19.4 ns 19.4 ns 35243000
ToChars/0 4.33 ns 4.31 ns 162237581
ToChars/42 5.50 ns 5.44 ns 123228589
ToChars/273123 8.47 ns 8.16 ns 92972600
ToChars/9223372036854775807 9.84 ns 9.52 ns 63743569
Manual implementation is about 10 times faster than fmtlib. MSVC x64