Optimization for hexadecimal formatting

fmtlib / fmt

A modern formatting library

https://fmt.dev

Other

20.78k stars 2.5k forks source link

Optimization for hexadecimal formatting #1944

Closed NN--- closed 3 years ago

NN--- commented 4 years ago

Manual implementation is about 10 times faster than fmtlib. MSVC x64

Manual: 2414
Fmt: 16481

#include <string>
#include <iostream>
#include <chrono>
#include <fmt/format.h>

constexpr char hexchar[] = "0123456789ABCDEF";

template <typename T>
std::string to_hex(T value)
{
    constexpr size_t sz = sizeof(value) * 2;
    std::string buffer(sz, L'\0');
    for (int i = 0; i < 2 * sizeof(value); i++)
        buffer[i] = hexchar[(value >> 4 * (sz - 1 - i)) & 0xf];
    return buffer;
}

template <typename T>
std::string fmt_to_hex(T value)
{
    return fmt::format(FMT_STRING("{0:X}"), value);
}

int main()
{
    int sum = 0;

    auto before =  std::chrono::system_clock::now();

    for (int i = 0; i < 100000000; i++)
    {
        auto a = to_hex(i);
        sum += a.length();
    }

    auto manual =  std::chrono::system_clock::now() - before;

    before=  std::chrono::system_clock::now();

    for (int i = 0; i < 100000000; i++)
    {
        auto a = fmt_to_hex(i);
        sum += a.length();
    }

    auto fmt=  std::chrono::system_clock::now() -before;

    std::cout << "Manual: " << duration_cast<std::chrono::milliseconds>(manual).count() << std::endl;
    std::cout << "Fmt: " << duration_cast<std::chrono::milliseconds>(fmt).count() << std::endl;

    return sum;
}

vitaut commented 4 years ago

Your implementation of to_hex is not equivalent because it always produces unnecessary leading zeros. That said fmt's hex formatter is indeed quite suboptimal but it's easy to optimize. It's just a matter of adding a fast path in https://github.com/fmtlib/fmt/blob/17fba753c45b75848f0281c64be89a908275e7f8/include/fmt/compile.h#L537-L540 and using format string compilation. PRs are welcome.

BuzaL commented 3 years ago

Fortunate, easy to compare, 'cause it is 10 times slower than sprintf: https://godbolt.org/z/cPqMz3

NobodyXu commented 3 years ago

@BuzaL Interestingly, if I use FMT_COMPILE and use -std=c++17 -O3 -flto, then fmt::format is faster than snprintf by roughly 3 time.

https://godbolt.org/z/9Ejjoc

BuzaL commented 3 years ago

As I see, its an issue with 7.x.x. The 6.2.1 still faster then reserve + append, but 7.0 and trunk are slow with ANY formatting. That means something went wrong the on-the-fly format compiler.

vitaut commented 3 years ago

Looks like the regression is specific to Compiler Explorer, most likely the installed version is compiled with optimizations disabled.

vitaut commented 3 years ago

Confirmed in https://github.com/fmtlib/fmt/issues/2153#issuecomment-786302075. Don't use Compiler Explorer for benchmarking =). Also don't use system clock.

vitaut commented 3 years ago

I've landed a few formatting optimizations that improve hex formatting perf among other things.

With these changes, format string compilation and adding size computation to the manual method for consistency

#include <string>
#include <iostream>
#include <chrono>
#include <fmt/compile.h>

constexpr char hexchar[] = "0123456789ABCDEF";

template <typename T>
std::string to_hex(T value)
{
    constexpr size_t sz = sizeof(value) * 2;
    std::string buffer(sz, L'\0');
    int size = 0;
    for (int i = 0; i < 2 * sizeof(value); i++) {
        int digit = (value >> 4 * (sz - 1 - i)) & 0xf;
        buffer[i] = hexchar[digit];
        if (digit != 0) size = digit;
    }
    buffer.resize(size + 1);
    return buffer;
}

template <typename T>
std::string fmt_to_hex(T value)
{
    return fmt::format(FMT_COMPILE("{:X}"), value);
}

int main()
{
    int sum = 0;

    auto before = std::chrono::steady_clock::now();

    for (int i = 0; i < 100000000; i++)
    {
        auto a = to_hex(i);
        sum += a.length();
    }

    auto manual = std::chrono::steady_clock::now() - before;

    before = std::chrono::steady_clock::now();

    for (int i = 0; i < 100000000; i++)
    {
        auto a = fmt_to_hex(i);
        sum += a.length();
    }

    auto fmt = std::chrono::steady_clock::now() - before;

    std::cout << "Manual: " << std::chrono::duration_cast<std::chrono::milliseconds>(manual).count() << std::endl;
    std::cout << "Fmt: " << std::chrono::duration_cast<std::chrono::milliseconds>(fmt).count() << std::endl;

    return sum;
}

we can get comparable performance:

$ g++ test.cc -DNDEBUG -O3 -I include src/format.cc -std=c++17 -flto
$ ./a.out 
Manual: 1099
Fmt: 867

Note that the results are not very reliable, I recommend using a benchmarking framework instead.

alexezeder commented 3 years ago

I prepared a bit different still fitting benchmark of hex formatting. It doesn't check fmt::format() or to_hex from the top message that returns std::string, as well as std::stringstream and std::to_string, because all these approaches introduce additional allocations that may influence on actual formatting. But it does check formatting to the existing buffer, so there are several versions of fmt::format_to, std::to_chars, and custom code (it copied from {fmt} but doesn't introduce any complexity that format_to does). There are tests named FMTCustomFormatter* which have a slightly modified version of fmt/compile.h that provides a custom int_formatter class for formatting integer types. It's the initial test of my idea of formatter being specific for passed specs, it was previously described here - https://github.com/fmtlib/fmt/issues/2078. Let me paste some of the results here (full results available in README.md, also you can check it on your PC):

Benchmark	Time
CustomCode	2.93 ns
FMTCompileMaster	12.5 ns
FMTCompileSeparateFormatterNOINLINERemovedFromFill	3.66 ns
ToChars	2.57 ns

vitaut commented 3 years ago

Thanks for the benchmark. I wonder why the results for FMTCompileMaster and FMTCompileMasterNOINLINERemovedFromFill are slightly different in https://github.com/alexezeder/fmt_bnchmrk if fill is not called? Maybe a measurement error or some difference in codegen?

alexezeder commented 3 years ago

I wonder why the results for FMTCompileMaster and FMTCompileMasterNOINLINERemovedFromFill are slightly different in https://github.com/alexezeder/fmt_bnchmrk_legacy if fill is not called? Maybe a measurement error or some difference in codegen?

Oops... I made a mistake at the last state of preparing this repo and FMTCompileMasterNOINLINERemovedFromFill actually has the same code as FMTCompileMaster. I updated the benchmark repo, but they are still really close, almost equal.

There is a difference in codegen of course, but looks like as long as fill() function is not used this difference does not really affect performance. So it's probably a measurement error, something like ±0.3 ns which is not bad on my opinion.

EDIT: repo moved from https://github.com/alexezeder/fmt_bnchmrk to https://github.com/alexezeder/fmt_bnchmrk_legacy

vitaut commented 3 years ago

Regarding FMTCompileSeparateFormatterNOINLINERemovedFromFill, it is a nice speedup but I wonder if we can avoid replicating the format specifier parsing logic. In principle the compiler should be able to optimize basic_format_specs handling since the specs are constexpr. The LTO results seem to confirm that but the question is how to make it work without LTO.

alexezeder commented 3 years ago

In principle the compiler should be able to optimize basic_format_specs handling since the specs are constexpr.

My bet here is that the code is just too complex for the compiler to optimize. We can play with attributes of function to make them always inlined, but that would not probably solve problems with losing knowledge of constness of value in some cases. I mean, I'm not sure that even the formatter object that is created at compile-time can be used inside format method of spec_field at runtime in the most efficient way, i.e. inlined somehow.

The LTO results seem to confirm that but the question is how to make it work without LTO.

I'm not sure how LTO should help in this case, at least for the results in the benchmark. It uses header-only versions of {fmt}, so looks like it could help only with reducing the size of the executable by deleting some of non-inlined functions from different versions. In fact, GCC with `-flto` did it for my precious benchmark, so I had to build it several times to get the results of different versions. But here are results for LTO and non-LTO builds:		Master
0	11.5 ns (w/o LTO), 10.7 ns (with LTO)	2.44 ns (w/o LTO), 2.20 ns (with LTO)
42	12.0 ns (w/o LTO), 11.0 ns (with LTO)	3.06 ns (w/o LTO), 3.25 ns (with LTO)
273123	13.2 ns (w/o LTO), 12.7 ns (with LTO)	6.11 ns (w/o LTO), 7.10 ns (with LTO)
9223372036854775807	14.3 ns (w/o LTO), 13.5 ns (with LTO)	7.89 ns (w/o LTO), 10.1 ns (with LTO)

alexezeder commented 3 years ago

And another set of benchmarks. They are also available in my benchmark repository.

Clang

Clang was added, it shows even better results for SeparatedFormatter tests:

Benchmark, `value == 273123`	GCC	Clang
CustomCode	6.15 ns	5.03 ns
FMTCompileMaster	13.3 ns	14.7 ns
FMTCompileMasterNOINLINERemovedFromFill	13.2 ns	14.8 ns
FMTCompileSeparateFormatter	14.2 ns	13.4 ns
FMTCompileSeparateFormatterNOINLINERemovedFromFill	7.49 ns	5.19 ns
ToChars	3.76 ns	3.44 ns

For Clang there is no overhead of using code from FMTCompileSeparateFormatterNOINLINERemovedFromFill compare to custom code.

Force inlining

To test what I wrote earlier:

We can play with attributes of function to make them always inlined, but that would not probably solve problems with losing knowledge of constness of value in some cases.

I modified all functions of {fmt} (which are invoked by the test code) to add to them always_inline attribute. This forces compiler to make inlining even in case if it decides that the function is too complex. Of course, fill function left untouched for FMTCompileMaster and FMTCompileSeparateFormatter versions of {fmt}. Here are the results (FI stands for force inline version):

Benchmark, `value == 42`	GCC*	GCC* FI	Clang	Clang FI
CustomCode	2.79 ns	2.79 ns	2.22 ns	2.24 ns
FMTCompileMaster	12.0 ns	11.8 ns	13.3 ns	12.8 ns
FMTCompileMasterNOINLINERemovedFromFill	12.0 ns	11.8 ns	13.4 ns	12.7 ns
FMTCompileSeparateFormatter	11.5 ns	4.83 ns	10.6 ns	10.6 ns
FMTCompileSeparateFormatterNOINLINERemovedFromFill	3.53 ns	3.31 ns	2.22 ns	2.26 ns
ToChars	2.49 ns	2.01 ns	2.21 ns	2.23 ns

Benchmark, `value == 9223372036854775807`	GCC*	GCC* FI	Clang	Clang FI
CustomCode	6.82 ns	6.85 ns	6.47 ns	6.58 ns
FMTCompileMaster	14.5 ns	14.5 ns	16.0 ns	15.5 ns
FMTCompileMasterNOINLINERemovedFromFill	14.4 ns	14.5 ns	16.0 ns	15.5 ns
FMTCompileSeparateFormatter	16.6 ns	11.6 ns	15.4 ns	15.6 ns
FMTCompileSeparateFormatterNOINLINERemovedFromFill	10.5 ns	9.80 ns	6.49 ns	6.55 ns
ToChars	5.45 ns	4.54 ns	4.92 ns	5.00 ns

note: GCC and GCC FI results cannot be directly compared because there is a time gap between these runs

Looks like this trick can only slice some time for the FMTCompileSeparateFormatter test on GCC, but for the current approach in {fmt} it does nothing.

std::string tests

Since this issue started with std::string based test, I decided to make a proper benchmark for this case. I took code from the first message, and since this code fills the result with leading zeros (up to 8 digits), the format string for {fmt}-based test is {:0>8x}. Also, I replaced ToChars test with StringStream test, just to be sure that stringstream still sucks.

Benchmark, `value == 42`	GCC	Clang
CustomCode	3.51 ns	3.24 ns
FMTCompileMaster	36.4 ns	28.1 ns
FMTCompileMasterNOINLINERemovedFromFill	36.5 ns	27.9 ns
FMTCompileSeparateFormatter	19.8 ns	18.5 ns
FMTCompileSeparateFormatterNOINLINERemovedFromFill	14.9 ns	18.8 ns
StringStream	311 ns	322 ns

Considering these results looks like {fmt} has a big overhead, but that's quite false. Code from the first message is actually closer to the usage of fmt::format_to because the buffer size is available at compile-time. Just by changing this code like that:

std::string test_custom_code(unsigned value) {
  constexpr unsigned width = 2 * sizeof(value);
  std::string buffer;
  for (int i = 0; i < width; i++)
    buffer.append(1, hexchar[(value >> 4 * (width - 1 - i)) & 0xf]);
  return buffer;
}

the results from GCC (Clang still optimizes this code) become comparable:	Benchmark, `value == 273123`	Time
CustomCode	15.6 ns
FMTCompileMaster	35.4 ns
FMTCompileMasterNOINLINERemovedFromFill	35.4 ns
FMTCompileSeparateFormatter	19.4 ns
FMTCompileSeparateFormatterNOINLINERemovedFromFill	14.7 ns
StringStream	321 ns

vitaut commented 3 years ago

AFAICS the part that prevents optimization for some reason is dynamic width/precision handling: https://github.com/fmtlib/fmt/blob/0f85a4683a137fa76ca7573ffbf4f7b118040b80/include/fmt/format.h#L3540-L3544

Temporarily commenting it out makes gcc do the expected optimization:

-----------------------------------------------------------------------------------------------------------------
Benchmark                                                                       Time             CPU   Iterations
-----------------------------------------------------------------------------------------------------------------
CustomCode/0                                                                 1.40 ns         1.40 ns    499241121
CustomCode/42                                                                1.99 ns         1.99 ns    350273792
CustomCode/273123                                                            4.93 ns         4.93 ns    143643778
CustomCode/9223372036854775807                                               5.48 ns         5.48 ns    125456338
FMTCompileMaster/0                                                           1.20 ns         1.20 ns    582823513
FMTCompileMaster/42                                                          1.59 ns         1.59 ns    437414846
FMTCompileMaster/273123                                                      2.89 ns         2.89 ns    238470638
FMTCompileMaster/9223372036854775807                                         3.89 ns         3.89 ns    180152587
FMTCompileMasterNOINLINERemovedFromFill/0                                    9.17 ns         9.17 ns     75675294
FMTCompileMasterNOINLINERemovedFromFill/42                                   9.57 ns         9.57 ns     72413567
FMTCompileMasterNOINLINERemovedFromFill/273123                               10.6 ns         10.6 ns     65808614
FMTCompileMasterNOINLINERemovedFromFill/9223372036854775807                  11.6 ns         11.6 ns     60052123
FMTCompileSeparateFormatter/0                                                7.71 ns         7.71 ns     89553288
FMTCompileSeparateFormatter/42                                               8.52 ns         8.52 ns     81365037
FMTCompileSeparateFormatter/273123                                           10.7 ns         10.7 ns     64955329
FMTCompileSeparateFormatter/9223372036854775807                              12.0 ns         12.0 ns     58023609
FMTCompileSeparateFormatterNOINLINERemovedFromFill/0                         1.91 ns         1.91 ns    368577999
FMTCompileSeparateFormatterNOINLINERemovedFromFill/42                        2.63 ns         2.63 ns    265312761
FMTCompileSeparateFormatterNOINLINERemovedFromFill/273123                    5.57 ns         5.57 ns    126824185
FMTCompileSeparateFormatterNOINLINERemovedFromFill/9223372036854775807       8.26 ns         8.26 ns     84841458
ToChars/0                                                                    1.00 ns         1.00 ns    667463669
ToChars/42                                                                   1.99 ns         1.99 ns    350416544
ToChars/273123                                                               2.99 ns         2.99 ns    233728063
ToChars/9223372036854775807                                                  4.32 ns         4.32 ns    161760882

Need to figure out what specifically gcc doesn't like about the above code snippet.

alexezeder commented 3 years ago

Temporarily commenting it out makes gcc do the expected optimization:

Wow, these results are great! I didn't expect that compilers can optimize code like that.

This change makes FMTCompileMaster timings even better than timings for std::to_chars, despite {fmt} is way more generic. But that's true for GCC, not for Clang. On Clang, FMTCompileMaster timings become better with this change, but it's still slower than FMTCompileSeparateFormatterNOINLINERemovedFromFill. Here are combined results (DSH - dynamic specs handling):

Benchmark, `value == 0`	GCC w/o DSH	GCC with DSH	Clang w/o DSH	Clang with DSH
CustomCode	1.76 ns	1.75 ns	1.59 ns	1.55 ns
FMTCompileMaster	1.50 ns	11.5 ns	6.82 ns	12.9 ns
FMTCompileSeparateFormatterNOINLINERemovedFromFill	2.54 ns	2.52 ns	1.59 ns	1.56 ns
ToChars	1.24 ns	1.25 ns	1.25 ns	1.89 ns

Benchmark, `value == 42`	GCC w/o DSH	GCC with DSH	Clang w/o DSH	Clang with DSH
CustomCode	2.55 ns	2.79 ns	2.26 ns	2.22 ns
FMTCompileMaster	2.00 ns	12.0 ns	7.49 ns	13.3 ns
FMTCompileSeparateFormatterNOINLINERemovedFromFill	3.80 ns	3.53 ns	2.26 ns	2.22 ns
ToChars	2.90 ns	2.49 ns	2.25 ns	2.21 ns

Benchmark, `value == 273123`	GCC w/o DSH	GCC with DSH	Clang w/o DSH	Clang with DSH
CustomCode	5.81 ns	6.15 ns	5.19 ns	5.03 ns
FMTCompileMaster	3.34 ns	13.3 ns	8.95 ns	14.7 ns
FMTCompileSeparateFormatterNOINLINERemovedFromFill	8.26 ns	7.49 ns	5.51 ns	5.19 ns
ToChars	4.50 ns	3.76 ns	3.50 ns	3.44 ns

Benchmark, `value == 9223372036854775807`	GCC w/o DSH	GCC with DSH	Clang w/o DSH	Clang with DSH
CustomCode	6.95 ns	6.82 ns	6.72 ns	6.47 ns
FMTCompileMaster	4.86 ns	14.5 ns	10.6 ns	16.0 ns
FMTCompileSeparateFormatterNOINLINERemovedFromFill	11.3 ns	10.5 ns	6.67 ns	6.49 ns
ToChars	5.63 ns	5.45 ns	5.02 ns	4.92 ns

For some reason FMTCompileSeparateFormatterNOINLINERemovedFromFill is slower than FMTCompileMaster on GCC. I mean, they share the same {fmt} code, and int_formatter is way simpler than the current one 😐. But okay, I would investigate what's going on there.

In addition, I did a quick check for a bit more complex format string 0>8x. Here are also combined results:

Benchmark, `value == 0`	GCC w/o DSH	GCC with DSH	Clang w/o DSH	Clang with DSH
CustomCode	4.07 ns	4.48 ns	5.27 ns	5.23 ns
FMTCompileMaster	21.5 ns	27.8 ns	14.2 ns	19.2 ns
FMTCompileMasterNOINLINERemovedFromFill	4.12 ns	27.7 ns	14.2 ns	19.2 ns
FMTCompileSeparateFormatter	11.6 ns	11.0 ns	9.72 ns	9.97 ns
FMTCompileSeparateFormatterNOINLINERemovedFromFill	5.60 ns	5.97 ns	4.95 ns	5.48 ns

Benchmark, `value == 42`	GCC w/o DSH	GCC with DSH	Clang w/o DSH	Clang with DSH
CustomCode	4.32 ns	5.61 ns	6.15 ns	6.08 ns
FMTCompileMaster	21.9 ns	28.8 ns	14.8 ns	19.6 ns
FMTCompileMasterNOINLINERemovedFromFill	4.84 ns	28.6 ns	15.0 ns	19.6 ns
FMTCompileSeparateFormatter	11.8 ns	11.8 ns	10.7 ns	10.7 ns
FMTCompileSeparateFormatterNOINLINERemovedFromFill	5.96 ns	6.21 ns	6.29 ns	5.99 ns

Benchmark, `value == 273123`	GCC w/o DSH	GCC with DSH	Clang w/o DSH	Clang with DSH
CustomCode	6.91 ns	8.07 ns	8.51 ns	8.44 ns
FMTCompileMaster	23.2 ns	29.7 ns	16.3 ns	20.9 ns
FMTCompileMasterNOINLINERemovedFromFill	5.72 ns	29.9 ns	16.2 ns	21.0 ns
FMTCompileSeparateFormatter	14.7 ns	15.1 ns	12.6 ns	13.5 ns
FMTCompileSeparateFormatterNOINLINERemovedFromFill	7.98 ns	8.06 ns	9.37 ns	8.30 ns

Benchmark, `value == 9223372036854775807`	GCC w/o DSH	GCC with DSH	Clang w/o DSH	Clang with DSH
CustomCode	7.53 ns	9.28 ns	8.38 ns	8.28 ns
FMTCompileMaster	20.5 ns	27.2 ns	14.0 ns	19.3 ns
FMTCompileMasterNOINLINERemovedFromFill	5.07 ns	27.5 ns	13.9 ns	19.0 ns
FMTCompileSeparateFormatter	16.9 ns	16.7 ns	14.9 ns	15.7 ns
FMTCompileSeparateFormatterNOINLINERemovedFromFill	9.38 ns	9.33 ns	11.7 ns	10.6 ns

alexezeder commented 3 years ago

Ok, I realized why FMTCompileSeparateFormatter* are so slow compared to FMTCompileMaster* - it was probably a bad idea to take different revisions of {fmt} without rebasing separate formatter test code onto the master version. Especially when format.h was updated in master (integral formatting refactoring).

Results for `{:x}` format string:	Benchmark, `value == 42`	GCC
CustomCode	2.51 ns	2.27 ns
FMTCompileMaster	1.99 ns	7.41 ns
FMTCompileMasterNOINLINERemovedFromFill	2.00 ns	7.17 ns
FMTCompileSeparateFormatter	2.75 ns	2.09 ns
FMTCompileSeparateFormatterNOINLINERemovedFromFill	2.76 ns	2.08 ns
ToChars	2.10 ns	2.27 ns

Results for `{:0>8x}` format string:	Benchmark, `value == 42`	GCC
CustomCode	4.27 ns	6.19 ns
FMTCompileMaster	22.3 ns	14.8 ns
FMTCompileMasterNOINLINERemovedFromFill	4.82 ns	15.0 ns
FMTCompileSeparateFormatter	8.72 ns	6.85 ns
FMTCompileSeparateFormatterNOINLINERemovedFromFill	5.95 ns	5.32 ns

Still, FMTCompileMaster is a bit faster than FMTCompileSeparateFormatter on GCC, but it's either poor optimization for a non-type template parameter or some optimization magic which cannot be solved.

Anyway, even if those handle_dynamic_spec() calls cannot be optimized out sanely, then I think we still can have that static only specs optimization in {fmt}. This solution requires a new formatter that handles only basic_ not dynamic_ format specs. Also, there is a problem of detecting dynamic format specs on the string parsing step, because we cannot just look for { and } in specs to determine it is dynamic or static. But we can create the default formatter and look at the context argument identifier after calling parse() method of formatter.

alexezeder commented 3 years ago

Probably the final results for hexadecimal formatting.

I ran the same benchmarks for {:x} and {:0>8x} format strings, but this time they were compiled using MSVC (version 19.28.29913), since this compiler is also popular. Again, it's C++20 standard, it's a release configuration, everything remains the same. I also added results from GCC (Cygwin) on Windows just to demonstrate that GCC results on Linux and on Windows are almost identical.

Results for {:x} format string:

Benchmark, `value == 42`	MSVC	GCC (Cygwin)	GCC	Clang
CustomCode	4.14 ns	2.78 ns	2.51 ns	2.27 ns
FMTCompileMaster	38.5 ns	2.52 ns	1.99 ns	7.41 ns
FMTCompileMasterNOINLINERemovedFromFill	38.5 ns	2.02 ns	2.00 ns	7.17 ns
FMTCompileSeparateFormatter	7.87 ns	2.52 ns	2.75 ns	2.09 ns
FMTCompileSeparateFormatterNOINLINERemovedFromFill	7.89 ns	2.85 ns	2.76 ns	2.08 ns
ToChars	8.07 ns	2.02 ns	2.10 ns	2.27 ns

Results for {:0>8x} format string:

Benchmark, `value == 42`	MSVC	GCC (Cygwin)	GCC	Clang
CustomCode	6.85 ns	4.78 ns	4.27 ns	6.19 ns
FMTCompileMaster	45.8 ns	27.4 ns	22.3 ns	14.8 ns
FMTCompileMasterNOINLINERemovedFromFill	45.9 ns	5.37 ns	4.82 ns	15.0 ns
FMTCompileSeparateFormatter	16.7 ns	15.4 ns	8.72 ns	6.85 ns
FMTCompileSeparateFormatterNOINLINERemovedFromFill	16.7 ns	6.64 ns	5.95 ns	5.32 ns

Need to figure out what specifically gcc doesn't like about the above code snippet.

I dig a little into the handle_dynamic_spec(), and based on what I saw - every meaningful line (in this function or in functions that are invoked inside) introduces some overhead (mostly it's ~1-2 ns). But the biggest impact comes from basic_format_args and it's get() method (~5 ns). Note that these observations are not very precise, because they actually just cannot be precise because we are talking about compiler optimizations here with all that magic under the hood.

vitaut commented 3 years ago

Thanks for such a comprehensive testing!

I didn't expect that compilers can optimize code like that.

In principle this is mostly inlining and constant propagation. I'm more surprised that clang does a poor job here. I'll try to dig into the generated assembly to see what's going on.

I think we still can have that static only specs optimization in {fmt}.

Right.

I dig a little into the handle_dynamic_spec(), and based on what I saw - every meaningful line (in this function or in functions that are invoked inside) introduces some overhead (mostly it's ~1-2 ns).

True but it should be optimized away completely if dynamic specs are not used.

jovibor commented 3 years ago

@alexezeder That's a nice benchmarks indeed. I hope you'll add a std::format to this contest, when it's ready.

vitaut commented 3 years ago

After a recent series of optimizations the perf is on par with the handwritten code. Here are the results on clang and the current master (https://github.com/fmtlib/fmt/commit/d1aebdbde0fc59c7aa047af19ac58784f8a99e1d):

------------------------------------------------------------------------------------------------------
Benchmark                                                            Time             CPU   Iterations
------------------------------------------------------------------------------------------------------
CustomCode/0                                                      1.86 ns         1.86 ns    351709307
CustomCode/42                                                     3.03 ns         3.02 ns    244754389
CustomCode/273123                                                 5.45 ns         5.43 ns    130468007
CustomCode/9223372036854775807                                    6.79 ns         6.77 ns    102097372
FMTCompileMaster/0                                                1.99 ns         1.99 ns    360523686
FMTCompileMaster/42                                               2.33 ns         2.33 ns    279865664
FMTCompileMaster/273123                                           3.72 ns         3.71 ns    190230315
FMTCompileMaster/9223372036854775807                              5.28 ns         5.26 ns    130711631
ToChars/0                                                         4.42 ns         4.41 ns    160196630
ToChars/42                                                        5.00 ns         4.98 ns    140735201
ToChars/273123                                                    7.26 ns         7.24 ns     95784130
ToChars/9223372036854775807                                       8.77 ns         8.75 ns     75872534

For comparison here's are results on an old version (7.1.3):

Running ./fmt_test
Run on (8 X 2800 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 256 KiB (x4)
  L3 Unified 8192 KiB (x1)
Load Average: 22.02, 12.70, 8.66
------------------------------------------------------------------------------------------------------
Benchmark                                                            Time             CPU   Iterations
------------------------------------------------------------------------------------------------------
CustomCode/0                                                      1.81 ns         1.81 ns    369289856
CustomCode/42                                                     2.66 ns         2.63 ns    277249683
CustomCode/273123                                                 5.30 ns         5.27 ns    131162285
CustomCode/9223372036854775807                                    6.92 ns         6.86 ns    101821144
FMTCompileOld/0                                                   15.5 ns         15.5 ns     43302898
FMTCompileOld/42.                                                 16.6 ns         16.6 ns     43278267
FMTCompileOld/273123                                              18.7 ns         18.6 ns     37035861
FMTCompileOld/9223372036854775807                                 19.4 ns         19.4 ns     35243000
ToChars/0                                                         4.33 ns         4.31 ns    162237581
ToChars/42                                                        5.50 ns         5.44 ns    123228589
ToChars/273123                                                    8.47 ns         8.16 ns     92972600
ToChars/9223372036854775807                                       9.84 ns         9.52 ns     63743569