
ostream uses snprintf for formatting which results in 2x slower performance than libstdc++ #39734

Open Quuxplusone opened 5 years ago

Quuxplusone commented 5 years ago
Bugzilla Link PR40763
Status NEW
Importance P enhancement
Reported by Devin Hussey (husseydevin@gmail.com)
Reported on 2019-02-18 09:06:43 -0800
Last modified on 2019-02-19 12:19:34 -0800
Version 7.0
Hardware Macintosh MacOS X
CC dblaikie@gmail.com, llvm-bugs@lists.llvm.org, mclow.lists@gmail.com
Fixed by commit(s)
Attachments
Blocks
Blocked by
See also
I saw a lot of people complaining about this on Stack Overflow and the like but
I didn't see any bug reports.

std::cout is at least 2x slower in libc++ than it is in libstdc++. In
libstdc++, std::cout has about the same performance as stdio (and it is faster
with sync_with_stdio(false)), but in libc++, it can be more than 2x slower
than stdio.

Clang 7.0.1 vs GCC 8.2.0 (both from Homebrew)

Source code:

#include <iostream>
#include <cstdio>
#include <chrono>
#include <cstddef>
int main() {
    // true or false, libstdc++ is still much faster than libc++
    std::ios_base::sync_with_stdio(true);

    // sorry I prefer clock()
    auto clock = [] () { return std::chrono::high_resolution_clock::now(); };

    auto start = clock();
    for (std::size_t i = 0; i < 1000; i++) {
        std::cout << i << '\n';
    }
    auto end = clock();
    std::fprintf(stderr, "%lld\n", static_cast<long long>(std::chrono::duration_cast<std::chrono::microseconds>(end - start).count()));

    // show printf for comparison
    start = clock();
    for (std::size_t i = 0; i < 1000; i++) {
        std::printf("%zu\n", i);
    }
    end = clock();

    std::fprintf(stderr, "%lld\n", static_cast<long long>(std::chrono::duration_cast<std::chrono::microseconds>(end - start).count()));
}

Problem:

libc++'s std::cout is twice as slow as libstdc++'s.

~ $ clang++ -O3 bench.cpp -std=c++2a
~ $ ./a.out > /dev/null
690
265

~ $ g++-8 -O3 bench.cpp -std=c++2a
~ $ ./a.out > /dev/null
375
284

To prove it isn't a GCC optimization, I forced Clang to use libstdc++:

~ $ clang++ bench.cpp -O3 -stdlib=libstdc++ -std=c++2a -isystem /usr/local/opt/gcc/include/c++/8.2.0/ -isystem /usr/local/opt/gcc/include/c++/8.2.0/x86_64-apple-darwin17.7.0/ -L /usr/local/opt/gcc/lib/gcc/8
warning: include path for stdlibc++ headers not found; pass '-stdlib=libc++' on the command line to use the libc++ standard library instead [-Wstdlibcxx-not-found]
1 warning generated.
~ $ ./a.out > /dev/null
320
245

With sync_with_stdio(false), std::cout becomes faster than printf on
libstdc++, but stays about the same on libc++:

clang++ with libc++:
~ $ ./a.out > /dev/null
697
290

g++ (note: varies a lot)
~ $ ./a.out > /dev/null
164
266

clang++ with libstdc++:
~ $ ./a.out > /dev/null
163
269

Something is clearly off here: stdio performance is essentially identical in
every configuration, yet the iostream numbers differ massively.

Note: Xcode 10's Clang shows the same behavior, and switching the timing
fprintf(stderr, ...) calls to std::cerr makes no difference.

macOS 10.14.2 Mojave (Mojave Patcher)
2.0 GHz MacBook Pro (15-inch, Early 2011)
8 GB RAM
500 GB SSD
2.0 GHz Intel Core i7 "Sandy Bridge" (2635QM), quad core + HT
clang++ 7.0.2
g++ 8.2.0
Quuxplusone commented 5 years ago
After profiling the code, I noticed that there are calls to snprintf_l via
std::num_put<char>::do_put, followed by fwrite.

libstdc++ never calls any of the printf functions; it uses its own routine
(__int_to_char) to convert integers to strings:
    https://code.woboq.org/gcc/libstdc++-v3/include/bits/locale_facets.tcc.html#_ZSt13__int_to_charPT_T0_PKS_St13_Ios_Fmtflagsb

If I do the conversion through strings instead, for example by filling an
array with std::to_string() of each number beforehand and then streaming the
array items, performance is decent.
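
A minimal sketch of that workaround (the container choice and loop bounds are my own assumptions; the original comment doesn't show the code):

#include <iostream>
#include <string>
#include <vector>
#include <cstddef>

int main() {
    // Pre-convert the numbers with std::to_string ahead of time, then stream
    // the ready-made strings; the hot loop no longer formats integers at all.
    std::vector<std::string> strs;
    strs.reserve(1000);
    for (std::size_t i = 0; i < 1000; i++)
        strs.push_back(std::to_string(i));

    for (const std::string& s : strs)
        std::cout << s << '\n';
}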
Quuxplusone commented 5 years ago
This, combined with std::cout.write(str, &str[sizeof(str) - 1] - str), gives
great performance: under 200 microseconds, and ~70-90 microseconds on gcc with
sync_with_stdio(false).

#include <cstdint>
#include <type_traits>

alignas(8)
static const char lut[] =
  "00010203040506070809"
  "10111213141516171819"
  "20212223242526272829"
  "30313233343536373839"
  "40414243444546474849"
  "50515253545556575859"
  "60616263646566676869"
  "70717273747576777879"
  "80818283848586878889"
  "90919293949596979899";

// Reinterpret lut as a short pointer.
static const short* lut16 = reinterpret_cast<const short*>(lut);

// Writes the decimal digits of `value` ending just before `buf` (so `buf`
// must point one past the end of a large-enough buffer) and returns a
// pointer to the first character written.
template <typename T>
char* int_to_string_base10(char *buf, T value) {
  bool negative = false;
  typedef typename std::make_unsigned<T>::type unsigned_type;
  unsigned_type val;
  // Strip the sign by converting to the unsigned type (doing the negation in
  // unsigned arithmetic also handles the most negative value safely).
  if (std::is_signed<T>::value && value < 0) {
    negative = true;
    val = static_cast<unsigned_type>(0) - static_cast<unsigned_type>(value);
  } else {
    val = static_cast<unsigned_type>(value);
  }
  if (val >= 100) {
    // avoid unaligned access
    if ((reinterpret_cast<std::uintptr_t>(buf) & 1) == 1) {
      *--buf = (val % 10) + '0';
      val /= 10;
    }
    // do this two bytes at a time, see Facebook's Three Optimization Tips for C++
    short* buf16  = reinterpret_cast<short*>(buf);
    while (val >= 100) {
      *--buf16 = lut16[val % 100];
      val /= 100;
    }
    buf = reinterpret_cast<char *>(buf16);
  }
  // Finish up the rest
  do {
    *--buf = (val % 10) + '0';
    val /= 10;
  } while (val != 0);

  // Add the negative sign
  if (std::is_signed<T>::value && negative) {
    *--buf = '-';
  }
  return buf;
}
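
For reference, a hedged sketch of how the routine above could be driven from the benchmark loop (the buffer size and variable names are mine, not from the original comment):

#include <iostream>
#include <cstddef>

int main() {
    // Assumes int_to_string_base10 from above is in scope.
    char buf[24];                          // enough for a 64-bit value plus sign
    char* const bufend = buf + sizeof(buf);
    for (std::size_t i = 0; i < 1000; i++) {
        // int_to_string_base10 fills the buffer backwards and returns a
        // pointer to the first digit, so the finished text is [begin, bufend).
        char* begin = int_to_string_base10(bufend, i);
        std::cout.write(begin, bufend - begin);
        std::cout << '\n';
    }
}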

However, even when appending the '\n' to the buffer is not possible,
performance still isn't bad. Is the compiler allowed to optimize cout << '\n'
the way it can turn printf into puts?
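
For concreteness, one way to fold the '\n' into the buffer so a single write() covers both the digits and the newline (again a sketch with my own variable names, not code from the original comment):

#include <iostream>
#include <cstddef>

int main() {
    // Assumes int_to_string_base10 from the earlier comment is in scope.
    char buf[24];
    char* const bufend = buf + sizeof(buf);
    for (std::size_t i = 0; i < 1000; i++) {
        bufend[-1] = '\n';                                   // newline goes last
        char* begin = int_to_string_base10(bufend - 1, i);   // digits end before it
        std::cout.write(begin, bufend - begin);              // one write per line
    }
}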