Next gen non-allocating constexpr-folding future-promise does not optimise well on clang

Quuxplusone commented 9 years ago


Bugzilla Link	PR23652
Status	NEW
Importance	P normal
Reported by	Niall Douglas (s_bugzilla@nedprod.com)
Reported on	2015-05-25 19:03:49 -0700
Last modified on	2015-06-11 08:51:38 -0700
Version	trunk
Hardware	PC Linux
CC	anton@korobeynikov.info, chandlerc@gmail.com, llvm-bugs@lists.llvm.org

While working on next-generation non-allocating, constexpr-folding
future-promises for the Boost.Thread rewrite (in the hope that these become the
next STL future-promises), I have found that clang currently optimises them
noticeably worse than GCC.

I have spoken with Chandler Carruth about these at C++ Now, and he may chime in
here about the importance of clang matching GCC in performance with these. I am
also raising these with colleagues on the MSVC team, as poor old VS2015
generates about 3000 opcodes for the last example :(.

Anyway, as a quick summary, under these next-gen future-promises this sequence:

extern BOOST_SPINLOCK_NOINLINE int test1()
{
  using namespace boost::spinlock::lightweight_futures;
  monad<int, true> m(5);
  return m.get();
}

... should turn into:

0000000000000000 <_Z5test1v>:
   0:   b8 05 00 00 00          mov    $0x5,%eax
   5:   c3                      retq

... and indeed does under GCC, but under clang 3.6 and 3.7 turns into:

0000000000000000 <_Z5test1v>:
   0:   53                      push   %rbx
   1:   48 83 ec 20             sub    $0x20,%rsp
   5:   c7 44 24 08 05 00 00    movl   $0x5,0x8(%rsp)
   c:   00
   d:   c7 44 24 18 01 00 00    movl   $0x1,0x18(%rsp)
  14:   00
  15:   48 8d 7c 24 08          lea    0x8(%rsp),%rdi
  1a:   e8 00 00 00 00          callq  1f <_Z5test1v+0x1f>
  1f:   89 c3                   mov    %eax,%ebx
  21:   8b 44 24 18             mov    0x18(%rsp),%eax
  25:   ff c8                   dec    %eax
  27:   83 f8 03                cmp    $0x3,%eax
  2a:   77 24                   ja     50 <_Z5test1v+0x50>
  2c:   ff 24 c5 00 00 00 00    jmpq   *0x0(,%rax,8)
  33:   48 8d 7c 24 08          lea    0x8(%rsp),%rdi
  38:   e8 00 00 00 00          callq  3d <_Z5test1v+0x3d>
  3d:   eb 09                   jmp    48 <_Z5test1v+0x48>
  3f:   48 c7 44 24 08 00 00    movq   $0x0,0x8(%rsp)
  46:   00 00
  48:   c7 44 24 18 00 00 00    movl   $0x0,0x18(%rsp)
  4f:   00
  50:   89 d8                   mov    %ebx,%eax
  52:   48 83 c4 20             add    $0x20,%rsp
  56:   5b                      pop    %rbx
  57:   c3                      retq
  58:   48 89 c3                mov    %rax,%rbx
  5b:   8b 44 24 18             mov    0x18(%rsp),%eax
  5f:   ff c8                   dec    %eax
  61:   83 f8 03                cmp    $0x3,%eax
  64:   77 24                   ja     8a <_Z5test1v+0x8a>
  66:   ff 24 c5 00 00 00 00    jmpq   *0x0(,%rax,8)
  6d:   48 8d 7c 24 08          lea    0x8(%rsp),%rdi
  72:   e8 00 00 00 00          callq  77 <_Z5test1v+0x77>
  77:   eb 09                   jmp    82 <_Z5test1v+0x82>
  79:   48 c7 44 24 08 00 00    movq   $0x0,0x8(%rsp)
  80:   00 00
  82:   c7 44 24 18 00 00 00    movl   $0x0,0x18(%rsp)
  89:   00
  8a:   48 89 df                mov    %rbx,%rdi
  8d:   e8 00 00 00 00          callq  92 <_Z5test1v+0x92>
  92:   66 66 66 66 66 2e 0f    data32 data32 data32 data32 nopw %cs:0x0(%rax,%rax,1)
  99:   1f 84 00 00 00 00 00

This is highly unfortunate, because monad is the base class of future, and
therefore this promise-future sequence:

extern BOOST_SPINLOCK_NOINLINE int test1()
{
  using namespace boost::spinlock::lightweight_futures;
  promise<int> p;
  p.set_value(5);
  future<int> f(p.get_future());
  return f.get();
}

... which under GCC correctly turns into:

0000000000000010 <_Z5test1v>:
  10:   b8 05 00 00 00          mov    $0x5,%eax
  15:   c3                      retq

... under clang 3.6 and 3.7 most unfortunately turns into:

0000000000000000 <_Z5test1v>:
   0:   53                      push   %rbx
   1:   48 83 ec 50             sub    $0x50,%rsp
   5:   c7 44 24 40 00 00 00    movl   $0x0,0x40(%rsp)
   c:   00
   d:   c6 44 24 48 00          movb   $0x0,0x48(%rsp)
  12:   c7 44 24 2c 05 00 00    movl   $0x5,0x2c(%rsp)
  19:   00
  1a:   48 8d 5c 24 30          lea    0x30(%rsp),%rbx
  1f:   48 8d 74 24 2c          lea    0x2c(%rsp),%rsi
  24:   48 89 df                mov    %rbx,%rdi
  27:   e8 00 00 00 00          callq  2c <_Z5test1v+0x2c>
  2c:   48 8d 3c 24             lea    (%rsp),%rdi
  30:   48 89 de                mov    %rbx,%rsi
  33:   e8 00 00 00 00          callq  38 <_Z5test1v+0x38>
  38:   48 8d 3c 24             lea    (%rsp),%rdi
  3c:   e8 00 00 00 00          callq  41 <_Z5test1v+0x41>
  41:   89 c3                   mov    %eax,%ebx
  43:   48 8d 3c 24             lea    (%rsp),%rdi
  47:   e8 00 00 00 00          callq  4c <_Z5test1v+0x4c>
  4c:   48 8d 7c 24 30          lea    0x30(%rsp),%rdi
  51:   e8 00 00 00 00          callq  56 <_Z5test1v+0x56>
  56:   89 d8                   mov    %ebx,%eax
  58:   48 83 c4 50             add    $0x50,%rsp
  5c:   5b                      pop    %rbx
  5d:   c3                      retq
  5e:   48 89 c3                mov    %rax,%rbx
  61:   eb 0c                   jmp    6f <_Z5test1v+0x6f>
  63:   48 89 c3                mov    %rax,%rbx
  66:   48 8d 3c 24             lea    (%rsp),%rdi
  6a:   e8 00 00 00 00          callq  6f <_Z5test1v+0x6f>
  6f:   48 8d 7c 24 30          lea    0x30(%rsp),%rdi
  74:   e8 00 00 00 00          callq  79 <_Z5test1v+0x79>
  79:   48 89 df                mov    %rbx,%rdi
  7c:   e8 00 00 00 00          callq  81 <_Z5test1v+0x81>
  81:   66 66 66 66 66 66 2e    data32 data32 data32 data32 data32 nopw %cs:0x0(%rax,%rax,1)
  88:   0f 1f 84 00 00 00 00
  8f:   00

... which I should imagine would be quite a performance penalty.
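
For readers without the library to hand, the shape of the first test can be approximated with a small stand-in. This is only a sketch under stated assumptions: `mini_monad`, `state_t` and `test1_sketch` are hypothetical names, and the class is not the Boost.Spinlock implementation. It merely pairs a value with a state tag so that `get()` has a branch the optimiser could fold away when the state is known.

```cpp
#include <stdexcept>

// Hypothetical stand-in for the library's monad<int>: a value plus a state tag.
// NOT the Boost.Spinlock implementation, only an illustration of the pattern.
enum class state_t { empty, value, error };

class mini_monad {
  int value_;
  state_t state_;

public:
  constexpr mini_monad() : value_(0), state_(state_t::empty) {}
  constexpr explicit mini_monad(int v) : value_(v), state_(state_t::value) {}

  // Returns the stored value, or throws if no value was set.
  constexpr int get() const {
    return state_ == state_t::value ? value_
                                    : throw std::logic_error("no value set");
  }
};

// Counterpart of test1(): the state is set in full view of the optimiser,
// so the throwing branch of get() can never be taken here.
int test1_sketch() {
  mini_monad m(5);
  return m.get();
}

// Language-level check that the whole sequence can be evaluated at compile time.
static_assert(mini_monad(5).get() == 5, "get() is foldable for a known value");
```

Compiling such a file with optimisation enabled and disassembling the object file is one way to compare compilers on a reduced case. Whether this reduced type reproduces the difference reported above is not something the sketch establishes.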

I asked clang to -save-temps; the dumps for all the unit tests, along with the
command options used, can be found at:

clang 3.5:

https://drive.google.com/file/d/0B5QDPUNHLpKMcTJXd2lqZ1lKNTA/view?usp=sharing

clang 3.6:

https://drive.google.com/file/d/0B5QDPUNHLpKMQ1g1SU9WbUJiWWc/view?usp=sharing

clang 3.7:

https://drive.google.com/file/d/0B5QDPUNHLpKMaUNNWXhqSi1oM3c/view?usp=sharing

Niall

Quuxplusone commented 9 years ago

An update to this bug: the Microsoft Visual Studio team have very kindly helped
me adjust this code so that MSVC collapses these lightweight future-promises
into next to no opcodes. The current counts of x64 opcodes generated for each
of these tests are as follows:

                                       GCC 5.1   VS2015   clang 3.7
future_construct_destruct.cpp             0         3         7
future_construct_move_destruct.cpp        0         3        18
monad_construct_destruct.cpp              0         3         0
monad_construct_value_destruct.cpp        1         4        33
promise_construct_destruct.cpp            0         3         6
promise_construct_move_destruct.cpp       0         3        18
promise_future_reduce.cpp                 1      3111        32

GCC has *perfect* results. The terrible result for promise_future_reduce on
VS2015 is hoped to be fixed by Microsoft shortly after RTM.

This now makes clang 3.7 look particularly poor relative to GCC and MSVC.

I'll give you fair warning now: I expect to present on these at C++ conferences
in Bristol and Seattle from Spring 2016 onwards. If clang is still performing
poorly relative to the others, I'll say so publicly, and with benchmarks.

Niall

Quuxplusone commented 9 years ago

(In reply to comment #1)
> GCC has *perfect* results. The terrible result for promise_future_reduce on
> VS2015 is hoped to be fixed by Microsoft shortly after RTM.
>
> This now makes clang 3.7 look particularly poor relative to GCC and MSVC.

More useful information for this bug report. For the monadic transport alone,
here are the best case and worst case opcode generation results:

clang 3.7
59 opcodes <= Value transport <= 37 opcodes
7 opcodes <= Error transport <= 52 opcodes
38 opcodes <= Exception transport <= 39 opcodes

GCC 5.1
1 opcode <= Value transport <= 113 opcodes
8 opcodes <= Error transport <= 119 opcodes
22 opcodes <= Exception transport <= 214 opcodes

VS2015
4 opcodes <= Value transport <= 1881 opcodes
6 opcodes <= Error transport <= 164 opcodes
1946 opcodes <= Exception transport <= 1936 opcodes

The maximum is measured by taking a monad in from a non-visible source, so that
the compiler has to generate all code paths to handle an unknown (variant)
input state. The minimum is measured by setting a monad's state in view of the
compiler's optimiser, such that it can usually elide the generated opcodes
completely. Note that this varies enormously by compiler, to the extent that
the known-state code can generate more opcodes than the unknown-state code,
which is why some minimums above exceed their maximums. From an optimiser's
perspective, in the minimum case it can elide forwards all code paths which
could never be executed anyway, whilst in the maximum case it can elide code
paths backwards on the basis that the only two outcomes are fetching the
value/error/exception or throwing an exception.
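
The two measurement set-ups described above can be sketched roughly as follows. The `transport` struct and the function names are hypothetical and far simpler than the real monad. The only point is the contrast between a state the optimiser can see and a state that arrives from outside the function.

```cpp
#include <stdexcept>

// Hypothetical three-state transport, for illustration only.
struct transport {
  enum kind { empty, has_value, has_error } state;
  int payload;  // the value, or an error code when state == has_error
};

// Shared accessor: returns the value, or throws for any other state.
inline int get(const transport &t) {
  switch (t.state) {
  case transport::has_value:
    return t.payload;
  case transport::has_error:
    throw std::runtime_error("transport holds an error");
  default:
    throw std::logic_error("transport is empty");
  }
}

// "Minimum" case: the state is set in view of the optimiser, so the
// throwing paths are provably dead and could be elided forwards.
int known_state() {
  transport t{transport::has_value, 5};
  return get(t);
}

// "Maximum" case: the state comes from the caller, so a standalone body
// of this function has to handle every state the transport might hold.
int unknown_state(const transport &t) {
  return get(t);
}
```

Counting the instructions emitted for `known_state` and `unknown_state` separately gives one minimum/maximum pair in the sense used above. The figures quoted in this report come from the real library, not from this sketch.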

In terms of maximum opcodes generated, clang 3.7 does very well indeed, the
best of any of the compilers, showing that it optimises backwards very well.
GCC 5.1 does best in minimum opcodes generated, showing that it optimises
forwards very well. VS2015 is somewhere between great and terrible, though it
has no choice if an exception_ptr is ever touched, as Microsoft unfortunately
require some very hard memory effects for backward ABI compatibility.

If you could get clang's minimum case (forward optimisation) as good as GCC's,
that would be enormous.

Niall
