Quuxplusone / LLVMBugzillaTest


Stack arrays (C-style array and std::array) sometimes compile to ~50% slower code than std::vector for array access #15986

Open Quuxplusone opened 11 years ago

Quuxplusone commented 11 years ago
Bugzilla Link PR15986
Status NEW
Importance P normal
Reported by Mike S (mschamer@zoho.com)
Reported on 2013-05-13 16:54:09 -0700
Last modified on 2013-05-17 03:53:56 -0700
Version trunk
Hardware PC Linux
CC anton@korobeynikov.info, craig.topper@gmail.com, llvm-bugs@lists.llvm.org, rafael@espindo.la
Fixed by commit(s)
Attachments SlowerArrayExample.cpp (5291 bytes, text/x-c++src)
SlowerArrayExample.cpp (5313 bytes, text/x-c++src)
ArrayVsVectorSelfBenchmarking.cpp (6521 bytes, text/x-c++src)
Blocks
Blocked by
See also
Created attachment 10507
Test case:  Comment/uncomment typedefs in main() function to test array vs. vector.

I'm writing a complementary multiply-with-carry random number generator that
can be configured with templates to use a std::vector, std::array, or C-style
array for its internal state.

Strangely, Clang/LLVM generates significantly slower code when stack-based
arrays are used as a drop-in replacement for vectors.  GCC 4.8 generates
performance-comparable code for each, and Clang's vector-based code gives
comparable speed to GCC.  I'm not familiar enough with assembly to know whether
Clang is accessing stack arrays in a fundamentally suboptimal way when
surrounded by certain computations, or if the stack arrays are merely
triggering different code generation for the surrounding computations, but the
resulting assembly has a number of differences, and it's quite a bit worse with
stack arrays.
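For orientation, here is a hedged sketch of the setup being described (this is
a reconstruction, not the attached SlowerArrayExample.cpp): the engine is
templated on its state storage, so array-backed and vector-backed state are
drop-in replacements for one another. The member and type names (state_,
circular_index_, carry_, a, b, r, cmwc_state_array, cmwc_state_vector,
cmwc_engine_minimal) come from the snippets quoted later in this report; the
wrapper layouts and the template parameters A, B, R are my guesses.

    #include <array>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    using result_type = std::uint32_t;

    // Stack-backed state: aligned and zero-initialized, as described below.
    template <std::size_t R>
    struct cmwc_state_array
    {
        alignas(128) std::array<result_type, R> data{};
        result_type& operator[](std::size_t i) { return data[i]; }
        const result_type& operator[](std::size_t i) const { return data[i]; }
    };

    // Heap-backed state with the same element interface.
    template <std::size_t R>
    struct cmwc_state_vector
    {
        std::vector<result_type> data = std::vector<result_type>(R, 0u);
        result_type& operator[](std::size_t i) { return data[i]; }
        const result_type& operator[](std::size_t i) const { return data[i]; }
    };

    // Minimal complementary multiply-with-carry engine over either state type.
    template <class State, std::uint64_t A, std::uint64_t B, std::size_t R>
    class cmwc_engine_minimal
    {
        State state_{};
        std::size_t circular_index_ = 0;
        std::uint_fast64_t carry_ = 0;
    public:
        result_type rand()
        {
            const result_type max = (result_type)(B - 1);
            const std::size_t next_circular_index =
                (circular_index_ == (R - 1)) ? 0 : circular_index_ + 1;
            const std::uint_fast64_t mwc =
                (std::uint_fast64_t(state_[circular_index_]) * A) + carry_;
            const result_type result = max - (result_type)(mwc % B);
            state_[circular_index_] = result;
            circular_index_ = next_circular_index;
            carry_ = mwc / B;
            return result;
        }
    };

Switching the State template argument between the array-backed and
vector-backed types (the typedef toggle the attachment mentions) is then the
only source change between the two timings below.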

Compilation line:
    clang++ -Wall -Wextra -std=c++11 -msse4.1 -O3 SlowerArrayExample.cpp

Test line:
    time ./a.out

Array Timing:
    real    0m2.875s
    user    0m2.833s
    sys 0m0.002s

Vector Timing:
    real    0m1.994s
    user    0m1.969s
    sys 0m0.002s

I've controlled for alignment/cache issues by using alignas(128) with the stack
arrays, and I've controlled for initialization caching issues by zero-
initializing in both cases.  My original test program used clock_gettime()
around the test loop only, so the issue is related to the array accesses
(either reads or writes) in the test loop, not array/vector construction or
initialization.

This is pretty unintuitive:  I'd expect array/vector access to be equivalent.
If there's any performance difference at all, I'd expect it to work in slight
favor of stack arrays, since the start address of a stack array can be directly
calculated based on the stack pointer, whereas the start address of a vector
needs to be loaded from a member pointer.
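To make that intuition concrete, here is a hypothetical pair of accessors (not
from the attachment) annotated with the address computation one would naively
expect for each:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    std::uint32_t sum_stack_array(std::size_t n)
    {
        std::uint32_t buf[16] = {1, 2, 3, 4};   // lives at a fixed offset from the stack pointer
        std::uint32_t sum = 0;
        for (std::size_t i = 0; i < n && i < 16; ++i)
            sum += buf[i];                      // address: stack pointer + constant + 4*i
        return sum;
    }

    std::uint32_t sum_vector(const std::vector<std::uint32_t>& v)
    {
        std::uint32_t sum = 0;
        for (std::size_t i = 0; i < v.size(); ++i)
            sum += v[i];                        // address: (data pointer loaded from v) + 4*i
        return sum;
    }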

I've attached the most minimal problem case I can find, but I've tried several
things to further isolate the "triggers" for this behavior, and the following
information may be helpful:

WHAT SOURCE CHANGES EQUALIZE VECTOR/ARRAY IMPLEMENTATIONS?
The slowdown with stack arrays disappears if I remove all of the computations
and turn rand() into something silly like this:
    result_type rand()
    {
        //  Not semantically valid:
        const size_t next_circular_index = (circular_index_ == (r - 1)) ? 0 : circular_index_ + 1;
        result_type result = state_[circular_index_] + a;
        state_[circular_index_] = result;
        circular_index_ = next_circular_index;
        return result;
    }

In other words, the slow stack array accesses only emerge in the presence of
some minimal amount of surrounding computation.  Also, the array and vector
versions perform equally [slowly] if I replace the ternary operator (in the
original code) with the following:
        const size_t next_circular_index = (circular_index_ + 1) % r;

Using a full modulus operation makes the generator run a full three times
slower than the original vector version, but the array/vector versions do
technically perform equally in that case.  However, the array slowdown does not
seem specifically related to the branch in the ternary operator, because arrays
are still slower than vectors if I replace the line with the following (valid
only for power-of-2 r's):
        const size_t next_circular_index = (circular_index_ + 1) & (r - 1);
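Gathered in one place, the three index-update variants compared above, wrapped
in stand-alone functions; the performance notes restate the observations from
this report:

    #include <cstddef>

    constexpr std::size_t next_ternary(std::size_t i, std::size_t r)
    {
        return (i == r - 1) ? 0 : i + 1;    // original code: arrays slower than vectors
    }

    constexpr std::size_t next_modulus(std::size_t i, std::size_t r)
    {
        return (i + 1) % r;                 // ~3x slower overall, but arrays == vectors
    }

    constexpr std::size_t next_masked(std::size_t i, std::size_t r)   // power-of-2 r only
    {
        return (i + 1) & (r - 1);           // arrays still slower than vectors
    }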

WHAT SOURCE CHANGES SEEM IRRELEVANT?
The slowdown doesn't seem to be a result of suboptimal source code, because
it's pretty much unaffected by source-level micro-optimization.

That is to say, I've tried all of the following in an attempt to debug this (in
addition to what I mentioned above):
- Saving every member/array read to a temporary local variable and rearranging
the code lines into literally 600 different semantically valid permutations
- Manually replacing division and modulus with shift and "and" operations for
base 2^32 (and something a bit more complicated for base 2^32 - 1); see the
sketch after this list
- Manually replacing multiplication with shift-and-add/subtract-shifted for
relevant multiplier 'a' values (multipliers that are powers of 2 away from
powers of 2)
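As a sketch of the base-2^32 rewrite mentioned above (the exact code used in
testing isn't included in this report): when the base b is 2^32, the modulus is
just the low 32 bits of the 64-bit product and the division is a 32-bit right
shift.

    #include <cstdint>

    inline std::uint32_t mod_base_2_32(std::uint64_t mwc)
    {
        return (std::uint32_t)(mwc & 0xFFFFFFFFull);    // equivalent to mwc % (1ull << 32)
    }

    inline std::uint64_t div_base_2_32(std::uint64_t mwc)
    {
        return mwc >> 32;                               // equivalent to mwc / (1ull << 32)
    }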

Some of these source-level tweaks affect the overall speed of the generator,
but the stack version always incurs a slowdown compared to the vector version.
The slowdown doesn't seem dependent on the specific presence of multiply or
division operators in the source.  However, it could still be dependent on
shift instructions, "and" instructions, branches, or combinations thereof,
since the arithmetically gutted code above doesn't reproduce the problem, and
neither does the [slower] code using a modulus operation to update the circular
buffer index.
Quuxplusone commented 11 years ago

Attached SlowerArrayExample.cpp (5291 bytes, text/x-c++src): Test case: Comment/uncomment typedefs in main() function to test array vs. vector.

Quuxplusone commented 11 years ago

Attached SlowerArrayExample.cpp (5313 bytes, text/x-c++src): Test case (with fixed memset bug): Comment/uncomment typedefs in main() function to test array vs. vector.

Quuxplusone commented 11 years ago
After some more testing, I have some more details that might help track this
down.

This is what I get for doing all of my extensive testing and timing with
different code permutations *before* minimizing the test case:  As it turns
out, rearranging the order of critical lines in the specific test case I
attached has a much greater effect than I previously indicated.  Specifically,
using the following rand() implementation gives ideal (and comparable)
performance for both arrays and vectors:

    result_type rand()
    {
        const result_type max = (result_type)(b - 1);
        const size_t next_circular_index = (circular_index_ == (r - 1)) ? 0 : circular_index_ + 1;
        const uint_fast64_t mwc = (uint_fast64_t(state_[circular_index_]) * a) + carry_;
        const result_type result = max - (result_type)(mwc % b);
        state_[circular_index_] = result;
        circular_index_ = next_circular_index;
        carry_ = mwc / b;
        return result;
    }

The main difference of course is that the final write to the array is done as
early as possible, which masks the latency when the optimizer maintains that
order.  Strangely, vectors still perform much better with this code in the
context of my full (messier) test program, so even this can still be a
pathological case depending on the surrounding code.

Long story short, the underlying code generation bug still remains:  For some
reason, there are pathological cases where Clang produces much slower code for
array accesses than for vector accesses (perhaps specifically array/vector
writes), which is quite unintuitive.
Quuxplusone commented 11 years ago

Attached ArrayVsVectorSelfBenchmarking.cpp (6521 bytes, text/x-c++src): Test case: Self-benchmark arrays vs. vectors with Linux sys/time.h.

Quuxplusone commented 11 years ago
Oops, I sent that last comment prematurely.

The newer test case produces bloated assembly due to all of the includes, and
benchmarking arrays and vectors in the same executable runs the risk of
differences from caching, but it gives an interesting perspective nonetheless.
Here are some benchmarks:

1.) Non-inlined BenchmarkRNG(), slow circular buffer updates:
    GCC 4.7 Benchmark Time:
        Array     :  2.440034236s
        Vector    :  2.365274656s

    GCC 4.8 Benchmark Time:
        Array     :  2.526191167s
        Vector    :  2.463438398s

    Clang 3.2 Benchmark Time:
        Array     :  2.331447587s
        Vector    :  2.547286271s

    Clang 3.3+ Benchmark Time:
        Array     :  3.271941898s
        Vector    :  2.553101321s

2.) Non-inlined BenchmarkRNG(), fast circular buffer updates:
    GCC 4.7 Benchmark Time:
        Array     :  2.104370443s
        Vector    :  1.901762568s

    GCC 4.8 Benchmark Time:
        Array     :  1.953596915s
        Vector    :  1.942452867s

    Clang 3.2 Benchmark Time:
        Array     :  2.354230493s
        Vector    :  1.824467609s

    Clang 3.3+ Benchmark Time:
        Array     :  2.749077046s
        Vector    :  1.85680768s

3.) Inlined BenchmarkRNG(), slow circular buffer updates:
    GCC 4.7 Benchmark Time:
        Array     :  2.132182857s
        Vector    :  2.30057284s

    GCC 4.8 Benchmark Time:
        Array     :  2.360156631s
        Vector    :  2.610849987s

    Clang 3.2 Benchmark Time:
        Array     :  2.116218351s
        Vector    :  2.003197595s

    Clang 3.3+ Benchmark Time:
        Array     :  2.08847722s
        Vector    :  2.034692403s

4.) Inlined BenchmarkRNG(), fast circular buffer updates:
    GCC 4.7 Benchmark Time:
        Array     :  2.186518931s
        Vector    :  1.853176545s

    GCC 4.8 Benchmark Time:
        Array     :  2.211062684s
        Vector    :  1.847149512s

    Clang 3.2 Benchmark Time:
        Array     :  1.779760716s
        Vector    :  1.768924408s

    Clang 3.3+ Benchmark Time:
        Array     :  1.848627263s
        Vector    :  1.830769086s

Using Clang 3.2, arrays are only significantly slower than vectors for case 2.
Using Clang 3.3+, arrays are significantly slower than vectors for cases 1 and
2, and both are regressions from Clang 3.2.
Quuxplusone commented 11 years ago

One interesting thing I noticed: the array version ends up with stores and loads to circular_index_ and carry_ inside the loop. In the vector version these are able to stay in registers through the loop.

If you make cmwc_state_array a wrapper around a heap-allocated array instead of a statically sized array that ends up on the stack, it should perform similarly to the std::vector version.

When the stack size of the cmwc_engine_minimal object is large, LLVM seems to preserve all loads/stores to it, including to circular_index_ and carry_. When it's small, as it would be for a vector or heap-allocated array, LLVM is able to remove the accesses.
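For illustration, a rough sketch of that suggestion (hypothetical; the real
cmwc_state_array interface may differ): keep the same element access, but put
the storage behind a heap allocation so the engine object on the stack stays
small.

    #include <cstddef>
    #include <memory>

    template <typename T, std::size_t R>
    class cmwc_state_heap_array
    {
        std::unique_ptr<T[]> data_{new T[R]()};   // trailing () value-initializes, i.e. zeroes the elements
    public:
        T& operator[](std::size_t i) { return data_[i]; }
        const T& operator[](std::size_t i) const { return data_[i]; }
        static constexpr std::size_t size() { return R; }
    };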

Quuxplusone commented 11 years ago
Good catch recognizing the extra loads and stores, Craig. :)

I figured the difference revolved around stack vs. heap storage, as opposed to
raw arrays vs. encapsulated arrays:  The equal performance of std::array and
raw stack arrays indicated as much, but it's nice to have confirmation.  I
can't actually change cmwc_state_array to use heap allocation though, since
that's what cmwc_state_vector is for:
The point of cmwc_state_array is to provide an option for stack allocation in
the event someone wants to contiguously store a large number of small,
independent generators, or if the heap isn't an option for some other reason.
Those are probably niche use cases, but the general problem of subpar code
generation around stack arrays is bound to pop up in a lot of other contexts as
well.

Anyway, I guess the implication here is that "larger" stack objects may be
making LLVM "give up" on keeping frequently-used members in registers, or
something along those lines.  Unfortunately, even a 32-byte object seems to
qualify as large enough to run up against this behavior.

Strangely enough though, for the specific test case I've given, the vector and
array implementations both presumably take up the same 32 bytes of state on the
stack anyway (given a 64-bit architecture).  Both store the carry and circular
buffer index on the stack, and I'm guessing the four 32-bit values in the array
version take the same space as what I expect are a 64-bit heap pointer and
64-bit size member in the vector version.

The compiler's job seems easier for the vector version, but it seems like there
could be a few things causing the difference:
- Is it because LLVM sees fewer options with vector, and picks the obvious ones
for persistent register storage?  The vector size is never loaded at all, and
if the pointer to the heap causes LLVM to ignore the actual heap array,
circular_index_ and carry_ are some of the only variables still left
standing.
- Is it because LLVM sees so many options with vector that it aggressively
prunes anything having to do with the heap array, leaving more obvious choices
for persistent register storage?  The combination of (heap pointer + vector
size + heap array elements) could conceivably give LLVM more things to think
about, so it might be throwing some out.
- Is it just a bug arising from some kind of internal edge case or corner case?

I'm just thinking aloud here of course...