Quuxplusone opened 11 years ago
Attached SlowerArrayExample.cpp (5291 bytes, text/x-c++src): Test case: Comment/uncomment typedefs in main() to test array vs. vector.
Attached SlowerArrayExample.cpp (5313 bytes, text/x-c++src): Test case (with fixed memset bug): Comment/uncomment typedefs in main() to test array vs. vector.
After some more testing, I have some more details that might help track this
down.
This is what I get for doing all of my extensive testing and timing with
different code permutations *before* minimizing the test case: As it turns
out, rearranging the order of critical lines in the specific test case I
attached has a much greater effect than I previously indicated. Specifically,
using the following rand() implementation gives ideal (and comparable)
performance for both arrays and vectors:
result_type rand()
{
    const result_type max = (result_type)(b - 1);
    // Precompute the wrapped index for the circular state buffer.
    const size_t next_circular_index =
        (circular_index_ == (r - 1)) ? 0 : circular_index_ + 1;
    // One multiply-with-carry step.
    const uint_fast64_t mwc =
        (uint_fast64_t(state_[circular_index_]) * a) + carry_;
    const result_type result = max - (result_type)(mwc % b);
    state_[circular_index_] = result;  // the array write is issued as early as possible
    circular_index_ = next_circular_index;
    carry_ = mwc / b;
    return result;
}
The main difference of course is that the final write to the array is done as
early as possible, which masks the latency when the optimizer maintains that
order. Strangely, vectors still perform much better with this code in the
context of my full (messier) test program, so even this can still be a
pathological case depending on the surrounding code.
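For reference, the slower ordering looked roughly like this (a sketch, not the exact attached code), with the final array write issued last:

result_type rand()
{
    const result_type max = (result_type)(b - 1);
    const uint_fast64_t mwc =
        (uint_fast64_t(state_[circular_index_]) * a) + carry_;
    const result_type result = max - (result_type)(mwc % b);
    carry_ = mwc / b;
    state_[circular_index_] = result;  // write happens late, exposing its latency
    circular_index_ = (circular_index_ == (r - 1)) ? 0 : circular_index_ + 1;
    return result;
}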
Long story short, the underlying code generation bug still remains: For some
reason, there are pathological cases where Clang produces much slower code for
array accesses than for vector accesses (perhaps specifically array/vector
writes), which is quite unintuitive.
Attached ArrayVsVectorSelfBenchmarking.cpp (6521 bytes, text/x-c++src): Test case: Self-benchmark arrays vs. vectors with Linux sys/time.h.
Oops, I sent that last comment prematurely.
The newer test case produces bloated assembly due to all of the includes, and
benchmarking arrays and vectors in the same executable runs the risk of
differences from caching, but it gives an interesting perspective nonetheless.
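For reference, a minimal sketch of the sys/time.h self-benchmarking approach (a hypothetical harness, not the attached file verbatim):

#include <sys/time.h>
#include <cstddef>

// Wall-clock time in seconds via gettimeofday() from sys/time.h.
static double WallSeconds()
{
    timeval tv;
    gettimeofday(&tv, 0);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

// Times `iterations` calls of any engine exposing a rand() member.
template <typename Engine>
double BenchmarkRNG(Engine& engine, std::size_t iterations)
{
    volatile typename Engine::result_type sink = 0;
    const double start = WallSeconds();
    for (std::size_t i = 0; i < iterations; ++i)
        sink = engine.rand();  // volatile sink keeps the loop from being optimized away
    return WallSeconds() - start;
}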
Here are some benchmarks:
1.) Non-inlined BenchmarkRNG(), slow circular buffer updates:
    GCC 4.7:    Array: 2.440034236s    Vector: 2.365274656s
    GCC 4.8:    Array: 2.526191167s    Vector: 2.463438398s
    Clang 3.2:  Array: 2.331447587s    Vector: 2.547286271s
    Clang 3.3+: Array: 3.271941898s    Vector: 2.553101321s
2.) Non-inlined BenchmarkRNG(), fast circular buffer updates:
    GCC 4.7:    Array: 2.104370443s    Vector: 1.901762568s
    GCC 4.8:    Array: 1.953596915s    Vector: 1.942452867s
    Clang 3.2:  Array: 2.354230493s    Vector: 1.824467609s
    Clang 3.3+: Array: 2.749077046s    Vector: 1.856807680s
3.) Inlined BenchmarkRNG(), slow circular buffer updates:
    GCC 4.7:    Array: 2.132182857s    Vector: 2.300572840s
    GCC 4.8:    Array: 2.360156631s    Vector: 2.610849987s
    Clang 3.2:  Array: 2.116218351s    Vector: 2.003197595s
    Clang 3.3+: Array: 2.088477220s    Vector: 2.034692403s
4.) Inlined BenchmarkRNG(), fast circular buffer updates:
    GCC 4.7:    Array: 2.186518931s    Vector: 1.853176545s
    GCC 4.8:    Array: 2.211062684s    Vector: 1.847149512s
    Clang 3.2:  Array: 1.779760716s    Vector: 1.768924408s
    Clang 3.3+: Array: 1.848627263s    Vector: 1.830769086s
Using Clang 3.2, arrays are only significantly slower than vectors for case 2.
Using Clang 3.3+, arrays are significantly slower than vectors for cases 1 and
2, and both are regressions from Clang 3.2.
One interesting thing I noticed: the array version ends up with stores and loads to circular_index_ and carry_ inside the loop. In the vector version, these are able to stay in registers through the loop.
If you make cmwc_state_array a wrapper around a heap-allocated array instead of a statically sized array that ends up on the stack, it should perform similarly to the std::vector version.
When the stack size of the cmwc_engine_minimal object is large, LLVM seems to preserve all loads/stores to it, including those to circular_index_ and carry_. When it's small, as it would be for a vector or heap-allocated array, LLVM is able to remove the accesses.
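A minimal sketch of that suggestion (hypothetical names; the real interface is in the attached file):

#include <cstddef>
#include <memory>

// Heap-backed stand-in for cmwc_state_array: the elements move to the heap,
// so the on-stack engine object stays small and LLVM can keep members like
// circular_index_ and carry_ in registers across the generation loop.
template <typename T, std::size_t N>
class cmwc_state_heap_array
{
public:
    cmwc_state_heap_array() : data_(new T[N]()) {}
    T&       operator[](std::size_t i)       { return data_[i]; }
    const T& operator[](std::size_t i) const { return data_[i]; }
    static std::size_t size() { return N; }
private:
    std::unique_ptr<T[]> data_;  // only this pointer lives on the stack
};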
Good catch recognizing the extra loads and stores, Craig. :)
I figured the difference revolved around stack vs. heap storage, as opposed to
raw arrays vs. encapsulated arrays: The equal performance of std::array and
raw stack arrays indicated as much, but it's nice to have confirmation. I
can't actually change cmwc_state_array to use heap allocation though, since
that's what cmwc_state_vector is for:
The point of cmwc_state_array is to provide an option for stack allocation in
the event someone wants to contiguously store a large number of small,
independent generators, or if the heap isn't an option for some other reason.
Those are probably niche use cases, but the general problem of subpar code
generation around stack arrays is bound to pop up in a lot of other contexts as
well.
Anyway, I guess the implication here is that "larger" stack objects may be
making LLVM "give up" on keeping frequently-used members in registers, or
something along those lines. Unfortunately, even a 32-byte object seems to
qualify as large enough to run up against this behavior.
Strangely enough though, for the specific test case I've given, the vector and
array implementations both presumably take up the same 32 bytes of state on the
stack anyway (given a 64-bit architecture). Both store the carry and circular
buffer index on the stack, and I'm guessing the four 32-bit values in the array
version are the same size as what I expect are a 64-bit heap pointer and a
64-bit size member in the vector version.
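To make that comparison concrete, here's a sketch of the two layouts as I picture them (assuming result_type is a 32-bit type and r == 4; the actual types are in the attached file):

#include <cstddef>
#include <cstdint>
#include <vector>

// Array variant: the elements themselves sit on the stack, inside the object.
struct array_state_sketch
{
    std::uint32_t state_[4];       // 16 bytes of elements, directly in the object
    std::uint64_t carry_;
    std::size_t   circular_index_;
};

// Vector variant: only the bookkeeping sits on the stack; elements are on the heap.
struct vector_state_sketch
{
    std::vector<std::uint32_t> state_;  // heap pointer + size bookkeeping
    std::uint64_t carry_;
    std::size_t   circular_index_;
};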
The compiler's job seems easier for the vector version, but it seems like there
could be a few things causing the difference:
- Is it because LLVM sees fewer options with vector, and picks the obvious ones
for persistent register storage? The vector size is never loaded at all, and
if the pointer to the heap causes LLVM to ignore the actual heap array,
circular_index_ and carry_ are some of the only variables still left standing.
- Is it because LLVM sees so many options with vector that it aggressively
prunes anything having to do with the heap array, leaving more obvious choices
for persistent register storage? The combination of (heap pointer + vector
size + heap array elements) could conceivably give LLVM more things to think
about, so it might be throwing some out.
- Is it just a bug arising from some kind of internal edge case or corner case?
I'm just thinking aloud here of course...