poor codegen for unaligned fixed-size memcpy/memmove

Quuxplusone commented 9 years ago


Bugzilla Link	PR21541
Status	RESOLVED FIXED
Importance	P normal
Reported by	Sean Silva (silvasean@google.com)
Reported on	2014-11-11 23:56:14 -0800
Last modified on	2014-12-02 10:26:31 -0800
Version	trunk
Hardware	PC All
CC	andrea.dibiagio@gmail.com, geek4civic@gmail.com, greg.bedwell@sony.com, hfinkel@anl.gov, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, rob.lougher.llvm@gmail.com, spatel+llvm@rotateright.com
Fixed by commit(s)
Attachments	`memtester.c` (5474 bytes, text/plain) `memloops.s` (4072 bytes, text/plain)
Blocks
Blocked by
See also	PR21709

void copy32byte(void *pDst, const void *pSrc) {
  __builtin_memmove(pDst, pSrc, 32);
}

clang++ -O3 -march=btver2 -c testcopy.cpp -o test.o

generates:

__Z10copy32bytePvPKv:
0000000000000000        pushq   %rbp
0000000000000001        movq    %rsp, %rbp
0000000000000004        movq    (%rsi), %rax
0000000000000007        movq    0x8(%rsi), %rcx
000000000000000b        movq    0x10(%rsi), %rdx
000000000000000f        movq    0x18(%rsi), %rsi
0000000000000013        movq    %rsi, 0x18(%rdi)
0000000000000017        movq    %rdx, 0x10(%rdi)
000000000000001b        movq    %rcx, 0x8(%rdi)
000000000000001f        movq    %rax, (%rdi)
0000000000000022        popq    %rbp
0000000000000023        retq

Quuxplusone commented 9 years ago

To be clear, the ideal code in this case should probably be something like:

... frame stuff ...
vmovdqu (%rsi),%ymm0
vmovdqu %ymm0,(%rdi)
... frame stuff ...

Quuxplusone commented 9 years ago

There's a hack in X86ISelLowering.cpp around line 23927 that will likely
prevent 32-byte memops (but not 16-byte memops) for btver2:

   // On Sandybridge unaligned 256bit loads are inefficient.

The hack is that instead of adding a feature flag for slow 256-bit unaligned
load/store and using that only for SandyBridge, all AVX but non-AVX2 CPUs get
the treatment via:

   if (...!Subtarget->hasInt256())

I haven't measured unaligned 256-bit memops on btver2, so this might be the
right way to go anyway if it turns out those are slow, but we should still fix
that hack. It's never good to use a CPU model as a proxy for some bit of micro-
arch specialization.

Quuxplusone commented 9 years ago

Hi Sean,

On 'btver2', if you want to be able to generate unaligned vector loads/store
for memset/memmove, you have to add feature 'FeatureFastUAMem' to the
corresponding processor definition in x86.td.
Jaguar (and all the amdfam15) is expected to have slow unaligned memory
accesses.

This would work for memmove with fixed size. memset, is only expanded in the
case where value 0 is set; for that particular case (the most common), you
would get unaligned vector stores.

As a side note: the last time I tried running some micro-benchmarks on my
jaguar linux-box,  I noticed that a sequence of 'movq' actually go faster in
most cases.
So we really have to get accurate measures in this case.

If you add that feature, then you will eventually encounter the problem
reported by Sanjay. There are target specific dag combine rules that split
unaligned 32-byte vector load/store instructions into 16-byte vector load/store
if the target is AVX but not AVX2.
On Jaguar, this may hurt performances; I agree with Sanjay that we should
definitely measure the performances for this particular case.

Quuxplusone commented 9 years ago

Attached memtester.c (5474 bytes, text/plain): memory_test_harness.c

Quuxplusone commented 9 years ago

Attached memloops.s (4072 bytes, text/plain): memloops.s

Quuxplusone commented 9 years ago

Hi all -

The attached files are the source for a test program to try various perf experiments.

As-is the program will copy 10KB of data from a source buffer to a destination buffer and report the best-case bandwidth in GB/s for 100 iterations. This should fit comfortably in any recent x86 L1 AFAIK and hopefully overcome any transient behavior like loading the L1 or TLBs but still be long enough to measure accurately using rdtsc.

The program will iterate over a 32 x 32 set of alignment possibilities for the source and dest. See the header comments in the 'c' file test harness for how to reduce timing variability on Linux.

I welcome any and all suggestions to improve the test program. I'll post my perf results next.

Quuxplusone commented 9 years ago

Using a 1.5GHz Jaguar (HP A4-5000 based PC) running Ubuntu 14.04, here's what I'm seeing:

16-byte aligned source and dest using either 16-byte or 32-byte memops achieve the highest perf: ~33GB/s. This is using test loops 1 (16x4), 2 (16x8), or 3 (32x4). I think the theoretical peak for 1.5GHz Jaguar (16-byte interfaces to L1) would be 48GB/s, so it seems there's ~1 dead cycle for every read/write combo to the L1.
If we misalign either the source or dest buffers from 16-bytes, perf drops to ~20GB/s for test loops 1-3.
There's some quirk happening for test loop 4 (the 32x8 case) that's causing perf to be slightly lower than test loops 1-3. It's getting ~30GB/s on aligned and ~18GB/s on unaligned for me. I don't have an explanation for that yet.
Using any of the 8-byte (movq) test loops 5 - 7 only gives us ~14GB/s regardless of alignment, so there appears to be some other micro-arch bottleneck in play there.

Quuxplusone commented 9 years ago

So based on your data, Sanjay, the 32-byte vmovdqu gets the same perf as 16-byte. Since the 32-byte has less register pressure and code size, the 32-byte seems the best then?

Quuxplusone commented 9 years ago

Hi Sean,

Yes - based on the test program results, I would do 3 things:

Enable "FeatureFastUAMem" for btver2 to allow 16-byte memops. We apparently can't saturate the L1 bus using scalar memory accesses, so perf is much better with vectors. And as you noted, using vectors is better for code size and register pressure.
Rename "FeatureFastUAMem" to "FeatureFastUnalignedMem16". I didn't realize this was strictly a 16-byte vector feature, but based on its usage (grep for "isUnalignedMemAccessFast"), I think that's all it is.
Enable 32-byte unaligned loads for btver2. There is no obvious perf penalty for 32-byte vs. 16-byte unaligned loads using the test program. If #2 above is the right answer, then I would add another feature flag for this such as "FeatureFastUnalignedMem32" and use that in place of the current SandyBridge hack.

But before proceeding with any of that, I'm running some test-suite benchmarks to see if there's any gain or loss from #1 and #3. Andrea also mentioned that a previous experiment showed 8-byte 'movq' ran faster in some cases, so we need to understand that. If there are any internal benchmarks that anyone would like to try, that would also be welcome. There may be some tricky corner cases that just aren't accounted for in micro-benchmarks.

Quuxplusone commented 9 years ago

I have compared the results from running a subset of test-suite's SingleSource
programs on btver2, and I see almost no difference there.

Enabling 16-byte unaligned memops shows almost everything is within ±1% and the
geomean of all test results is within ±0.1%.

Enabling 32-byte unaligned memops results in a similar overall geomean, but
there are 2 clear winners:
Benchmarks/Linpack/linpack-pc
Benchmarks/Misc-C++/stepanov_v1p2

Both of those have tight loops of vector FP math dominating the profile, and in
both cases using 32-byte memops results in much better codegen for those loops.
When using 16-byte memops we get a bunch of extra vinsertf128/vextractf128
instructions in the hot loops. The perf win with 32-byte memops is about 7% on
each of those tests.

I know of one other clear winner from using 32-byte vs. 16-byte memops:
https://github.com/tycho/nbody

The "CPU_SOA_tiled" flavor of the test is about 20% slower on btver2 when using
16-byte memops compared to 32-byte memops.

Unless we can find some testcase that regresses with the wider memops enabled,
I think we should turn that codegen on for btver2.

Quuxplusone commented 9 years ago

Hi Sanjay,

Really sorry for this late reply..

Thanks for sharing your benchmark code!

A couple of comments related to your test:

1) feature flag FastUAMem will also affect memsets of constant size. So, we
should also verify memset loops.

2) Your benchmark only returns the best results. In our experiments, there is
always a variance in the results and it is interesting to know the average
throughput (and also the worse throughput - just to have an idea about the
variance).

That said, we can reproduce very similar results on our Linux Jaguar box. That
means, "warm" memcpy loops with unaligned vector stores are faster.
What I am worried about are "cold" memcpy loops. The reason why I said that
loops of movq may be slower in some cases is because unaligned vector
loads/stores seems particularly slow when the value is not in L1.

We have an internal benchmark (which we can share with you of course, it is
nothing special), that tests constant memset/memcpy's of various sizes using
data that is not in cache (the reason for the benchmark is that we were
interested in testing some other optimisations).

We added some of your test functions to our benchmark, and here are the
results.  As this was compiled with FastUAMem on, the memcpy's <= 256 are
inline vmovups.

CONSTANT               memcpy Value: N/A Size:    64 Took:  131948209 VALID
CONSTANT               memcpy Value: N/A Size:   128 Took:  132550393 VALID
CONSTANT               memcpy Value: N/A Size:   256 Took:  132165042 VALID
CONSTANT               memcpy Value: N/A Size:   512 Took:  157631110 VALID
CONSTANT               memcpy Value: N/A Size:  1024 Took:  155462556 VALID
CONSTANT               memcpy Value: N/A Size:  2048 Took:  141327448 VALID
CONSTANT               memcpy Value: N/A Size:  4096 Took:  138745694 VALID
CONSTANT               memcpy Value: N/A Size:  8192 Took:  136517355 VALID
CONSTANT               memcpy Value: N/A Size: 10240 Took:  135342292 VALID
CONSTANT   copy_unaligned_8x8 Value: N/A Size:    64 Took:   76494362 VALID
CONSTANT   copy_unaligned_8x8 Value: N/A Size:   128 Took:   79578415 VALID
CONSTANT   copy_unaligned_8x8 Value: N/A Size:   256 Took:   80829682 VALID
CONSTANT   copy_unaligned_8x8 Value: N/A Size:   512 Took:   81766766 VALID
CONSTANT   copy_unaligned_8x8 Value: N/A Size:  1024 Took:  141564112 VALID
CONSTANT   copy_unaligned_8x8 Value: N/A Size:  2048 Took:  139068987 VALID
CONSTANT   copy_unaligned_8x8 Value: N/A Size:  4096 Took:  138993287 VALID
CONSTANT   copy_unaligned_8x8 Value: N/A Size:  8192 Took:  139679548 VALID
CONSTANT   copy_unaligned_8x8 Value: N/A Size: 10240 Took:  139760783 VALID
CONSTANT  copy_unaligned_16x4 Value: N/A Size:    64 Took:   78060565 VALID
CONSTANT  copy_unaligned_16x4 Value: N/A Size:   128 Took:  134957429 VALID
CONSTANT  copy_unaligned_16x4 Value: N/A Size:   256 Took:  134487626 VALID
CONSTANT  copy_unaligned_16x4 Value: N/A Size:   512 Took:  134463776 VALID
CONSTANT  copy_unaligned_16x4 Value: N/A Size:  1024 Took:  134063460 VALID
CONSTANT  copy_unaligned_16x4 Value: N/A Size:  2048 Took:  134021445 VALID
CONSTANT  copy_unaligned_16x4 Value: N/A Size:  4096 Took:  135024212 VALID
CONSTANT  copy_unaligned_16x4 Value: N/A Size:  8192 Took:  133869065 VALID
CONSTANT  copy_unaligned_16x4 Value: N/A Size: 10240 Took:  133250724 VALID

0/27 TESTS INVALID
PASSED

The interesting results are in the case of 'copy_unaligned_8x8' (loop of movq)
which is faster in the range 128-512.  The test is simply the number of cycles
taken to copy a buffer of 64MB using a memcpy of size X (10 iterations).

These are instead the performance numbers from Spec CPU2006 (-O2 -march=btver2).

With feature FastUAMem:

---------------+-------+---------+
Benchmark      | ratio | runtime |
---------------+-------+---------+
464.h264ref    | 16.27 | 1360.52 |
462.libquantum | 21.19 |  977.95 | +0.89% faster
401.bzip2      |  8.17 | 1180.64 | +1.57% faster
403.gcc        |  8.52 |  944.76 |
470.lbm        | 16.54 |  830.88 |
453.povray     | 10.39 |  511.98 | +5.28% faster
450.soplex     | 10.88 |  766.24 | -1.31% slower
444.namd       |  9.49 |  845.18 |
483.xalancbmk  |  8.98 |  767.97 |
473.astar      |  5.63 | 1247.08 |
471.omnetpp    |  5.16 | 1211.53 | +2.83% faster
458.sjeng      |  9.46 | 1279.71 |
456.hmmer      | 13.09 |  712.90 |
445.gobmk      |  9.33 | 1124.45 |
429.mcf        |  4.96 | 1840.51 |
482.sphinx3    | 12.56 | 1551.51 |
433.milc       |  9.27 |  990.07 |
---------------+-------+---------+

Without feature FastUAMem:

---------------+-------+---------+
Benchmark      | ratio | runtime |
---------------+-------+---------+
464.h264ref    | 16.26 | 1360.88 |
462.libquantum | 21.00 |  986.72 |
401.bzip2      |  8.05 | 1199.24 |
403.gcc        |  8.57 |  939.36 |
470.lbm        | 16.52 |  831.62 |
453.povray     |  9.87 |  539.05 |
450.soplex     | 11.03 |  756.28 |
444.namd       |  9.49 |  844.73 |
483.xalancbmk  |  9.02 |  764.99 |
473.astar      |  5.65 | 1243.34 |
471.omnetpp    |  5.02 | 1245.86 |
458.sjeng      |  9.51 | 1271.92 |
456.hmmer      | 13.10 |  712.44 |
445.gobmk      |  9.21 | 1139.52 |
429.mcf        |  4.99 | 1826.35 |
482.sphinx3    | 12.55 | 1553.18 |
433.milc       |  9.30 |  987.11 |
---------------+-------+---------+

So, povray and omnetpp are faster if we enable that feature. The other
performance differences are mostly in the noise.

Quuxplusone commented 9 years ago

Hi Andrea,

Thanks for posting the results! There are certainly some mysteries in the memory hierarchy of btver2. :)

I would be very interested in trying your benchmark locally to see if I can pinpoint that perf difference in the movq loop. Can you post it or mail it to me?

Quuxplusone commented 9 years ago

(In reply to comment #12)
> Hi Andrea,
>
> Thanks for posting the results! There are certainly some mysteries in the
> memory hierarchy of btver2. :)

I definitely agree.

>
> I would be very interested in trying your benchmark locally to see if I can
> pinpoint that perf difference in the movq loop. Can you post it or mail it
> to me?

Sure, no problem.
Would it be ok if I share that code tomorrow? (it is quite late here and I was
about to leave the office...).

Quuxplusone commented 9 years ago

(In reply to comment #13)
> Would it be ok if I share that code tomorrow? (it is quite late here and I
> was about to leave the office...).

Sure - whenever you have a chance. Thanks!

Quuxplusone commented 9 years ago

For the record: Andrea, Simon, Rob, and I discussed the memcpy results in
comment 11. It turns out the perf variation was due to Linux rather than any
difference in 8-byte ops vs. 16-byte ops. Linux does not scale the CPUs up to
max frequency even when a process is demanding 100% CPU, so you have to force
it to get reliable perf numbers. We confirmed this on 2 different btver2
machines.

The 32-byte part of the bug is fixed with:
http://llvm.org/viewvc/llvm-project?view=revision&revision=222544

The 16-byte part of the bug is up for review here:
http://reviews.llvm.org/D6360

Quuxplusone commented 9 years ago

16-byte codegen for btver2 fixed with:
http://llvm.org/viewvc/llvm-project?view=revision&revision=222925

For the original code example in this bug report using clang built from
r223054, we now generate:

$ ./clang -O3 -fomit-frame-pointer -march=btver2 -c 21541.c -S -o -
    .section    __TEXT,__text,regular,pure_instructions
    .macosx_version_min 10, 10
    .globl  _copy32byte
    .align  4, 0x90
_copy32byte:                            ## @copy32byte
    .cfi_startproc
## BB#0:                                ## %entry
    vmovups (%rsi), %ymm0
    vmovups %ymm0, (%rdi)
    vzeroupper
    retq

------------------------------------------------------------------------

Resolving as fixed since we're using 32-byte memops now.

I've seen some codegen variability between "vmovups" and "vmovdqu" that I can't
explain yet. I don't think there will be any perf difference between those 2
insts for a simple copy based on my testing or the docs, but if there is, we
should open a new bug.

Quuxplusone / LLVMBugzillaTest

poor codegen for unaligned fixed-size memcpy/memmove #21540