Quuxplusone / LLVMBugzillaTest


LLVM misses reciprocal estimate instructions in ISel on ARMv7/8 #27106

Open Quuxplusone opened 8 years ago

Quuxplusone commented 8 years ago
Bugzilla Link PR27107
Status NEW
Importance P normal
Reported by Steven Noonan (steven@uplinklabs.net)
Reported on 2016-03-28 23:08:58 -0700
Last modified on 2018-10-20 04:43:47 -0700
Version trunk
Hardware PC All
CC evandro.menezes@sifive.com, james@jamesmolloy.co.uk, llvm-bugs@lists.llvm.org, rengolin@gmail.com, spatel+llvm@rotateright.com
Fixed by commit(s)
Attachments arm-nbody.txt (6064 bytes, text/plain)
aarch64-nbody.txt (5885 bytes, text/plain)
Blocks
Blocked by
See also PR20900, PR21385
LLVM is missing the opportunity to use VRSQRTE/VRSQRTS (and the other
reciprocal estimate instructions):

$ cat rsqrt.c
#include <math.h>

float rsqrtf(float f)
{
        return 1.0f / sqrtf(f);
}

$ clang -O3 -mcpu=native -mfpu=neon -mfloat-abi=hard -ffast-math -S -o - rsqrt.c | showasm

rsqrtf:                                 @ @rsqrtf
        vsqrt.f32       s0, s0
        vmov.f32        s2, #1.000000e+00
        vdiv.f32        s0, s2, s0
        bx      lr

Conversely, on x86_64, LLVM does the right thing:

$ clang -O3 -march=core-avx2 -ffast-math -S -o - rsqrt.c | showasm
rsqrtf:                                 # @rsqrtf
        vrsqrtss        %xmm0, %xmm0, %xmm1
        vmulss  %xmm1, %xmm1, %xmm2
        vfmadd213ss     .LCPI1_0(%rip), %xmm0, %xmm2
        vmulss  .LCPI1_1(%rip), %xmm1, %xmm0
        vmulss  %xmm0, %xmm2, %xmm0
        retq

It will even apply this properly to packed vectors if the inputs make sense for
it.
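
For reference, that x86_64 sequence is the usual single Newton-Raphson refinement of the hardware estimate; a scalar C sketch of the refinement step (the 1.5/0.5 constants are the standard ones and are assumed here, since the .LCPI values aren't shown above):

/* One Newton-Raphson step on a reciprocal-sqrt estimate e0 of input f:
   e1 = e0 * (1.5 - 0.5 * f * e0 * e0). */
static float refine_rsqrt(float f, float e0)
{
    return e0 * (1.5f - 0.5f * f * e0 * e0);
}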

Right now the lack of reciprocal square root estimates on ARM breaks auto-
vectorization for a silly program I wrote, and the hand-written NEON intrinsics
version is beating the auto-vectorized variants (because the auto-vectorization
fails and everything gets bottlenecked around vsqrt+vdiv).

I looked at implementing this myself but got confused trying to understand
tablegen syntax. It looks like there just needs to be an ARMTargetLowering
implementation for TargetLowering::getRsqrtEstimate and
TargetLowering::getRecipEstimate.
Quuxplusone commented 8 years ago
Hi Steven,

Thanks for raising this. Actually, just emitting the rsqrt instructions isn't
the hard part of this. The hard part is deciding when they are profitable.

For a single rsqrtf outside of a loop (such as your rsqrtf() function) it is
almost always more efficient to use the sqrt instruction. Note that our
lowering isn't very good at the moment: we're doing:

  1.0 / sqrt(x)

What we should be doing is this:

  (1.0 / x) * sqrt(x)

(because 1/sqrt(x) == sqrt(x) / x). This would make the division and square
root doable in parallel.
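
A sketch of the two formulations (only valid under -ffast-math, since the rewrite changes rounding):

#include <math.h>

/* Serial form: the divide has to wait for the sqrt. */
float rsqrt_serial(float x)   { return 1.0f / sqrtf(x); }

/* Rewritten form: the divide and the sqrt are independent,
   so they can execute in parallel. */
float rsqrt_parallel(float x) { return (1.0f / x) * sqrtf(x); }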

Now, rsqrt[es]'s advantage is that they don't block the pipeline (because they
are not iterative instructions), so throughput is increased. However, latency
is also increased as more instructions have to be executed. So swapping a sqrt
for a rsqrt[es] sequence will often make code *slower* as in most cases sqrts,
rsqrts and divisions are on the critical path.

The main case where rsqrt will give a heavy speedup is in a loop where there
are no dependencies on the result of the sqrt. For example a trivial loop where
we calculate the sqrt of all members of an array independently:

  for (i : ...) {
    b[i] = rsqrt(a[i]); // Fast with rsqrte
  }

Because it doesn't matter how long the rsqrt takes - there's nothing dependent
on its result. Throughput is key in this loop.
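
For comparison, this is roughly what a hand-written NEON body for such a loop looks like with the VRSQRTE estimate plus VRSQRTS refinement steps; a sketch using the arm_neon.h intrinsics (two refinement steps are assumed here to get close to single-precision accuracy):

#include <arm_neon.h>

/* Approximate 1/sqrt(x) for four lanes: VRSQRTE produces the initial
   estimate, and each VRSQRTS computes (3 - a*b)/2 for one
   Newton-Raphson refinement step. */
static inline float32x4_t rsqrt_neon(float32x4_t x)
{
    float32x4_t e = vrsqrteq_f32(x);
    e = vmulq_f32(e, vrsqrtsq_f32(vmulq_f32(x, e), e));
    e = vmulq_f32(e, vrsqrtsq_f32(vmulq_f32(x, e), e));
    return e;
}

This is the kind of sequence that would replace the vsqrt+vdiv pair, at the cost of more instructions on the critical path.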

The only example I've seen in practice of this kind of loop is vector
normalization:

  for (v : all_vectors) {
    vector_magnitude[v] = sqrt(a[v].x * a[v].x + a[v].y * a[v].y + a[v].z * a[v].z);
  }

I haven't seen any other examples in the wild where rsqrt would unequivocally
improve performance.

I'm interested: other than your trivial testcase, have you seen code where
rsqrt will improve performance? Would you mind sharing that?

The other issue is that the tradeoff between sqrt and rsqrt depends heavily on
the microarchitecture, which means a lot of benchmarking.

If we wanted to do this, we'd probably be best doing it as a MachineCombiner
peephole, as this is a critical path / resource height issue.

Cheers,

James
Quuxplusone commented 8 years ago
Hi James,

Thanks for your detailed reply!

I've heard the latency/throughput arguments about reciprocals before, and I
understand it's not simple to evaluate the costs/benefits. There is already a
facility to allow developers to opt in to specific fastmath reciprocal
estimates: -mrecip. Right now it just doesn't do anything on ARM. Making the
compiler capable of emitting the reciprocal estimate instructions seems
orthogonal to making it the default behavior, and the latter question can be
evaluated more carefully later down the line. :)

And to answer your question, no I don't have a specific "real" program that
would benefit from it. I'm sure there are real programs that do vector
normalization, but I'm not aware of them (other than Quake, hah).

The test case in question is a physics simulation (n-body). You can find the
sources here, if you'd like to try it out (build with 'make CC="clang"
NO_CUDA=1' or similar):

https://github.com/tycho/nbody

It relies on OpenMP for parallelism (optionally, disabled with NO_OPENMP=1) and
on compiler auto-vectorization for per-thread throughput.

Right now the non-intrinsic variants are horribly slow (Cortex-A15 used in this
example):

$ ./nbody --bodies 32 --no-crosscheck --iterations 1 --cycle-after 3
Running simulation with 32768 particles, crosscheck disabled, CPU enabled, 4
threads
      CPU_SOA:  4887.12 ms =  219.708x10^6 interactions/s (     4.39 GFLOPS)
      CPU_SOA:  4787.85 ms =  224.264x10^6 interactions/s (     4.49 GFLOPS)
      CPU_SOA:  4782.07 ms =  224.535x10^6 interactions/s (     4.49 GFLOPS)
CPU_SOA_tiled:  4689.76 ms =  228.954x10^6 interactions/s (     4.58 GFLOPS)
CPU_SOA_tiled:  4695.57 ms =  228.671x10^6 interactions/s (     4.57 GFLOPS)
CPU_SOA_tiled:  4692.97 ms =  228.798x10^6 interactions/s (     4.58 GFLOPS)
  NEON intrin:  1637.50 ms =  655.721x10^6 interactions/s (    13.11 GFLOPS)
  NEON intrin:  1607.94 ms =  667.774x10^6 interactions/s (    13.36 GFLOPS)
  NEON intrin:  1615.91 ms =  664.481x10^6 interactions/s (    13.29 GFLOPS)
      CPU_AOS:  5049.33 ms =  212.650x10^6 interactions/s (     4.25 GFLOPS)
      CPU_AOS:  5047.26 ms =  212.738x10^6 interactions/s (     4.25 GFLOPS)
      CPU_AOS:  5047.67 ms =  212.720x10^6 interactions/s (     4.25 GFLOPS)
CPU_AOS_tiled:  5105.87 ms =  210.296x10^6 interactions/s (     4.21 GFLOPS)
CPU_AOS_tiled:  5110.79 ms =  210.093x10^6 interactions/s (     4.20 GFLOPS)
CPU_AOS_tiled:  5108.08 ms =  210.205x10^6 interactions/s (     4.20 GFLOPS)

The bottleneck in the non-NEON implementations is this:

static inline float rsqrtf(float f)
{
    return 1.0f / sqrtf(f);
}

static void bodyBodyInteraction(
    float *fx, float *fy, float *fz,
    float x0, float y0, float z0,
    float x1, float y1, float z1, float mass1,
    float softeningSquared)
{
    float dx = x1 - x0;
    float dy = y1 - y0;
    float dz = z1 - z0;

    float distSqr = dx*dx + dy*dy + dz*dz;
    distSqr += softeningSquared;

    float invDist = rsqrtf(distSqr); // prevents auto-vectorization from doing its job

    float invDistCube =  invDist * invDist * invDist;
    float s = mass1 * invDistCube;

    *fx = dx * s;
    *fy = dy * s;
    *fz = dz * s;
}

It's interesting to note that Clang's codegen for SOA/SOA_tiled outperforms the
AVX intrinsics variant on x86_64:

$ ./nbody --bodies 128 --no-crosscheck --iterations 1 --cycle-after 3
Running simulation with 131072 particles, crosscheck disabled, CPU enabled, 32
threads
      CPU_SOA:   759.61 ms =   22.617x10^9 interactions/s (   452.33 GFLOPS)
      CPU_SOA:   748.76 ms =   22.944x10^9 interactions/s (   458.89 GFLOPS)
      CPU_SOA:   756.07 ms =   22.723x10^9 interactions/s (   454.45 GFLOPS)
CPU_SOA_tiled:   747.42 ms =   22.986x10^9 interactions/s (   459.71 GFLOPS)
CPU_SOA_tiled:   747.76 ms =   22.975x10^9 interactions/s (   459.50 GFLOPS)
CPU_SOA_tiled:   750.04 ms =   22.905x10^9 interactions/s (   458.11 GFLOPS)
   AVX intrin:   856.90 ms =   20.049x10^9 interactions/s (   400.98 GFLOPS)
   AVX intrin:   851.30 ms =   20.181x10^9 interactions/s (   403.62 GFLOPS)
   AVX intrin:   856.85 ms =   20.050x10^9 interactions/s (   401.00 GFLOPS)
      CPU_AOS:  2195.24 ms =    7.826x10^9 interactions/s (   156.52 GFLOPS)
      CPU_AOS:  2195.59 ms =    7.825x10^9 interactions/s (   156.49 GFLOPS)
      CPU_AOS:  2218.18 ms =    7.745x10^9 interactions/s (   154.90 GFLOPS)
CPU_AOS_tiled:  2099.47 ms =    8.183x10^9 interactions/s (   163.66 GFLOPS)
CPU_AOS_tiled:  2099.55 ms =    8.183x10^9 interactions/s (   163.65 GFLOPS)
CPU_AOS_tiled:  2310.61 ms =    7.435x10^9 interactions/s (   148.70 GFLOPS)
Quuxplusone commented 8 years ago
For reference, linking the x86 recip/rsqrt codegen bugs:
bug 20900
bug 21385
Quuxplusone commented 8 years ago

Evandro, this is related to your reverted patch, and Jojo was having a look, too. I think we should sync to make sure this works without breaking LTO.

We can try this on ARM while you try on AArch64, or we can both try on different AArch64 hardware first, then port down to ARM, whatever makes the collaboration easier.

We have access to AMD (A57), APM (XGene) and HiKeys (A53). James has some Junos that we could maybe run some tests on later.

Quuxplusone commented 8 years ago
Another link for reference:
https://reviews.llvm.org/D24816

I've proposed to make reciprocal estimate codegen a function-level attribute as
a work-around for the LTO problem.

There are a lot of different topics intertwined in that discussion including
the backend's reliance on global settings. Depending on how you set the target
defaults for ARM and/or AArch64, the -mrecip interface between clang and LLVM
may or may not be relevant.
Quuxplusone commented 8 years ago

Attached arm-nbody.txt (6064 bytes, text/plain): nbody on ARM

Quuxplusone commented 8 years ago

Attached aarch64-nbody.txt (5885 bytes, text/plain): nbody on AArch64

Quuxplusone commented 8 years ago
Status update:

1. Plan for this task:
   a. Try to emit rsqrt.
   b. Benchmark the two strategies.
   c. Make the final decision.

2. Some preliminary tests:
   a. Running nbody on ARM (dragon) based on this patch:
      https://reviews.llvm.org/D24911 (It's basically the same as Evandro's patch; not saying it's the final patch.)
   b. Running nbody on AArch64 (amd) based on Evandro's patch.

  In the tests, I only enabled support for operations on one specific type at a time. For now, we see performance improvements only for v4f32, on both ARM and AArch64, plus an improvement for f32 in "CPU_SOA_tiled" on ARM. The details can be seen in the attachments.

No definite conclusion.
TODO: More investigation.