Quuxplusone / LLVMBugzillaTest

0 stars 0 forks source link

optimize reciprocals with fast-math (x86) #21384

Open Quuxplusone opened 9 years ago

Quuxplusone commented 9 years ago
Bugzilla Link PR21385
Status REOPENED
Importance P normal
Reported by Sanjay Patel (spatel+llvm@rotateright.com)
Reported on 2014-10-25 17:57:27 -0700
Last modified on 2018-05-05 08:41:57 -0700
Version trunk
Hardware PC All
CC hfinkel@anl.gov, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, rnk@google.com, steven@uplinklabs.net
Fixed by commit(s)
Attachments recip_accuracy_tester.c (3362 bytes, text/plain)
rsqrt_accuracy_tester.c (3255 bytes, text/plain)
sqrt_accuracy_tester.c (3588 bytes, text/x-csrc)
Blocks
Blocked by
See also PR20900, PR24063, PR27107, PR37343
This is the sibling to bug 20900 (square root estimates).

We could be generating RCPSS / RCPPS instead of divisions when -ffast-math
allows it.

I thought this would be simpler than square root, but we'll need to again
measure the performance gains per CPU to decide where to enable it, and we may
need to offer a 2nd Newton-Raphson refinement implementation for x86 that lacks
FMA.

I've also written a test program (below) that tries to confirm the goodness of
the estimate. It shows that one N-R step is inadequate to get us to 1 ULP of
the IEEE result on both Intel Sandy Bridge and AMD Jaguar.

Most unfortunately, we get the wrong answer for a reciprocal of 1.0f! This is
why RCPPS isn't generated with gcc -ffast-math.

Both Intel and AMD manuals say that the estimate instruction provides almost 12-
bits of accuracy, and one N-R step should double that accuracy to 24-bits which
is good enough for a single-precision float, but no...that "almost 12-bits" is
definitely not good enough. We need two N-R steps.

Test program:

#include <xmmintrin.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
float recip_ieee(float x) { return 1.0f/x; }

float recip_est(float x) {
    float est;
    __m128 vecX = { x,0,0,0 };
    __m128 vecEst = _mm_rcp_ss(vecX);
        _mm_store_ss(&est, vecEst);
    return est;
}

float recip_est_NR1(float x) {
    float est;
    __m128 vecX = { x,0,0,0 };
    __m128 vecEst = _mm_rcp_ss(vecX);
        _mm_store_ss(&est, vecEst);

    est = est + est * (1 - est * x);
    return est;
}

float recip_est_NR2(float x) {
    float est;
    __m128 vecX = { x,0,0,0 };
    __m128 vecEst = _mm_rcp_ss(vecX);
        _mm_store_ss(&est, vecEst);

    est = est + est * (1 - est * x);
    est = est + est * (1 - est * x);
    return est;
}

int main() {
    int i;
    unsigned int tests = 0;
    unsigned int not_exact_NR1 = 0;
    unsigned int not_exact_NR2 = 0;
    unsigned int bad_NR1 = 0;
    unsigned int bad_NR2 = 0;
    // ignore denorm inputs (anything less than 0x00800000)
    // ignore denorm outputs (any input over 0x7e800000)
    // ignore inf (0x7f800000)
    // ignore nans (anything more than 0x7f800000)
    // ignore negative (anything more than 0x80000000)
    for (i = 0x00800000; i < 0x7e800000; i++) {
        float f, ieee, est, est_nr1, est_nr2;
        int ieee_int, est_int, est_nr1_int, est_nr2_int;

        tests++;
        memcpy(&f, &i, 4);
        ieee = recip_ieee(f);
//      est = recip_est(f);
        est_nr1 = recip_est_NR1(f);
        est_nr2 = recip_est_NR2(f);
        memcpy(&ieee_int, &ieee, 4);
//      memcpy(&est_int, &est, 4);
        memcpy(&est_nr1_int, &est_nr1, 4);
        memcpy(&est_nr2_int, &est_nr2, 4);

//      if (i % 0x01000000 == 0) printf("0x%08x\n", i); // just a status print...

        if (ieee != est_nr1) {

            not_exact_NR1++;
            if (abs(ieee_int - est_nr1_int) > 1) {
                bad_NR1++;
//              printf("0x%08x: %e, ieee = %e (0x%08x), est_nr1 = %e (0x%08x), est_nr2 =
%e (0x%08x)\n",
//                   i, f, ieee, ieee_int, est_nr1, est_nr1_int, est_nr2, est_nr2_int);
            }
        }
        if (ieee != est_nr2) {
            not_exact_NR2++;
            if (abs(ieee_int - est_nr2_int) > 1) {
                bad_NR2++;
            }
        }
    }
    printf("Total tests = %d\n", tests);
    printf("Inexact results with one N-R step = %d\n", not_exact_NR1++);
    printf("Results off by > 1 ULP with one N-R step = %d\n", bad_NR1);
    printf("Inexact results with two N-R steps = %d\n", not_exact_NR2++);
    printf("Results off by > 1 ULP with two N-R steps = %d\n", bad_NR2);
    return 0;
}
Quuxplusone commented 9 years ago
Test program output on Sandy Bridge:

Total tests = 2113929216
Inexact results with one N-R step = 657670214
Results off by > 1 ULP with one N-R step = 10207460
Inexact results with two N-R steps = 353480402
Results off by > 1 ULP with two N-R steps = 0
Quuxplusone commented 9 years ago
Test program output on Jaguar:

Total tests = 2113929216
Inexact results with one N-R step = 646439942
Results off by > 1 ULP with one N-R step = 5167130
Inexact results with two N-R steps = 374437446
Results off by > 1 ULP with two N-R steps = 0
Quuxplusone commented 9 years ago
On second thought...I wrote a similar test loop for the rsqrt estimate, and we
don't get within 1 ULP in many cases, but that estimate is apparently generally
accepted.

This leads to the more general question: how accurate does fast-math need to be
relative to IEEE math?

The real reason that gcc differs on rsqrt and recip optimizations has to do
with passing a SPEC test rather than accuracy:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55760#c1
Quuxplusone commented 9 years ago
(In reply to comment #3)
> On second thought...I wrote a similar test loop for the rsqrt estimate, and
> we don't get within 1 ULP in many cases, but that estimate is apparently
> generally accepted.

For "fast math" we can probably do with 1-2 ulps. This, of course, is also
really application dependent.

>
> This leads to the more general question: how accurate does fast-math need to
> be relative to IEEE math?
>
> The real reason that gcc differs on rsqrt and recip optimizations has to do
> with passing a SPEC test rather than accuracy:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55760#c1

We should likely make this configurable (at least via a command-line option).
How much does the performance change if you use two?
Quuxplusone commented 9 years ago

Attached recip_accuracy_tester.c (3362 bytes, text/plain): recip_accuracy_tester.c

Quuxplusone commented 9 years ago

Attached rsqrt_accuracy_tester.c (3255 bytes, text/plain): rsqrt_accuracy_tester.c

Quuxplusone commented 9 years ago
Before getting to the perf question, I want to make sure we understand the
accuracy angle.

Here are the results from the attached test loops for recip and rsqrt running
on Sandy Bridge:

Reciprocal estimate:

Total tests = 2113929216

Inexact results with one N-R step = 657670214
   Estimate missed by 1 ULP: 647462754
   Estimate missed by 2 ULP: 10186690
   Estimate missed by 3 ULP: 20770
   Estimate missed by 4 ULP: 0
Estimate missed by >= 5 ULP with one N-R step = 0

Inexact results with two N-R steps = 353480402
   Estimate missed by 1 ULP: 353480402
   Estimate missed by 2 ULP: 0
   Estimate missed by 3 ULP: 0
   Estimate missed by 4 ULP: 0
Estimate missed by >= 5 ULP with two N-R steps = 0

Rsqrt estimate:

Total tests = 2130706432

Inexact results with one N-R step = 886838505
   Estimate missed by 1 ULP: 830844699
   Estimate missed by 2 ULP: 53541827
   Estimate missed by 3 ULP: 2399100
   Estimate missed by 4 ULP: 52879
Estimate missed by >= 5 ULP with one N-R step = 0

Inexact results with two N-R steps = 758708881
   Estimate missed by 1 ULP: 742129840
   Estimate missed by 2 ULP: 16578375
   Estimate missed by 3 ULP: 666
   Estimate missed by 4 ULP: 0
Estimate missed by >= 5 ULP with two N-R steps = 0
Quuxplusone commented 9 years ago
And the same programs running on Jaguar:

Reciprocal estimate:

Total tests = 2113929216

Inexact results with one N-R step = 646439942
   Estimate missed by 1 ULP: 641272812
   Estimate missed by 2 ULP: 5161649
   Estimate missed by 3 ULP: 5481
   Estimate missed by 4 ULP: 0
Estimate missed by >= 5 ULP with one N-R step = 0

Inexact results with two N-R steps = 374437446
   Estimate missed by 1 ULP: 374437446
   Estimate missed by 2 ULP: 0
   Estimate missed by 3 ULP: 0
   Estimate missed by 4 ULP: 0
Estimate missed by >= 5 ULP with two N-R steps = 0

Rsqrt estimate:

Total tests = 2130706432

Inexact results with one N-R step = 793771474
   Estimate missed by 1 ULP: 773089565
   Estimate missed by 2 ULP: 20629077
   Estimate missed by 3 ULP: 52832
   Estimate missed by 4 ULP: 0
Estimate missed by >= 5 ULP with one N-R step = 0

Inexact results with two N-R steps = 723046956
   Estimate missed by 1 ULP: 714982537
   Estimate missed by 2 ULP: 8064404
   Estimate missed by 3 ULP: 15
   Estimate missed by 4 ULP: 0
Estimate missed by >= 5 ULP with two N-R steps = 0
Quuxplusone commented 9 years ago

So Jaguar's estimate function looks better, particularly for rsqrt. It doesn't miss either estimate by more than 3 ULPs using one refinement step. Sandy Bridge only gets to within 4 ULPs on rsqrt for some cases.

I think the good news is that at least we're within 4 ULPs in all cases.

If we judge the recip estimate purely on this accuracy metric, the single refinement result for it is better than what we (and other compilers) already accept as good enough for rsqrt (where we use just one refinement step).

Quuxplusone commented 9 years ago
There are at least two additional variables that we may want to consider:

1. The implementation of the refinement algorithm matters. For example, we can
save an fmul on the reciprocal N-R step by coding the N-R equation as:

              est1 = est0 * (2.0f - est0 * f);

instead of:
              est1 = est0 + est0 * (1.0f - est0 * f);

But this leads to a less accurate final answer according to:
http://en.wikipedia.org/wiki/Division_algorithm#Fast_division_methods

And testing on Jaguar appears to confirm that:

Total tests = 2113929216

Inexact results with one N-R step = 833587524
   Estimate missed by 1 ULP: 819988344
   Estimate missed by 2 ULP: 13579776
   Estimate missed by 3 ULP: 19404
   Estimate missed by 4 ULP: 0
Estimate missed by >= 5 ULP with one N-R step = 0

Inexact results with two N-R steps = 617003604
   Estimate missed by 1 ULP: 616978404
   Estimate missed by 2 ULP: 25200
   Estimate missed by 3 ULP: 0
   Estimate missed by 4 ULP: 0
Estimate missed by >= 5 ULP with two N-R steps = 0

The total misses are much higher, but we actually have slightly fewer 3 ULP
misses with the single refinement step. Probably not worth saving a flop for
that?

2. Everything should be retested on a machine (like Haswell) that has FMA. The
added precision of FMA should mean better results, and it's possible that the
initial estimate has improved too (certainly hope it hasn't gotten worse).
Quuxplusone commented 9 years ago
(In reply to comment #10)
> The total misses are much higher, but we actually have slightly fewer 3 ULP
> misses with the single refinement step. Probably not worth saving a flop for
> that?

Disregard - too many data points! I was looking at the Sandy Bridge number
(20770) rather than the Jaguar number (5481). Using the longer equation appears
to produce better results in all head-to-head categories for Jaguar.
Quuxplusone commented 9 years ago
Now back to the perf question. Let me focus just on Jaguar for the moment
because we'll want to enable this per-CPU for x86 as we did with rsqrt and
having a feature flag for this codegen provides an easy opt-out or -in for
users.

The divss instruction (IEEE-compliant) takes 14 cycles (not pipelined of
course).
The rcpss instruction alone takes 2 cycles.
The rcpss inst + one NR step using mul/sub/mul/add takes 12 cycles.
The rcpss inst + two NR steps using the same mul/sub/mul/add sequence takes 22
cycles.

The breakdown of the refinement cost is: 2 cycles for mul, 3 cycles for add/sub
(how this chip can multiply faster than add is a different story :) ). For the
4 instruction N-R implementation, all ops are pipelined but dependent, so we
just add up the individual cycle counts.

So we have a clear win for using a single N-R step with -ffast-math: it's
faster in raw latency and 14x faster for throughput because all instructions
are pipelined. Using 2 N-R steps means we have to weigh the latency vs.
throughput scenarios.
Quuxplusone commented 9 years ago
The final question - at least in my mind - is: what is the output around 1.0f?
gcc cited that as part of the reason for not enabling recip with -ffast-math,
but that doesn't match the internal discussion or the actual behavior; they
*do* generate rcpps (vector) with -ffast-math, so not generating rcpss (scalar)
for accuracy reasons does not follow. The 2 instructions should have identical
output on all HW.

input 0x3f7fffff: 9.999999e-01
ieee    = 1.000000e+00 (0x3f800001)
est_nr1 = 1.000000e+00 (0x3f800000)
est_nr2 = 1.000000e+00 (0x3f800000)

input 0x3f800000: 1.000000e+00
ieee    = 1.000000e+00 (0x3f800000)
est_nr1 = 9.999999e-01 (0x3f7fffff)
est_nr2 = 1.000000e+00 (0x3f800000)

input 0x3f800001: 1.000000e+00
ieee    = 9.999999e-01 (0x3f7ffffe)
est_nr1 = 9.999998e-01 (0x3f7ffffd)
est_nr2 = 9.999999e-01 (0x3f7ffffe)

Using one N-R step, we're off by 1 ULP in all 3 cases. The second N-R step
corrects the second and third cases to match the IEEE result.

This is using the 4 op N-R sequence on Jaguar.

So I agree with Hal's statement in comment 4 that we should offer the number of
N-R refinement steps as a parameter to the user. They can weigh the perf vs.
accuracy trade-off per application. I would default to one N-R step.
Quuxplusone commented 9 years ago
Patch proposal for review:
http://reviews.llvm.org/D6175
Quuxplusone commented 9 years ago
Allow rcp* codegen for btver2 checked in with:
http://reviews.llvm.org/rL221706

I still need to expose the number of refinement steps as a command-line option.

We may also want to offer multiple N-R implementations as an option (as we did
for rsqrt).
Quuxplusone commented 9 years ago
For the record, I ran the attached reciprocal test program on Haswell (AVX2),
and it looks like Intel did not alter the estimate function vs. Sandy Bridge.
These are identical results to SB:

$ ./recip_avx2_no_fma
Total tests = 2113929216

Inexact results with one N-R step = 657670214
   Estimate missed by 1 ULP: 647462754
   Estimate missed by 2 ULP: 10186690
   Estimate missed by 3 ULP: 20770
   Estimate missed by 4 ULP: 0
Estimate missed by >= 5 ULP with one N-R step = 0

Inexact results with two N-R steps = 353480402
   Estimate missed by 1 ULP: 353480402
   Estimate missed by 2 ULP: 0
   Estimate missed by 3 ULP: 0
   Estimate missed by 4 ULP: 0
Estimate missed by >= 5 ULP with two N-R steps = 0

But when we use FMA for the N-R refinement (the 4-step implementation just
becomes 2 FMAs), we get much more accurate results:

$ ./recip_avx2_fma
Total tests = 2113929216

Inexact results with one N-R step = 331401168
   Estimate missed by 1 ULP: 330208704
   Estimate missed by 2 ULP: 1192464
   Estimate missed by 3 ULP: 0
   Estimate missed by 4 ULP: 0
Estimate missed by >= 5 ULP with one N-R step = 0

Inexact results with two N-R steps = 252
   Estimate missed by 1 ULP: 252
   Estimate missed by 2 ULP: 0
   Estimate missed by 3 ULP: 0
   Estimate missed by 4 ULP: 0
Estimate missed by >= 5 ULP with two N-R steps = 0
Quuxplusone commented 9 years ago

With FMA3 or FMA4 enabled are the FMA instructions getting used in the fast-math NR iteration automatically or is it something that has to be explicitly set?

Quuxplusone commented 9 years ago
(In reply to comment #17)
> With FMA3 or FMA4 enabled are the FMA instructions getting used in the
> fast-math NR iteration automatically or is it something that has to be
> explicitly set?

FMA should be automatically generated when -ffast-math is used.

For the test program, I made sure I wasn't replacing the actual div
instructions or using any other FP shortcuts by compiling with:

$ clang recip_accuracy_tester.c -O2 -march=core-avx2 -ffp-contract=fast
Quuxplusone commented 9 years ago
Refinement accuracy made configurable via cl::opt:
http://llvm.org/viewvc/llvm-project?view=revision&revision=221819

With that, I think we can resolve this bug as fixed.

The customer can control generation of rcpps/rcpss with "-use-recip-est" and
control the speed vs. accuracy trade-off with "-x86-recip-refinement-steps".
Quuxplusone commented 9 years ago

Is it worth adding a test that checks that FMA instructions are being used on FMA3/FMA4 targets?

Quuxplusone commented 9 years ago

I have no context for any of the floating point precision stuff, but is a backend flag really the right way to expose this? I thought we were about to launch a witch hunt to remove all these flags that create hidden global state.

Are you sure you don't want an instruction, module, or function flag?

Quuxplusone commented 9 years ago

Hi Reid,

I wasn't aware that we were trying to get rid of these, so I was using the existing cl::opt code that I've seen in the vectorizers and backend as a model.

Certainly, if there's a better way, I'll change this. Is there an existing param that you recommend using as a template for this sort of code?

Quuxplusone commented 9 years ago
I'd like to reopen this issue to ask why FeatureUseSqrtEst and
FeatureUseRecipEst were only enabled for Jaguar. It has demonstrable benefits
on many if not all of the other x86 microarchitectures. Is there a reason it
cannot be enabled more broadly?

I'm not sure what GCC's criteria is for enabling it, but I've seen reciprocal
square root estimate enabled on every x86 -march= I know of whenever -ffast-
math is specified and SSE is available. For example, even as far back as -
march=pentium3 rsqrtss is used:

float rsqrtf(float f)
{
        return 1.0f / sqrtf(f);
}

$ gcc -m32 -O3 -ffast-math -mfpmath=sse -march=pentium3 -S -o - rsqrt.c
[...]
rsqrtf:

        subl    $4, %esp
        rsqrtss 8(%esp), %xmm1
        movss   8(%esp), %xmm0
        mulss   %xmm1, %xmm0
        mulss   %xmm1, %xmm0
        mulss   .LC1, %xmm1
        addss   .LC0, %xmm0
        mulss   %xmm1, %xmm0
        movss   %xmm0, (%esp)
        flds    (%esp)
        popl    %eax
        ret
.LC0:
        .long   3225419776
.LC1:
        .long   3204448256

GCC does not, however, emit reciprocal estimates, even with -march=haswell.
It's possible that GCC does not implement any selection of RCPSS:

float recipf(float f)
{
        return 1.0f / f;
}

$ gcc -O3 -ffast-math -mfpmath=sse -march=haswell -S -o - recip.c
[...]
        vmovss  .LC0(%rip), %xmm1
        vdivss  %xmm0, %xmm1, %xmm0
        ret
.LC0:
        .long   1065353216

At the very least I'd like to see reciprocal square root estimates added by
default on x86 for -ffast-math. How can we get such a change implemented? Is it
a matter of building confidence in the safety and benefit of such a change?
Quuxplusone commented 9 years ago
(In reply to comment #23)
> I'd like to reopen this issue to ask why FeatureUseSqrtEst and
> FeatureUseRecipEst were only enabled for Jaguar. It has demonstrable
> benefits on many if not all of the other x86 microarchitectures. Is there a
> reason it cannot be enabled more broadly?
>
...
>
> At the very least I'd like to see reciprocal square root estimates added by
> default on x86 for -ffast-math. How can we get such a change implemented? Is
> it a matter of building confidence in the safety and benefit of such a
> change?

I don't think that the accuracy concerns are particularly significant (even if
the estimates are slightly less accurate than the AMD implementations). In
reality, I think that someone just needs to do the benchmarking on these
systems to show that it is worthwhile.

I think this can all be done with command-line args at this point with the in-
tree code.
Quuxplusone commented 9 years ago
(In reply to comment #24)
> (In reply to comment #23)
> > I'd like to reopen this issue to ask why FeatureUseSqrtEst and
> > FeatureUseRecipEst were only enabled for Jaguar. It has demonstrable
> > benefits on many if not all of the other x86 microarchitectures. Is there a
> > reason it cannot be enabled more broadly?
...
> I don't think that the accuracy concerns are particularly significant (even
> if the estimates are slightly less accurate than the AMD implementations).
> In reality, I think that someone just needs to do the benchmarking on these
> systems to show that it is worthwhile.
>
> I think this can all be done with command-line args at this point with the
> in-tree code.

That's correct. And I agree that both rcp and rsqrt can provide significant
throughput perf benefits on most (all?) recent x86 chips.

But I didn't want to overstep with the initial patch, so I proposed and checked
in the changes to enable these by default on Jaguar only because they clearly
provide a throughput *and* latency win there. The only way that I can think of
where using the estimates could cause a loss on Jaguar is via increased
register pressure/usage.

Each of rsqrt and rcp codegen can be enabled via a subtarget feature flag, so
someone just needs to run benchmarks that they care about on chips that they
care about to change the default settings.

Between the number of N-R iterations, N-R implementation, FMA, and HW accuracy,
there are many permutations of the codegen that should still be tested. Some of
that experimenting will require changing code in LLVM, but I think all of the
info and tools needed for that are in this bug report. I can provide
assistance, but that work is out of my current scope.
Quuxplusone commented 9 years ago

I'm not sure I'm the best person to come up with a set of tests for this issue, but if someone can design some tests I have a bunch of Intel hardware around: NHM, WSM, SBR, IVB, and HSW. Not sure I have anything older than Nehalem, but I'm not sure anyone cares about code generation on ancient hardware.

From my testing using my n-body test, I've found the performance benefits are remarkable, and the accuracy when comparing iterations between '-O3 -ffast-math' and '-O3 -mfpmath=387' showed that the margin of error was incredibly small. I'll try to characterize these two statements with Hard Facts(tm) once I re-run and record some data. But this test alone is probably not enough to satisfy the questions regarding codegen diversity.

Quuxplusone commented 9 years ago
(In reply to comment #23)
> GCC does not, however, emit reciprocal estimates, even with -march=haswell.
> It's possible that GCC does not implement any selection of RCPSS:

This is explained by:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55760#c1

Using rcpss broke SPEC2006, so they decided to hide that codegen behind another
flag. I don't think we want to play that kind of benchmarking game with LLVM.
Quuxplusone commented 9 years ago

Ugh. Is there a reason we didn't see precision issues with rcpss in your testing? Microarchitecture differences? Missed some edge cases?

If rcpss breaks real-world code then that's one thing. We can't test every possible code sequence under the sun or we'd never get anything done, but we can run it through smoke testing. And we can always revert a change if it breaks something.

I'd like to figure out a sane criteria for enabling these optimizations. How can we proceed? (I'm hoping for specific steps here because I don't want to see this fall by the wayside.)

Quuxplusone commented 9 years ago
(In reply to comment #28)
> Ugh. Is there a reason we didn't see precision issues with rcpss in your
> testing? Microarchitecture differences? Missed some edge cases?
>
> If rcpss breaks real-world code then that's one thing.

I think you need to be careful about setting criterion based on "break[ing]
real-world code" when talking about precision-reducing optimizations like this.
No matter what accuracy threshold you set, you'll always find something you
break.

> We can't test every
> possible code sequence under the sun or we'd never get anything done, but we
> can run it through smoke testing. And we can always revert a change if it
> breaks something.
>
> I'd like to figure out a sane criteria for enabling these optimizations. How
> can we proceed? (I'm hoping for specific steps here because I don't want to
> see this fall by the wayside.)

Sanjay has posted his accuracy testers, and running those would be good. I
recommend testing speed with an N-body code, or similar kernel. That should be
sufficient.
Quuxplusone commented 9 years ago
(In reply to comment #29)
> I think you need to be careful about setting criterion based on "break[ing]
> real-world code" when talking about precision-reducing optimizations like
> this. No matter what accuracy threshold you set, you'll always find
> something you break.

Exactly - if you review the data in comment 7 and comment 8, you'll see that
rsqrt estimates actually have worse accuracy using a single N-R step than rcp
estimates (relative to the IEEE answer). But GCC lets that go because it
doesn't affect SPEC.

Every app will have a different tolerance for FP accuracy, so we're allowing
the user to turn that codegen on/off, and in the case of rcp, we even have a
knob (although poorly implemented with a cl::opt) to change the number of N-R
steps. I'd like to add a similar N-R step param for rsqrt, but I was holding
off pending some kind of resolution for how to do that better (see: bug 21471
and http://lists.cs.uiuc.edu/pipermail/llvmdev/2014-November/078785.html).
Quuxplusone commented 9 years ago

Isn't hiding this behind -ffast-math already a sufficient "use at your own risk" caveat?

Quuxplusone commented 9 years ago
(In reply to comment #31)
> Isn't hiding this behind -ffast-math already a sufficient "use at your own
> risk" caveat?

Sure, but -ffast-math is a big hammer. We don't want users to have to turn that
off entirely just because the estimate codegen wasn't accurate or fast enough.
We're providing fine grain tuning with the other flags (see also bug 20912).

There are 2 issues though, so let's try to separate this:

(1) Is the accuracy good enough to enable this by default?

(2) Is the performance better enough to enable this by default?

I think I've already answered (1) with a "yes" when using one N-R step. The
data for Jaguar, Sandy Bridge, and Haswell is in this bug report. SPEC be
damned. This is just restating what Hal said in comment 24.

If one N-R step isn't good enough (or the app can tolerate even less accurate
results), the user can change that setting.

For (2), we already know that latency for an estimate sequence won't be as good
on the Intel chips but throughput will improve.

We should determine exactly how many parallel calcs are needed to make the
estimate code perform better. I think that answer will be just 2 or 3...that
seems like a low enough cross-over that we would then feel justified in
enabling the codegen for the chip in question. Ie, we wouldn't expect most real
world code to be perf-limited by the latency of 1 or 2 div/sqrt ops.
Quuxplusone commented 9 years ago

Some naive questions:

  1. Aren't the instruction scheduling definitions in lib/Target/X86/X86SchedSandyBridge.td and lib/Target/X86/X86SchedHaswell.td sufficient for LLVM to make smart decisions about whether or not the latency cost of the reciprocal square root estimate is worth it?

  2. It sounds like people would be satisfied with only using the reciprocal square root estimate by default in vector operations. That is, there's not sufficient justification to enable it for scalar operations because of the latency impact. Is that correct? If so, what would need to change to make it only enable the rsqrt estimate for vector operations?

Quuxplusone commented 9 years ago
(In reply to comment #33)
>
> 1. Aren't the instruction scheduling definitions in
> lib/Target/X86/X86SchedSandyBridge.td and lib/Target/X86/X86SchedHaswell.td
> sufficient for LLVM to make smart decisions about whether or not the latency
> cost of the reciprocal square root estimate is worth it?

I agree that the sched defs should provide more info to make a decision like
this, and I left FIXME comments in X86TargetLowering::getRsqrtEstimate() and
X86TargetLowering::getRecipEstimate() about that.

But when I looked into it, I didn't see a way to get that info into
X86TargetLowering. It's entirely possible that I just overlooked some API to
make it happen.

Note that the vectorizers don't use the sched timings either. They use an
alternate cost model (bug 21356, comment 5)...I don't know why we have this
other cost model.

The harder problem, as I wrote in the comments, is that Intel chips (and
AMD's?) have logic in the FPU HW that makes div/sqrt timing variable based on
the values of the operands. I don't see any way to account for that statically.
We might be able to detect some simple cases with some kind of FP value
tracking (see bug 17713), but there's no way to solve that in general.
Quuxplusone commented 9 years ago
(In reply to comment #33)
> 2. It sounds like people would be satisfied with only using the reciprocal
> square root estimate by default in vector operations. That is, there's not
> sufficient justification to enable it for scalar operations because of the
> latency impact. Is that correct? If so, what would need to change to make it
> only enable the rsqrt estimate for vector operations?

No - the vector versions of the real instructions and the estimate instructions
tend to have the same latency/throughput as the scalar versions. (There are
special cases like 256-bit vectors on Jaguar, where everything has to be double-
pumped for 256-bit vectors.)

So in general, we have the same issue with vectors as scalars. As long as you
can show that N pipelined executions of the vector estimate sequences are
faster than their full-precision counterparts, then we could enable that
codegen by default.
Quuxplusone commented 9 years ago
There's no way we can do what the customer wants in all cases, so give them
more control:
http://reviews.llvm.org/D8982
http://reviews.llvm.org/D8989
Quuxplusone commented 9 years ago
After:
http://reviews.llvm.org/rL240310

I think we finally have the behavior that was requested in comment 23 (for x86
only though): we generate square root and reciprocal estimate instructions with
-ffast-math except for scalar reciprocal estimates. This should match gcc's
defaults.

The interface is identical to gcc:
https://gcc.gnu.org/onlinedocs/gcc-4.9.2/gcc/i386-and-x86-64-Options.html#index-mrecip_003dopt-1627

except for one addition: you can control the number of estimate refinement
steps using something like this:
$ clang -O2 -ffast-math -mrecip=sqrtf:2

The '2' means use two refinement steps when generating a reciprocal square root
estimate for a single-precision (float).

The only thing missing (I hope): it's completely undocumented!

I assume something should be added here:
http://clang.llvm.org/docs/UsersManual.html#controlling-code-generation

and we should have a line of helptext for the clang option, so you at least
know the option exists with "clang -help".
Quuxplusone commented 9 years ago
(In reply to comment #37)
> This should match gcc's defaults.

That was wrong: bug 24063. It's more complicated than I thought.
Quuxplusone commented 8 years ago

Attached sqrt_accuracy_tester.c (3588 bytes, text/x-csrc): sqrt_accuracy_tester.c