Open Quuxplusone opened 10 years ago
Test program output on Sandy Bridge:
Total tests = 2113929216
Inexact results with one N-R step = 657670214
Results off by > 1 ULP with one N-R step = 10207460
Inexact results with two N-R steps = 353480402
Results off by > 1 ULP with two N-R steps = 0
Test program output on Jaguar:
Total tests = 2113929216
Inexact results with one N-R step = 646439942
Results off by > 1 ULP with one N-R step = 5167130
Inexact results with two N-R steps = 374437446
Results off by > 1 ULP with two N-R steps = 0
On second thought...I wrote a similar test loop for the rsqrt estimate, and we
don't get within 1 ULP in many cases, but that estimate is apparently generally
accepted.
This leads to the more general question: how accurate does fast-math need to be
relative to IEEE math?
The real reason that gcc differs on rsqrt and recip optimizations has to do
with passing a SPEC test rather than accuracy:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55760#c1
(In reply to comment #3)
> On second thought...I wrote a similar test loop for the rsqrt estimate, and
> we don't get within 1 ULP in many cases, but that estimate is apparently
> generally accepted.
For "fast math" we can probably make do with 1-2 ULPs. This is, of course, also
really application-dependent.
>
> This leads to the more general question: how accurate does fast-math need to
> be relative to IEEE math?
>
> The real reason that gcc differs on rsqrt and recip optimizations has to do
> with passing a SPEC test rather than accuracy:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55760#c1
We should likely make this configurable (at least via a command-line option).
How much does the performance change if you use two?
Attached recip_accuracy_tester.c
(3362 bytes, text/plain): recip_accuracy_tester.c
Attached rsqrt_accuracy_tester.c
(3255 bytes, text/plain): rsqrt_accuracy_tester.c
Before getting to the perf question, I want to make sure we understand the
accuracy angle.
Here are the results from the attached test loops for recip and rsqrt running
on Sandy Bridge:
Reciprocal estimate:
Total tests = 2113929216
Inexact results with one N-R step = 657670214
Estimate missed by 1 ULP: 647462754
Estimate missed by 2 ULP: 10186690
Estimate missed by 3 ULP: 20770
Estimate missed by 4 ULP: 0
Estimate missed by >= 5 ULP with one N-R step = 0
Inexact results with two N-R steps = 353480402
Estimate missed by 1 ULP: 353480402
Estimate missed by 2 ULP: 0
Estimate missed by 3 ULP: 0
Estimate missed by 4 ULP: 0
Estimate missed by >= 5 ULP with two N-R steps = 0
Rsqrt estimate:
Total tests = 2130706432
Inexact results with one N-R step = 886838505
Estimate missed by 1 ULP: 830844699
Estimate missed by 2 ULP: 53541827
Estimate missed by 3 ULP: 2399100
Estimate missed by 4 ULP: 52879
Estimate missed by >= 5 ULP with one N-R step = 0
Inexact results with two N-R steps = 758708881
Estimate missed by 1 ULP: 742129840
Estimate missed by 2 ULP: 16578375
Estimate missed by 3 ULP: 666
Estimate missed by 4 ULP: 0
Estimate missed by >= 5 ULP with two N-R steps = 0
And the same programs running on Jaguar:
Reciprocal estimate:
Total tests = 2113929216
Inexact results with one N-R step = 646439942
Estimate missed by 1 ULP: 641272812
Estimate missed by 2 ULP: 5161649
Estimate missed by 3 ULP: 5481
Estimate missed by 4 ULP: 0
Estimate missed by >= 5 ULP with one N-R step = 0
Inexact results with two N-R steps = 374437446
Estimate missed by 1 ULP: 374437446
Estimate missed by 2 ULP: 0
Estimate missed by 3 ULP: 0
Estimate missed by 4 ULP: 0
Estimate missed by >= 5 ULP with two N-R steps = 0
Rsqrt estimate:
Total tests = 2130706432
Inexact results with one N-R step = 793771474
Estimate missed by 1 ULP: 773089565
Estimate missed by 2 ULP: 20629077
Estimate missed by 3 ULP: 52832
Estimate missed by 4 ULP: 0
Estimate missed by >= 5 ULP with one N-R step = 0
Inexact results with two N-R steps = 723046956
Estimate missed by 1 ULP: 714982537
Estimate missed by 2 ULP: 8064404
Estimate missed by 3 ULP: 15
Estimate missed by 4 ULP: 0
Estimate missed by >= 5 ULP with two N-R steps = 0
So Jaguar's estimate function looks better, particularly for rsqrt. It doesn't miss either estimate by more than 3 ULPs using one refinement step. Sandy Bridge only gets to within 4 ULPs on rsqrt for some cases.
I think the good news is that at least we're within 4 ULPs in all cases.
If we judge the recip estimate purely on this accuracy metric, the single refinement result for it is better than what we (and other compilers) already accept as good enough for rsqrt (where we use just one refinement step).
There are at least two additional variables that we may want to consider:
1. The implementation of the refinement algorithm matters. For example, we can
save an fmul on the reciprocal N-R step by coding the N-R equation as:
est1 = est0 * (2.0f - est0 * f);
instead of:
est1 = est0 + est0 * (1.0f - est0 * f);
But this leads to a less accurate final answer according to:
http://en.wikipedia.org/wiki/Division_algorithm#Fast_division_methods
And testing on Jaguar appears to confirm that:
Total tests = 2113929216
Inexact results with one N-R step = 833587524
Estimate missed by 1 ULP: 819988344
Estimate missed by 2 ULP: 13579776
Estimate missed by 3 ULP: 19404
Estimate missed by 4 ULP: 0
Estimate missed by >= 5 ULP with one N-R step = 0
Inexact results with two N-R steps = 617003604
Estimate missed by 1 ULP: 616978404
Estimate missed by 2 ULP: 25200
Estimate missed by 3 ULP: 0
Estimate missed by 4 ULP: 0
Estimate missed by >= 5 ULP with two N-R steps = 0
The total misses are much higher, but we actually have slightly fewer 3 ULP
misses with the single refinement step. Probably not worth saving a flop for
that?
2. Everything should be retested on a machine (like Haswell) that has FMA. The
added precision of FMA should mean better results, and it's possible that the
initial estimate has improved too (certainly hope it hasn't gotten worse).
(In reply to comment #10)
> The total misses are much higher, but we actually have slightly fewer 3 ULP
> misses with the single refinement step. Probably not worth saving a flop for
> that?
Disregard - too many data points! I was looking at the Sandy Bridge number
(20770) rather than the Jaguar number (5481). Using the longer equation appears
to produce better results in all head-to-head categories for Jaguar.
Now back to the perf question. Let me focus just on Jaguar for the moment,
because we'll want to enable this per-CPU for x86 as we did with rsqrt; having
a feature flag for this codegen gives users an easy opt-in or opt-out.
The divss instruction (IEEE-compliant) takes 14 cycles (not pipelined of
course).
The rcpss instruction alone takes 2 cycles.
The rcpss inst + one NR step using mul/sub/mul/add takes 12 cycles.
The rcpss inst + two NR steps using the same mul/sub/mul/add sequence takes 22
cycles.
The breakdown of the refinement cost is: 2 cycles for mul, 3 cycles for add/sub
(how this chip can multiply faster than add is a different story :) ). For the
4 instruction N-R implementation, all ops are pipelined but dependent, so we
just add up the individual cycle counts.
So we have a clear win for using a single N-R step with -ffast-math: it's
faster in raw latency and 14x faster for throughput because all instructions
are pipelined. Using 2 N-R steps means we have to weigh the latency vs.
throughput scenarios.
The final question - at least in my mind - is: what is the output around 1.0f?
gcc cited that as part of the reason for not enabling recip with -ffast-math,
but that doesn't match the internal discussion or the actual behavior; they
*do* generate rcpps (vector) with -ffast-math, so not generating rcpss (scalar)
for accuracy reasons does not follow. The 2 instructions should have identical
output on all HW.
input 0x3f7fffff: 9.999999e-01
ieee = 1.000000e+00 (0x3f800001)
est_nr1 = 1.000000e+00 (0x3f800000)
est_nr2 = 1.000000e+00 (0x3f800000)
input 0x3f800000: 1.000000e+00
ieee = 1.000000e+00 (0x3f800000)
est_nr1 = 9.999999e-01 (0x3f7fffff)
est_nr2 = 1.000000e+00 (0x3f800000)
input 0x3f800001: 1.000000e+00
ieee = 9.999999e-01 (0x3f7ffffe)
est_nr1 = 9.999998e-01 (0x3f7ffffd)
est_nr2 = 9.999999e-01 (0x3f7ffffe)
Using one N-R step, we're off by 1 ULP in all 3 cases. The second N-R step
corrects the second and third cases to match the IEEE result.
This is using the 4 op N-R sequence on Jaguar.
So I agree with Hal's statement in comment 4 that we should offer the number of
N-R refinement steps as a parameter to the user. They can weigh the perf vs.
accuracy trade-off per application. I would default to one N-R step.
Patch proposal for review:
http://reviews.llvm.org/D6175
Allow rcp* codegen for btver2 checked in with:
http://reviews.llvm.org/rL221706
I still need to expose the number of refinement steps as a command-line option.
We may also want to offer multiple N-R implementations as an option (as we did
for rsqrt).
For the record, I ran the attached reciprocal test program on Haswell (AVX2),
and it looks like Intel did not alter the estimate function vs. Sandy Bridge.
These are identical results to SB:
$ ./recip_avx2_no_fma
Total tests = 2113929216
Inexact results with one N-R step = 657670214
Estimate missed by 1 ULP: 647462754
Estimate missed by 2 ULP: 10186690
Estimate missed by 3 ULP: 20770
Estimate missed by 4 ULP: 0
Estimate missed by >= 5 ULP with one N-R step = 0
Inexact results with two N-R steps = 353480402
Estimate missed by 1 ULP: 353480402
Estimate missed by 2 ULP: 0
Estimate missed by 3 ULP: 0
Estimate missed by 4 ULP: 0
Estimate missed by >= 5 ULP with two N-R steps = 0
But when we use FMA for the N-R refinement (the 4-step implementation just
becomes 2 FMAs), we get much more accurate results:
$ ./recip_avx2_fma
Total tests = 2113929216
Inexact results with one N-R step = 331401168
Estimate missed by 1 ULP: 330208704
Estimate missed by 2 ULP: 1192464
Estimate missed by 3 ULP: 0
Estimate missed by 4 ULP: 0
Estimate missed by >= 5 ULP with one N-R step = 0
Inexact results with two N-R steps = 252
Estimate missed by 1 ULP: 252
Estimate missed by 2 ULP: 0
Estimate missed by 3 ULP: 0
Estimate missed by 4 ULP: 0
Estimate missed by >= 5 ULP with two N-R steps = 0
With FMA3 or FMA4 enabled are the FMA instructions getting used in the fast-math NR iteration automatically or is it something that has to be explicitly set?
(In reply to comment #17)
> With FMA3 or FMA4 enabled are the FMA instructions getting used in the
> fast-math NR iteration automatically or is it something that has to be
> explicitly set?
FMA should be automatically generated when -ffast-math is used.
For the test program, I made sure I wasn't replacing the actual div
instructions or using any other FP shortcuts by compiling with:
$ clang recip_accuracy_tester.c -O2 -march=core-avx2 -ffp-contract=fast
Refinement accuracy made configurable via cl::opt:
http://llvm.org/viewvc/llvm-project?view=revision&revision=221819
With that, I think we can resolve this bug as fixed.
The customer can control generation of rcpps/rcpss with "-use-recip-est" and
control the speed vs. accuracy trade-off with "-x86-recip-refinement-steps".
Is it worth adding a test that checks that FMA instructions are being used on FMA3/FMA4 targets?
I have no context for any of the floating point precision stuff, but is a backend flag really the right way to expose this? I thought we were about to launch a witch hunt to remove all these flags that create hidden global state.
Are you sure you don't want an instruction, module, or function flag?
Hi Reid,
I wasn't aware that we were trying to get rid of these, so I was using the existing cl::opt code that I've seen in the vectorizers and backend as a model.
Certainly, if there's a better way, I'll change this. Is there an existing param that you recommend using as a template for this sort of code?
I'd like to reopen this issue to ask why FeatureUseSqrtEst and
FeatureUseRecipEst were only enabled for Jaguar. It has demonstrable benefits
on many if not all of the other x86 microarchitectures. Is there a reason it
cannot be enabled more broadly?
I'm not sure what GCC's criteria are for enabling it, but I've seen the
reciprocal square root estimate enabled on every x86 -march= I know of whenever
-ffast-math is specified and SSE is available. For example, even as far back as
-march=pentium3, rsqrtss is used:
float rsqrtf(float f)
{
    return 1.0f / sqrtf(f);
}
$ gcc -m32 -O3 -ffast-math -mfpmath=sse -march=pentium3 -S -o - rsqrt.c
[...]
rsqrtf:
subl $4, %esp
rsqrtss 8(%esp), %xmm1
movss 8(%esp), %xmm0
mulss %xmm1, %xmm0
mulss %xmm1, %xmm0
mulss .LC1, %xmm1
addss .LC0, %xmm0
mulss %xmm1, %xmm0
movss %xmm0, (%esp)
flds (%esp)
popl %eax
ret
.LC0:
.long 3225419776
.LC1:
.long 3204448256
GCC does not, however, emit reciprocal estimates, even with -march=haswell.
It's possible that GCC does not implement any selection of RCPSS:
float recipf(float f)
{
    return 1.0f / f;
}
$ gcc -O3 -ffast-math -mfpmath=sse -march=haswell -S -o - recip.c
[...]
vmovss .LC0(%rip), %xmm1
vdivss %xmm0, %xmm1, %xmm0
ret
.LC0:
.long 1065353216
At the very least I'd like to see reciprocal square root estimates added by
default on x86 for -ffast-math. How can we get such a change implemented? Is it
a matter of building confidence in the safety and benefit of such a change?
(In reply to comment #23)
> I'd like to reopen this issue to ask why FeatureUseSqrtEst and
> FeatureUseRecipEst were only enabled for Jaguar. It has demonstrable
> benefits on many if not all of the other x86 microarchitectures. Is there a
> reason it cannot be enabled more broadly?
>
...
>
> At the very least I'd like to see reciprocal square root estimates added by
> default on x86 for -ffast-math. How can we get such a change implemented? Is
> it a matter of building confidence in the safety and benefit of such a
> change?
I don't think that the accuracy concerns are particularly significant (even if
the estimates are slightly less accurate than the AMD implementations). In
reality, I think that someone just needs to do the benchmarking on these
systems to show that it is worthwhile.
I think this can all be done with command-line args at this point with the
in-tree code.
(In reply to comment #24)
> (In reply to comment #23)
> > I'd like to reopen this issue to ask why FeatureUseSqrtEst and
> > FeatureUseRecipEst were only enabled for Jaguar. It has demonstrable
> > benefits on many if not all of the other x86 microarchitectures. Is there a
> > reason it cannot be enabled more broadly?
...
> I don't think that the accuracy concerns are particularly significant (even
> if the estimates are slightly less accurate than the AMD implementations).
> In reality, I think that someone just needs to do the benchmarking on these
> systems to show that it is worthwhile.
>
> I think this can all be done with command-line args at this point with the
> in-tree code.
That's correct. And I agree that both rcp and rsqrt can provide significant
throughput perf benefits on most (all?) recent x86 chips.
But I didn't want to overstep with the initial patch, so I proposed and checked
in the changes to enable these by default on Jaguar only because they clearly
provide a throughput *and* latency win there. The only way that I can think of
where using the estimates could cause a loss on Jaguar is via increased
register pressure/usage.
Each of rsqrt and rcp codegen can be enabled via a subtarget feature flag, so
someone just needs to run benchmarks that they care about on chips that they
care about to change the default settings.
Between the number of N-R iterations, N-R implementation, FMA, and HW accuracy,
there are many permutations of the codegen that should still be tested. Some of
that experimenting will require changing code in LLVM, but I think all of the
info and tools needed for that are in this bug report. I can provide
assistance, but that work is out of my current scope.
I'm not sure I'm the best person to come up with a set of tests for this issue, but if someone can design some tests I have a bunch of Intel hardware around: NHM, WSM, SBR, IVB, and HSW. Not sure I have anything older than Nehalem, but I'm not sure anyone cares about code generation on ancient hardware.
From my testing using my n-body test, I've found the performance benefits are remarkable, and the accuracy when comparing iterations between '-O3 -ffast-math' and '-O3 -mfpmath=387' showed that the margin of error was incredibly small. I'll try to characterize these two statements with Hard Facts(tm) once I re-run and record some data. But this test alone is probably not enough to satisfy the questions regarding codegen diversity.
(In reply to comment #23)
> GCC does not, however, emit reciprocal estimates, even with -march=haswell.
> It's possible that GCC does not implement any selection of RCPSS:
This is explained by:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55760#c1
Using rcpss broke SPEC2006, so they decided to hide that codegen behind another
flag. I don't think we want to play that kind of benchmarking game with LLVM.
Ugh. Is there a reason we didn't see precision issues with rcpss in your testing? Microarchitecture differences? Missed some edge cases?
If rcpss breaks real-world code then that's one thing. We can't test every possible code sequence under the sun or we'd never get anything done, but we can run it through smoke testing. And we can always revert a change if it breaks something.
I'd like to figure out sane criteria for enabling these optimizations. How can we proceed? (I'm hoping for specific steps here because I don't want to see this fall by the wayside.)
(In reply to comment #28)
> Ugh. Is there a reason we didn't see precision issues with rcpss in your
> testing? Microarchitecture differences? Missed some edge cases?
>
> If rcpss breaks real-world code then that's one thing.
I think you need to be careful about setting criteria based on "break[ing]
real-world code" when talking about precision-reducing optimizations like this.
No matter what accuracy threshold you set, you'll always find something you
break.
> We can't test every
> possible code sequence under the sun or we'd never get anything done, but we
> can run it through smoke testing. And we can always revert a change if it
> breaks something.
>
> I'd like to figure out a sane criteria for enabling these optimizations. How
> can we proceed? (I'm hoping for specific steps here because I don't want to
> see this fall by the wayside.)
Sanjay has posted his accuracy testers, and running those would be good. I
recommend testing speed with an N-body code, or similar kernel. That should be
sufficient.
(In reply to comment #29)
> I think you need to be careful about setting criterion based on "break[ing]
> real-world code" when talking about precision-reducing optimizations like
> this. No matter what accuracy threshold you set, you'll always find
> something you break.
Exactly - if you review the data in comment 7 and comment 8, you'll see that
rsqrt estimates actually have worse accuracy using a single N-R step than rcp
estimates (relative to the IEEE answer). But GCC lets that go because it
doesn't affect SPEC.
Every app will have a different tolerance for FP accuracy, so we're allowing
the user to turn that codegen on/off, and in the case of rcp, we even have a
knob (although poorly implemented with a cl::opt) to change the number of N-R
steps. I'd like to add a similar N-R step param for rsqrt, but I was holding
off pending some kind of resolution for how to do that better (see: bug 21471
and http://lists.cs.uiuc.edu/pipermail/llvmdev/2014-November/078785.html).
Isn't hiding this behind -ffast-math already a sufficient "use at your own risk" caveat?
(In reply to comment #31)
> Isn't hiding this behind -ffast-math already a sufficient "use at your own
> risk" caveat?
Sure, but -ffast-math is a big hammer. We don't want users to have to turn that
off entirely just because the estimate codegen wasn't accurate or fast enough.
We're providing fine-grained tuning with the other flags (see also bug 20912).
There are 2 issues though, so let's try to separate this:
(1) Is the accuracy good enough to enable this by default?
(2) Is the performance better enough to enable this by default?
I think I've already answered (1) with a "yes" when using one N-R step. The
data for Jaguar, Sandy Bridge, and Haswell is in this bug report. SPEC be
damned. This is just restating what Hal said in comment 24.
If one N-R step isn't good enough (or the app can tolerate even less accurate
results), the user can change that setting.
For (2), we already know that latency for an estimate sequence won't be as good
on the Intel chips but throughput will improve.
We should determine exactly how many parallel calcs are needed to make the
estimate code perform better. I think that answer will be just 2 or 3... that
seems like a low enough cross-over that we would then feel justified in
enabling the codegen for the chip in question. I.e., we wouldn't expect most
real-world code to be perf-limited by the latency of 1 or 2 div/sqrt ops.
Some naive questions:
1. Aren't the instruction scheduling definitions in lib/Target/X86/X86SchedSandyBridge.td and lib/Target/X86/X86SchedHaswell.td sufficient for LLVM to make smart decisions about whether or not the latency cost of the reciprocal square root estimate is worth it?
2. It sounds like people would be satisfied with only using the reciprocal square root estimate by default in vector operations. That is, there's not sufficient justification to enable it for scalar operations because of the latency impact. Is that correct? If so, what would need to change to make it only enable the rsqrt estimate for vector operations?
(In reply to comment #33)
>
> 1. Aren't the instruction scheduling definitions in
> lib/Target/X86/X86SchedSandyBridge.td and lib/Target/X86/X86SchedHaswell.td
> sufficient for LLVM to make smart decisions about whether or not the latency
> cost of the reciprocal square root estimate is worth it?
I agree that the sched defs should provide more info to make a decision like
this, and I left FIXME comments in X86TargetLowering::getRsqrtEstimate() and
X86TargetLowering::getRecipEstimate() about that.
But when I looked into it, I didn't see a way to get that info into
X86TargetLowering. It's entirely possible that I just overlooked some API to
make it happen.
Note that the vectorizers don't use the sched timings either. They use an
alternate cost model (bug 21356, comment 5)...I don't know why we have this
other cost model.
The harder problem, as I wrote in the comments, is that Intel chips (and
AMD's?) have logic in the FPU HW that makes div/sqrt timing variable based on
the values of the operands. I don't see any way to account for that statically.
We might be able to detect some simple cases with some kind of FP value
tracking (see bug 17713), but there's no way to solve that in general.
(In reply to comment #33)
> 2. It sounds like people would be satisfied with only using the reciprocal
> square root estimate by default in vector operations. That is, there's not
> sufficient justification to enable it for scalar operations because of the
> latency impact. Is that correct? If so, what would need to change to make it
> only enable the rsqrt estimate for vector operations?
No - the vector versions of the real instructions and the estimate instructions
tend to have the same latency/throughput as the scalar versions. (There are
special cases like 256-bit vectors on Jaguar, where everything has to be
double-pumped.)
So in general, we have the same issue with vectors as scalars. As long as you
can show that N pipelined executions of the vector estimate sequences are
faster than their full-precision counterparts, then we could enable that
codegen by default.
There's no way we can do what the customer wants in all cases, so give them
more control:
http://reviews.llvm.org/D8982
http://reviews.llvm.org/D8989
After:
http://reviews.llvm.org/rL240310
I think we finally have the behavior that was requested in comment 23 (for x86
only though): we generate square root and reciprocal estimate instructions with
-ffast-math except for scalar reciprocal estimates. This should match gcc's
defaults.
The interface is identical to gcc:
https://gcc.gnu.org/onlinedocs/gcc-4.9.2/gcc/i386-and-x86-64-Options.html#index-mrecip_003dopt-1627
except for one addition: you can control the number of estimate refinement
steps using something like this:
$ clang -O2 -ffast-math -mrecip=sqrtf:2
The '2' means use two refinement steps when generating a reciprocal square root
estimate for a single-precision float.
The only thing missing (I hope): it's completely undocumented!
I assume something should be added here:
http://clang.llvm.org/docs/UsersManual.html#controlling-code-generation
and we should have a line of helptext for the clang option, so you at least
know the option exists with "clang -help".
(In reply to comment #37)
> This should match gcc's defaults.
That was wrong: bug 24063. It's more complicated than I thought.
Attached sqrt_accuracy_tester.c
(3588 bytes, text/x-csrc): sqrt_accuracy_tester.c