Open Quuxplusone opened 7 years ago
Bugzilla Link | PR31872 |
Status | NEW |
Importance | P normal |
Reported by | drraph@gmail.com |
Reported on | 2017-02-05 15:52:51 -0800 |
Last modified on | 2019-10-10 10:05:17 -0700 |
Version | trunk |
Hardware | PC Linux |
CC | arphaman@gmail.com, compnerd@compnerd.org, david.l.kreitzer@intel.com, ditaliano@apple.com, drraph@gmail.com, hfinkel@anl.gov, joker.eph@gmail.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, michael.v.zolotukhin@gmail.com, mkuper@google.com, spatel+llvm@rotateright.com, t.p.northover@gmail.com |
Fixed by commit(s) | |
Attachments | |
Blocks | |
Blocked by | |
See also |
As a note, when Alex worked on the flang project, he had to experiment with
different division algorithms to find one that was sufficiently numerically
stable (so that the BLAS regressions tests would pass, as I recall). This may
or may not be relevant to fast-math complication, but just in case, see:
https://github.com/hyp/flang/blob/master/lib/CodeGen/CGExprComplex.cpp
It looks like ICC doesn't use the Smith's version, it looks more like the naive division, i.e. "(a+ib) / (c+id) = ((ac+bd)/(cc+dd)) + i((bc-ad)/(cc+dd))". But they promote the floats to doubles, so I guess that makes the precision of the naive algorithm better.
What does ICC do for complex double
with -ffast-math?
ICC appears to have a bug/feature when you change the type to complex double
where you have to set -fp-model fast=2 to get anything sensible (with fast=1
you get x87 code!). In the fast=2 case you get:
f:
vunpcklpd xmm4, xmm2, xmm3 #2.54
vunpcklpd xmm6, xmm0, xmm1 #2.54
vunpckhpd xmm5, xmm4, xmm4 #3.12
vmulpd xmm10, xmm4, xmm4 #3.12
vmulpd xmm8, xmm5, xmm6 #3.12
vmovddup xmm9, xmm4 #3.12
vshufpd xmm7, xmm6, xmm6, 1 #3.12
vshufpd xmm11, xmm10, xmm10, 1 #3.12
vfmaddsub213pd xmm9, xmm7, xmm8 #3.12
vaddpd xmm13, xmm10, xmm11 #3.12
vshufpd xmm12, xmm9, xmm9, 1 #3.12
vdivpd xmm0, xmm12, xmm13 #3.12
vunpckhpd xmm1, xmm0, xmm0 #3.12
ret #3.12
gcc -O3 -ffast-math -march=core-avx2 for the complex double code on the other
hand gives:
f:
vmulsd xmm4, xmm1, xmm3
vmovapd xmm6, xmm0
vmulsd xmm5, xmm3, xmm3
vmulsd xmm6, xmm6, xmm3
vfmadd231sd xmm4, xmm0, xmm2
vfmadd231sd xmm5, xmm2, xmm2
vfmsub132sd xmm1, xmm6, xmm2
vdivsd xmm0, xmm4, xmm5
vdivsd xmm1, xmm1, xmm5
ret
This may be better code, I am not expert enough to tell.
I think it could be a feature in ICC. With '-ffast-math' ICC promotes complex floats to doubles, do it's reasonable to assume that it would promote complex doubles to the 80 bit long doubles. That means it can't use SSE/AVX, and it has to emit the x87 FPU code.
What does ICC do for complex float
and '-fp-model fast=2'? I suspect it doesn't promote them to doubles.
Looking at comparison between ICC and GCC it's interesting how ICC leverage the *pd instructions to reduce the number of arithmetic instructions in the code. I'm not sure if it's better than GCC's version though in terms of performance. It definitely looks worse from the code-size perspective.
I looked at ICC and this code as requested:
#include <complex.h>
complex float f(complex float x, complex float y) {
return x/y;
}
Using '-fp-model fast=2 -march=core-avx2 -O3' you get
f:
vmovlhps xmm2, xmm1, xmm1 #3.12
vmulps xmm8, xmm2, xmm2 #3.12
vshufps xmm9, xmm8, xmm8, 177 #3.12
vmovlhps xmm4, xmm0, xmm0 #3.12
vaddps xmm10, xmm8, xmm9 #3.12
vrcpps xmm11, xmm10 #3.12
vmovshdup xmm3, xmm2 #3.12
vaddps xmm12, xmm11, xmm11 #3.12
vmulps xmm6, xmm4, xmm3 #3.12
vmulps xmm14, xmm11, xmm10 #3.12
vmovsldup xmm7, xmm2 #3.12
vshufps xmm5, xmm4, xmm4, 177 #3.12
vfmaddsub213ps xmm7, xmm5, xmm6 #3.12
vfnmadd213ps xmm14, xmm11, xmm12 #3.12
vshufps xmm13, xmm7, xmm7, 177 #3.12
vmulps xmm0, xmm13, xmm14 #3.12
ret
This is slightly longer code than you get for fast=1 performing 4
multiplications and one reciprocal.
It might be worth mentioning that
http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-
architectures-optimization-manual.pdf discusses vectorized complex arithmetic,
for example in 6.6.1.1 which covers multiplication and division using SSE3. I
don't know if it's helpful here.
For completeness:
#include <complex.h>
complex long double f(complex long double x, complex long double y) {
return x/y;
}
In clang with -march=core-avx2 -Ofast you get
f: # @f
sub rsp, 72
fld xword ptr [rsp + 80]
fld xword ptr [rsp + 96]
fld xword ptr [rsp + 112]
fld xword ptr [rsp + 128]
fstp xword ptr [rsp + 48]
fstp xword ptr [rsp + 32]
fstp xword ptr [rsp + 16]
fstp xword ptr [rsp]
call __divxc3
add rsp, 72
ret
In gcc and ICC the __divxc3 code is optimised (very differently) but still
using x87 instructions. The gcc code is much shorter (and I suspect better)
than the ICC code and is:
f:
fld TBYTE PTR [rsp+8]
fld TBYTE PTR [rsp+24]
fld TBYTE PTR [rsp+40]
fld TBYTE PTR [rsp+56]
fld st(1)
fmul st, st(2)
fld st(1)
fmul st, st(2)
faddp st(1), st
fld st(4)
fmul st, st(3)
fld st(4)
fmul st, st(3)
faddp st(1), st
fdiv st, st(1)
fxch st(4)
fmulp st(3), st
fxch st(4)
fmulp st(1), st
fsubp st(1), st
fdivrp st(2), st
ret
Thanks for collecting all of this information! It looks like I was right about ICC not promoting complex float
with '-fp-model fast=2'.
I wonder if clang/llvm should follow ICC and try to promote the floating point types with '-ffast-math' or should it just use the original type like GCC seems to do even though the numerical stability might be effected.