Complex division is not optimised with -ffast-math

Quuxplusone commented 7 years ago


Bugzilla Link	PR31872
Status	NEW
Importance	P normal
Reported by	drraph@gmail.com
Reported on	2017-02-05 15:52:51 -0800
Last modified on	2019-10-10 10:05:17 -0700
Version	trunk
Hardware	PC Linux
CC	arphaman@gmail.com, compnerd@compnerd.org, david.l.kreitzer@intel.com, ditaliano@apple.com, drraph@gmail.com, hfinkel@anl.gov, joker.eph@gmail.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, michael.v.zolotukhin@gmail.com, mkuper@google.com, spatel+llvm@rotateright.com, t.p.northover@gmail.com
Fixed by commit(s)
Attachments
Blocks
Blocked by
See also

Consider:

#include <complex.h>
complex float f(complex float x, complex float y) {
  return x/y;
}

clang trunk with -O3 -march=core-avx2 but with or without -ffast-math gives:

f:                                      # @f
        vmovaps xmm2, xmm1
        vmovshdup       xmm1, xmm0      # xmm1 = xmm0[1,1,3,3]
        vmovshdup       xmm3, xmm2      # xmm3 = xmm2[1,1,3,3]
        jmp     __divsc3                # TAILCALL

However both gcc and ICC attempt to optimise this code when -ffast-math (or
equivalent) is enabled.

ICC appears to give the fastest code which is:

f:
        vcvtps2pd xmm2, xmm1                                    #3.12
        vcvtps2pd xmm4, xmm0                                    #3.12
        vmulpd    xmm8, xmm2, xmm2                              #3.12
        vunpckhpd xmm3, xmm2, xmm2                              #3.12
        vmulpd    xmm6, xmm3, xmm4                              #3.12
        vmovddup  xmm7, xmm2                                    #3.12
        vshufpd   xmm5, xmm4, xmm4, 1                           #3.12
        vshufpd   xmm9, xmm8, xmm8, 1                           #3.12
        vfmaddsub213pd xmm7, xmm5, xmm6                         #3.12
        vaddpd    xmm11, xmm8, xmm9                             #3.12
        vshufpd   xmm10, xmm7, xmm7, 1                          #3.12
        vdivpd    xmm12, xmm10, xmm11                           #3.12
        vcvtpd2ps xmm0, xmm12                                   #3.12
        ret

Quuxplusone commented 7 years ago

As a note, when Alex worked on the flang project, he had to experiment with
different division algorithms to find one that was sufficiently numerically
stable (so that the BLAS regressions tests would pass, as I recall). This may
or may not be relevant to fast-math complication, but just in case, see:
  https://github.com/hyp/flang/blob/master/lib/CodeGen/CGExprComplex.cpp

Quuxplusone commented 7 years ago

It looks like ICC doesn't use the Smith's version, it looks more like the naive division, i.e. "(a+ib) / (c+id) = ((ac+bd)/(cc+dd)) + i((bc-ad)/(cc+dd))". But they promote the floats to doubles, so I guess that makes the precision of the naive algorithm better.

What does ICC do for complex double with -ffast-math?

Quuxplusone commented 7 years ago

ICC appears to have a bug/feature when you change the type to complex double
where you have to set -fp-model fast=2 to get anything sensible (with fast=1
you get x87 code!). In the fast=2 case you get:

f:
        vunpcklpd xmm4, xmm2, xmm3                              #2.54
        vunpcklpd xmm6, xmm0, xmm1                              #2.54
        vunpckhpd xmm5, xmm4, xmm4                              #3.12
        vmulpd    xmm10, xmm4, xmm4                             #3.12
        vmulpd    xmm8, xmm5, xmm6                              #3.12
        vmovddup  xmm9, xmm4                                    #3.12
        vshufpd   xmm7, xmm6, xmm6, 1                           #3.12
        vshufpd   xmm11, xmm10, xmm10, 1                        #3.12
        vfmaddsub213pd xmm9, xmm7, xmm8                         #3.12
        vaddpd    xmm13, xmm10, xmm11                           #3.12
        vshufpd   xmm12, xmm9, xmm9, 1                          #3.12
        vdivpd    xmm0, xmm12, xmm13                            #3.12
        vunpckhpd xmm1, xmm0, xmm0                              #3.12
        ret                                                     #3.12

gcc -O3 -ffast-math -march=core-avx2 for the complex double code on the other
hand gives:

f:
        vmulsd  xmm4, xmm1, xmm3
        vmovapd xmm6, xmm0
        vmulsd  xmm5, xmm3, xmm3
        vmulsd  xmm6, xmm6, xmm3
        vfmadd231sd     xmm4, xmm0, xmm2
        vfmadd231sd     xmm5, xmm2, xmm2
        vfmsub132sd     xmm1, xmm6, xmm2
        vdivsd  xmm0, xmm4, xmm5
        vdivsd  xmm1, xmm1, xmm5
        ret

This may be better code, I am not expert enough to tell.

Quuxplusone commented 7 years ago

I think it could be a feature in ICC. With '-ffast-math' ICC promotes complex floats to doubles, do it's reasonable to assume that it would promote complex doubles to the 80 bit long doubles. That means it can't use SSE/AVX, and it has to emit the x87 FPU code.

What does ICC do for complex float and '-fp-model fast=2'? I suspect it doesn't promote them to doubles.

Looking at comparison between ICC and GCC it's interesting how ICC leverage the *pd instructions to reduce the number of arithmetic instructions in the code. I'm not sure if it's better than GCC's version though in terms of performance. It definitely looks worse from the code-size perspective.

Quuxplusone commented 7 years ago

I looked at ICC and this code as requested:

#include <complex.h>
complex float f(complex float x, complex float y) {
  return x/y;
}

Using '-fp-model fast=2  -march=core-avx2 -O3' you get

f:
        vmovlhps  xmm2, xmm1, xmm1                              #3.12
        vmulps    xmm8, xmm2, xmm2                              #3.12
        vshufps   xmm9, xmm8, xmm8, 177                         #3.12
        vmovlhps  xmm4, xmm0, xmm0                              #3.12
        vaddps    xmm10, xmm8, xmm9                             #3.12
        vrcpps    xmm11, xmm10                                  #3.12
        vmovshdup xmm3, xmm2                                    #3.12
        vaddps    xmm12, xmm11, xmm11                           #3.12
        vmulps    xmm6, xmm4, xmm3                              #3.12
        vmulps    xmm14, xmm11, xmm10                           #3.12
        vmovsldup xmm7, xmm2                                    #3.12
        vshufps   xmm5, xmm4, xmm4, 177                         #3.12
        vfmaddsub213ps xmm7, xmm5, xmm6                         #3.12
        vfnmadd213ps xmm14, xmm11, xmm12                        #3.12
        vshufps   xmm13, xmm7, xmm7, 177                        #3.12
        vmulps    xmm0, xmm13, xmm14                            #3.12
        ret

This is slightly longer code than you get for fast=1 performing 4
multiplications and one reciprocal.

It might be worth mentioning that
http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-
architectures-optimization-manual.pdf discusses vectorized complex arithmetic,
for example in 6.6.1.1 which covers multiplication and division using SSE3.  I
don't know if it's helpful here.

Quuxplusone commented 7 years ago

For completeness:

#include <complex.h>
complex long double f(complex long double x, complex long double y) {
  return x/y;
}

In clang with -march=core-avx2 -Ofast  you get

f:                                      # @f
        sub     rsp, 72
        fld     xword ptr [rsp + 80]
        fld     xword ptr [rsp + 96]
        fld     xword ptr [rsp + 112]
        fld     xword ptr [rsp + 128]
        fstp    xword ptr [rsp + 48]
        fstp    xword ptr [rsp + 32]
        fstp    xword ptr [rsp + 16]
        fstp    xword ptr [rsp]
        call    __divxc3
        add     rsp, 72
        ret

In gcc and ICC the __divxc3 code is optimised (very differently) but still
using x87 instructions.   The gcc code is much shorter (and I suspect better)
than the ICC code and is:

f:
        fld     TBYTE PTR [rsp+8]
        fld     TBYTE PTR [rsp+24]
        fld     TBYTE PTR [rsp+40]
        fld     TBYTE PTR [rsp+56]
        fld     st(1)
        fmul    st, st(2)
        fld     st(1)
        fmul    st, st(2)
        faddp   st(1), st
        fld     st(4)
        fmul    st, st(3)
        fld     st(4)
        fmul    st, st(3)
        faddp   st(1), st
        fdiv    st, st(1)
        fxch    st(4)
        fmulp   st(3), st
        fxch    st(4)
        fmulp   st(1), st
        fsubp   st(1), st
        fdivrp  st(2), st
        ret

Quuxplusone commented 7 years ago

Thanks for collecting all of this information! It looks like I was right about ICC not promoting complex float with '-fp-model fast=2'.

I wonder if clang/llvm should follow ICC and try to promote the floating point types with '-ffast-math' or should it just use the original type like GCC seems to do even though the numerical stability might be effected.

Quuxplusone / LLVMBugzillaTest

Complex division is not optimised with -ffast-math #30844