Optimization pass to combine sin, cos library calls to sincos

Quuxplusone commented 12 years ago


Bugzilla Link	PR13204
Status	NEW
Importance	P enhancement
Reported by	Andrew Wagner (drewm1980@gmail.com)
Reported on	2012-06-25 10:06:01 -0700
Last modified on	2018-04-26 06:58:51 -0700
Version	trunk
Hardware	PC Linux
CC	andrea.dibiagio@gmail.com, baldrick@free.fr, evan.cheng@apple.com, geek4civic@gmail.com, greg.bedwell@sony.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, paul.redmond@intel.com, rob.lougher@gmail.com, tim.besard@elis.ugent.be, trass3r@gmail.com
Fixed by commit(s)
Attachments
Blocks
Blocked by
See also

On gcc, when calls to both sin() and cos() math library calls are present, the
get optimized to a single call to sincos().

Here is a newsgroup thread with a small benchmark:
http://lists.cs.uiuc.edu/pipermail/llvmdev/2011-September/043234.html

I checked a quick code snippet:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main(int argc, char **argv) {
float x = *(float*) argv[0];
return (int) sinf(x) + cosf(x) + sinf(x) + cosf(x);
}

in the online demo, and this still seems to be the case in clang 3:

.Ltmp2:
    .cfi_def_cfa_offset 16
    movq    (%rsi), %rax
    movss   (%rax), %xmm0
    movss   %xmm0, (%rsp)           # 4-byte Spill
    callq   cosf
    movss   %xmm0, 4(%rsp)          # 4-byte Spill
    movss   (%rsp), %xmm0           # 4-byte Reload
    callq   sinf
    cvttss2si   %xmm0, %eax
    cvtsi2ss    %eax, %xmm1
    movss   4(%rsp), %xmm2          # 4-byte Reload
    addss   %xmm2, %xmm1
    addss   %xmm0, %xmm1
    addss   %xmm2, %xmm1
    cvttss2si   %xmm1, %eax
    popq    %rdx
    ret

Adding support for sincos, sincosf, sincosl might boost performance of trig-
heavy numerical codes, i.e. codes involving a lot of coordinate
transformations, i.e. dynamic equations for robotic systems.

Apologies in advance if I'm submitting this to the wrong component; I'm very
much just an end user.  Thanks!

Quuxplusone commented 12 years ago

I think this optimization requires -ffast-math, please correct me if I'm wrong.

Quuxplusone commented 12 years ago

Thanks for the reply Duncan!  I tried, and -ffast-math doesn't result in a call
to sincos either.

$ clang -O3 -ffast-math -S main.c && cat main.s

.Ltmp5:
        .cfi_def_cfa_register %rbp
        subq    $16, %rsp
        movq    (%rsi), %rax
        movss   (%rax), %xmm0
        movss   %xmm0, -8(%rbp)         # 4-byte Spill
        callq   cosf
        movss   %xmm0, -4(%rbp)         # 4-byte Spill
        movss   -8(%rbp), %xmm0         # 4-byte Reload
        callq   sinf
        cvttss2si       %xmm0, %eax
        cvtsi2ss        %eax, %xmm1
        movss   -4(%rbp), %xmm2         # 4-byte Reload
        addss   %xmm2, %xmm1
        addss   %xmm0, %xmm1
        addss   %xmm2, %xmm1
        cvttss2si       %xmm1, %eax
        addq    $16, %rsp
        popq    %rbp
        ret

Quuxplusone commented 12 years ago

I wonder if sincos would be significantly faster than sin and cos even with recent fpu...

I supposed sincos would be 20th fossil when vector bisecting (eg. CORDIC) was faster than multiplier.

Quuxplusone commented 11 years ago

rdar://12856873

Quuxplusone commented 11 years ago

I've implemented the optimization in sdisel. See r173755.

I've only enabled the optimization for Mac OS X (and x86 only for now).  It's
pretty easy to enable it for other targets by teaching
X86Subtarget::hasSinCos() to return true for the target.

        movq    (%rsi), %rax
        movss   (%rax), %xmm0
        callq   ___sincosf_stret
        cvttss2si       %xmm0, %eax
        cvtsi2ssl       %eax, %xmm2
        addss   %xmm1, %xmm2
        addss   %xmm0, %xmm2
        addss   %xmm1, %xmm2
        cvttss2si       %xmm2, %eax
        popq    %rbp
        ret

One potential gotcha is on Mac OS X, the sincos library routine always produces
the same results as separate sin and cos calls. So I haven't implemented checks
for "fast math".

Quuxplusone commented 11 years ago

I made the following change to generate sincos on linux:

Index: X86Subtarget.cpp
===================================================================
--- X86Subtarget.cpp    (revision 173806)
+++ X86Subtarget.cpp    (working copy)
@@ -156,8 +156,12 @@
 }

 bool X86Subtarget::hasSinCos() const {
-  return getTargetTriple().isMacOSX() &&
-    !getTargetTriple().isMacOSXVersionLT(10, 9);
+  if (getTargetTriple().isMacOSX() &&
+      !getTargetTriple().isMacOSXVersionLT(10, 9))
+    return true;
+  if (getTargetTriple().getOS() == Triple::Linux)
+    return true;
+  return false;
 }

 /// IsLegalToCallImmediateAddr - Return true if the subtarget allows calls

Here are the results on smallpt. It's a small improvement.

clang++ -o foo smallpt.cpp -O3 -ffast-math -D__extern_always_inline=inline

$ time ./foo 16
Rendering (16 spp)

real    0m25.498s
user    0m25.410s
sys 0m0.012s

$ time ./foo 16
Rendering (16 spp)

real    0m23.459s
user    0m23.381s
sys 0m0.008s

Quuxplusone commented 11 years ago

Forgot to mention the first run is without the patch.

Quuxplusone commented 10 years ago

I ran into this issue because a discrete FFT compiled with Clang 3.3 (-O3 -ffast-math) turned out 3.5x times slower than GCC 4.8 and 8x slower than ICC 14.0.1. After profiling and replacing two consecutive sin(float) and cos(float) calls with sincosf, Clang performed 1.3x faster than GCC and only 1.8x slower than ICC (which regressed by 1.75x) was.

First of all, I'm using Clang 3.3 on Debian Linux (unstable).

Looking into this, GCC converts the sin/cos calls into a single sincosf as early as -O1, even without -ffast-math. On -O0, sin/cos are called separately, with libm then callinb __sin_avx and __cos_avx respectively.

Clang 3.3, without above patch, combines sin/cos as soon as -O1, but only if -ffast-math is set. However, it seems to lower to sincos rather than sincosf, resulting in libm still calling sin and cos separately (hence gaining no performance at all)!

Related, it seems like the X86Subtarget::hasSinCos functionality is broken/unused: not applying the above patch, or even reducing the entire function to "return false" still resulted in a call to sincos... Maybe this is why the calls are lowered to sincos rather than sincosf, which sdisel seems to support?

I also had a look at ICC: on -O3 (it doesn't recognize -ffast-math) it lowers sin/cos to a custom SVML sincosf implementation, on -O2 it lowers to a custom SSE2 sincosf implementation and on -O1 and below it uses libm sin/cos calls.

Quuxplusone / LLVMBugzillaTest

Optimization pass to combine sin, cos library calls to sincos #13281