Quuxplusone / LLVMBugzillaTest

0 stars 0 forks source link

Missed optimization in math expression: sqrt(sqrt(a)) == pow(a, 1/4) #34573

Open Quuxplusone opened 6 years ago

Quuxplusone commented 6 years ago
Bugzilla Link PR35600
Status NEW
Importance P enhancement
Reported by Alexander Zaitsev (zamazan4ik@tut.by)
Reported on 2017-12-10 06:52:19 -0800
Last modified on 2018-07-31 09:27:58 -0700
Version trunk
Hardware PC All
CC hfinkel@anl.gov, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, spatel+llvm@rotateright.com
Fixed by commit(s)
Attachments
Blocks PR35611
Blocked by
See also
clang(trunk) with '--std=c++17 -O3 -march=native -ffast-math' flags for this
code:

#include <cmath>

double test(double a)
{
    return sqrt(sqrt(a));
}

generates this assembly:

test(double): # @test(double)
  vsqrtsd xmm0, xmm0, xmm0
  vsqrtsd xmm0, xmm0, xmm0
  ret

But there is formula: sqrt(sqrt(a)) == pow(a, 1/4). And it can be compiled in
faster way.
Quuxplusone commented 6 years ago

Is calling pow actually faster? I'd think that the optimization here might actually go the other way?

Quuxplusone commented 6 years ago
if you will check this code:

#include <cmath>

double test(double a)
{
    return sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(a)))))));
}

clang(trunk) with '--std=c++17 -O3 -march=native -ffast-math' generates here:

test(double): # @test(double)
  vsqrtsd xmm0, xmm0, xmm0
  vsqrtsd xmm0, xmm0, xmm0
  vsqrtsd xmm0, xmm0, xmm0
  vsqrtsd xmm0, xmm0, xmm0
  vsqrtsd xmm0, xmm0, xmm0
  vsqrtsd xmm0, xmm0, xmm0
  vsqrtsd xmm0, xmm0, xmm0
  ret

but gcc(trunk) with '--std=c++17 -O3 -march=native -ffast-math' generates:

test(double):
        vandpd  xmm0, xmm0, XMMWORD PTR .LC1[rip]
        vmovsd  xmm1, QWORD PTR .LC0[rip]
        jmp     __pow_finite
.LC0:
        .long   0
        .long   1065353216
.LC1:
        .long   4294967295
        .long   2147483647
        .long   0
        .long   0

And here i am sure that pow is faster than chain of sqrt functions.
Quuxplusone commented 6 years ago
Yes, many sqrts might be slower - but the 2*sqrt case is very likely to be
quicker.

Some GPUs might be able to do this with a dedicated pow instruction, and a soft
pow implementation might avoid a bottleneck on a sqrt unit but trying to
compare costs of hw instructions AND libm implementations is likely to be a
nightmare.
Quuxplusone commented 6 years ago

See a lot of optimizations here: https://github.com/gcc-mirror/gcc/blob/07b69d3f1cd3dd8ebb0af1fbff95914daee477d2/gcc/match.pd

Quuxplusone commented 6 years ago
There's a proposal to convert pow(x, 0.25) to sqrt(sqrt(x)) here:
https://reviews.llvm.org/D49306