[FastMath] Failure to fold 1 / powf(x,y) -> powf(x, -y)

Quuxplusone commented 3 years ago


Bugzilla Link	PR49147
Status	RESOLVED FIXED
Importance	P enhancement
Reported by	Simon Pilgrim (llvm-dev@redking.me.uk)
Reported on	2021-02-11 06:30:21 -0800
Last modified on	2021-02-27 02:54:47 -0800
Version	trunk
Hardware	PC Windows NT
CC	llvm-bugs@lists.llvm.org, spatel+llvm@rotateright.com
Fixed by commit(s)
Attachments
Blocks
Blocked by
See also

#include <cmath>

float invpow_i(float x, int y) {
    return 1.0f / __builtin_powf(x, y);
}
float invpow_f(float x, float y) {
    return 1.0f / __builtin_powf(x, y);
}

https://simd.godbolt.org/z/YxaW96

Clang fails to merge the fdiv into the pow by negating the exponent arg

Quuxplusone commented 3 years ago

It should be easy enough to get these in instcombine.

There's an open question about exactly which FMF should be matched. Candidates
are any/all of:
reassoc
arcp
afn

https://llvm.org/docs/LangRef.html#fast-math-flags

In practical terms, once the programmer allows "reassoc", they should be
prepared for anything, but we try to be safer on most fdiv transforms and also
check "arcp".

Quuxplusone commented 3 years ago

https://reviews.llvm.org/D96648

I left "powi" off of that because I'm not sure how we should deal with the case where the exponent is 0x8000000 (signed min i32). gcc doesn't seem to care about that possibility?

Quuxplusone commented 3 years ago

(In reply to Sanjay Patel from comment #2)
> https://reviews.llvm.org/D96648
>
> I left "powi" off of that because I'm not sure how we should deal with the
> case where the exponent is 0x8000000 (signed min i32). gcc doesn't seem to
> care about that possibility?

Looking at the code where these snippets came from, technically we could use
valuetracking to check the min/max bounds of the int - although I think more
likely we'll end up having to convert to a float and negate that, which lose us
any perf benefit.

Quuxplusone commented 3 years ago

(In reply to Simon Pilgrim from comment #3)
> (In reply to Sanjay Patel from comment #2)
> > https://reviews.llvm.org/D96648
> >
> > I left "powi" off of that because I'm not sure how we should deal with the
> > case where the exponent is 0x8000000 (signed min i32). gcc doesn't seem to
> > care about that possibility?
>
> Looking at the code where these snippets came from, technically we could use
> valuetracking to check the min/max bounds of the int - although I think more
> likely we'll end up having to convert to a float and negate that, which lose
> us any perf benefit.

Make sure I'm seeing it correctly - in the general case, we need to decide
which of these has better perf:
        callq   __powisf2@PLT
        divss   %xmm0, %xmm1

Or:
        pxor    %xmm1, %xmm1
        cvtsi2ssl       %edi, %xmm1
        xorps   .LC0(%rip), %xmm1
        jmp     powf

So we have to know if the divss costs more than the presumed savings of the
"powi" call vs. the "powf" call.

In IR (instcombine), we could say that getting rid of the fdiv is worth adding
2 extra instructions (fneg + sitofp).

I don't know where in the gcc pipeline this is implemented, but that's what
they seem to have done:
https://simd.godbolt.org/z/WjWrv1

Quuxplusone commented 3 years ago

Interestingly, I'm not sure compiler-rt accounts for MIN_INT in powi:

https://github.com/llvm/llvm-project/blob/main/compiler-rt/lib/builtins/powisf2.c

Quuxplusone commented 3 years ago

powi() isn't part of the standard math lib, so anything goes?

https://gcc.gnu.org/onlinedocs/gcc-10.2.0/gcc/Other-Builtins.html#Other-Builtins

— Built-in Function: double __builtin_powi (double, int)

    Returns the first argument raised to the power of the second. Unlike the pow function no guarantees about precision and rounding are made.

Quuxplusone commented 3 years ago

We probably want to add at least these 2 related folds for exp() and exp2()?

diff --git a/llvm/lib/Transforms/InstCombine/InstCombineMulDivRem.cpp
b/llvm/lib/Transforms/InstCombine/InstCombineMulDivRem.cpp
index 9bc566ed3523..702f76da2774 100644
--- a/llvm/lib/Transforms/InstCombine/InstCombineMulDivRem.cpp
+++ b/llvm/lib/Transforms/InstCombine/InstCombineMulDivRem.cpp
@@ -1326,8 +1326,9 @@ Instruction *InstCombinerImpl::visitFDiv(BinaryOperator
&I) {
     if (match(Op1, m_FDiv(m_SpecificFP(1.0), m_Value(Y))))
       return BinaryOperator::CreateFMulFMF(Y, Op0, &I);

-    // Negate the exponent of pow to fold division-by-pow() into multiply:
+    // Negate the exponent of pow/exp to fold division-by-pow() into multiply:
     // Z / pow(X, Y) --> Z * pow(X, -Y)
+    // Z / exp{2}(Y) --> Z * exp{2}(-Y)
     // In the general case, this creates an extra instruction, but fmul allows
     // for better canonicalization and optimization than fdiv.
     if (match(Op1,
@@ -1336,6 +1337,16 @@ Instruction *InstCombinerImpl::visitFDiv(BinaryOperator
&I) {
       Value *Pow = Builder.CreateBinaryIntrinsic(Intrinsic::pow, X, NegY, &I);
       return BinaryOperator::CreateFMulFMF(Op0, Pow, &I);
     }
+    if (match(Op1, m_OneUse(m_Intrinsic<Intrinsic::exp>(m_Value(Y))))) {
+      Value *NegY = Builder.CreateFNegFMF(Y, &I);
+      Value *Pow = Builder.CreateUnaryIntrinsic(Intrinsic::exp, NegY, &I);
+      return BinaryOperator::CreateFMulFMF(Op0, Pow, &I);
+    }
+    if (match(Op1, m_OneUse(m_Intrinsic<Intrinsic::exp2>(m_Value(Y))))) {
+      Value *NegY = Builder.CreateFNegFMF(Y, &I);
+      Value *Pow = Builder.CreateUnaryIntrinsic(Intrinsic::exp2, NegY, &I);
+      return BinaryOperator::CreateFMulFMF(Op0, Pow, &I);
+    }
   }

   if (I.hasAllowReassoc() && Op0->hasOneUse() && Op1->hasOneUse()) {

Quuxplusone commented 3 years ago

(In reply to Sanjay Patel from comment #7)
> We probably want to add at least these 2 related folds for exp() and exp2()?

gcc has those, so I'll add some tests and push that:
https://simd.godbolt.org/z/8KMGTf

Quuxplusone commented 3 years ago

Adding the exp/exp2 cases would be awesome - thanks.

It would be useful to support powi if at all possible, given the vagueness of the builtin - maybe if we just add something to the langref 'powi(x, INT_MIN) is undefined' and we always do the negation?

Quuxplusone commented 3 years ago

(In reply to Simon Pilgrim from comment #9)
> Adding the exp/exp2 cases would be awesome - thanks.

Added with:
https://reviews.llvm.org/rGe772618f1ee2

> It would be useful to support powi if at all possible, given the vagueness
> of the builtin - maybe if we just add something to the langref 'powi(x,
> INT_MIN) is undefined' and we always do the negation?

I'm still not clear on when it is optimal to convert powi to powf. I haven't
tried to benchmark it yet, but I thought powi is (much) faster given that it's
just a loop of fmul. Is there a larger example where we see powf as the winner?

Quuxplusone commented 3 years ago

(In reply to Sanjay Patel from comment #10)
> (In reply to Simon Pilgrim from comment #9)
> > Adding the exp/exp2 cases would be awesome - thanks.
>
> Added with:
> https://reviews.llvm.org/rGe772618f1ee2
>
> > It would be useful to support powi if at all possible, given the vagueness
> > of the builtin - maybe if we just add something to the langref 'powi(x,
> > INT_MIN) is undefined' and we always do the negation?
>
> I'm still not clear on when it is optimal to convert powi to powf. I haven't
> tried to benchmark it yet, but I thought powi is (much) faster given that
> it's just a loop of fmul. Is there a larger example where we see powf as the
> winner?

Whats still stopping us just mapping 1/powi(x,i) -> powi(x,-i) ?

Quuxplusone commented 3 years ago

(In reply to Simon Pilgrim from comment #11)
> (In reply to Sanjay Patel from comment #10)
> > (In reply to Simon Pilgrim from comment #9)
> > > Adding the exp/exp2 cases would be awesome - thanks.
> >
> > Added with:
> > https://reviews.llvm.org/rGe772618f1ee2
> >
> > > It would be useful to support powi if at all possible, given the vagueness
> > > of the builtin - maybe if we just add something to the langref 'powi(x,
> > > INT_MIN) is undefined' and we always do the negation?
> >
> > I'm still not clear on when it is optimal to convert powi to powf. I haven't
> > tried to benchmark it yet, but I thought powi is (much) faster given that
> > it's just a loop of fmul. Is there a larger example where we see powf as the
> > winner?
>
> Whats still stopping us just mapping 1/powi(x,i) -> powi(x,-i) ?

I think that case is fine except for the potential INT_MIN corner case. Note
that gcc is not getting that transform in the example here. So I was confused
in my earlier comments - we're never going to convert powi to powf; gcc is
missing an opportunity to convert powf into powi.

I don't know how to prove this, but I wonder if we could say that for any
supported LLVM FP types powi(X, 0x80000000) is either 0.0, 1.0, or Inf (and so
the reciprocal is Inf, 1.0, or 0.0).

If that holds, it's safe to do the transform with -ffast-math because that
implies 'ninf'?

Quuxplusone commented 3 years ago

I think the combined disclaimers on powi and FMF allow this:
https://reviews.llvm.org/rGa7cee55762c6

So that's the last item for this bug on my list.

Quuxplusone commented 3 years ago

Awesome - thanks @spatel!

Quuxplusone / LLVMBugzillaTest

[FastMath] Failure to fold 1 / powf(x,y) -> powf(x, -y) #48116