chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org

Compiler misses constant folding optimization #21229

Open milthorpe opened 1 year ago

milthorpe commented 1 year ago

Summary of Problem

The Chapel compiler fails to perform a constant-folding optimization that would eliminate a high-strength (division) operation inside a loop, whereas clang performs this optimization on the equivalent C code.

Why does this matter?

The original example comes from the miniBUDE proxy application, where a four-way conditional expression determines which of a set of constants is assigned to a given variable, whose reciprocal is then computed. As the reciprocal of a constant is itself a constant, clang performs compile-time constant folding to eliminate the division operation. For some reason, the LLVM backend to the Chapel compiler does not perform the same optimization, which results in a ~75% slowdown for miniBUDE on a typical Intel CPU. (In other words, if the constant folding is performed by hand -- explicitly writing each of the reciprocals as a separate Chapel constant -- the run time for miniBUDE drops by ~40%.)

Steps to Reproduce

A simplified example:

Source Code:

Chapel version:

param A: real(32) = 1.0;
param B: real(32) = 20.0;
param C: real(32) = 25.6;
param D: real(32) = 24.0;
var nums: [0..<N] int;
var sum: real(32) = 0.0;
for i in 0..<N {
    const x: real(32) = 
        if nums[i] % 3 == 0 
        then (if nums[i] % 5 == 0 then A else B)
        else (if nums[i] % 4 == 0 then C else D);

    const rx: real(32) = 1.0 / x;
    sum += rx;
}

C version:

#define A 1.0f
#define B 20.0f
#define C 25.6f
#define D 24.0f
int nums[N];
float sum = 0;
for (int i = 0; i < N; i ++) {
    const float x = nums[i] % 3 == 0 ?
        (nums[i] % 5 == 0 ? A : B) :
        (nums[i] % 4 == 0 ? C : D);
    const float rx = 1.0f / x;
    sum += rx;
}

Compile command:

 chpl --fast --no-ieee-float loop_division.chpl
 clang -Ofast -ffast-math loop_division.c 

With these compiler options on an x86_64 system, the Chapel program contains machine instructions like:

  43da10:   c5 fa 5e 0e             vdivss (%rsi),%xmm0,%xmm1
  43da14:   c5 f2 58 d2             vaddss %xmm2,%xmm1,%xmm2

whereas the C program contains no divisions, but only simple additions like:

  4013b4:   f3 0f 58 c2             addss  %xmm2,%xmm0

Configuration Information

cassella commented 1 year ago

Just a thought, does it help to define those four constants as param instead of const?

milthorpe commented 1 year ago

It doesn't, but you're right that they should be param because they're known at compile time. I've updated the code snippet accordingly.

bradcray commented 1 year ago

@milthorpe : Sorry for the lack of response to this sooner. December was a lost month for me, and I'm still digging out of the rubble.

To make sure I'm understanding, was the transformation that you manually applied to get the performance boost essentially to rewrite the loop as:

for i in 0..<N {
    const rx: real(32) = 
        if nums[i] % 3 == 0 
        then (if nums[i] % 5 == 0 then 1.0 / A else 1.0 / B)
        else (if nums[i] % 4 == 0 then 1.0 / C else 1.0 / D);

    sum += rx;
}

?

As a guess as to why this is happening, I'd imagine it's because the Chapel compiler currently normalizes conditional expressions prior to code generation in a way that doesn't preserve their compact, expression-like behavior; and that this makes the code sufficiently less clear to LLVM that the optimization for the C code fails to get applied. For example, the generated C code for the conditional above contains conditional statements rather than C ternary expressions:

    if (((int64_t)((*(call_tmp_chpl44) % INT64(3)))) == INT64(0)) {
      coerce_tmp_chpl5 = (&nums_chpl)->_instance;
      coerce_tmp_chpl6 = (coerce_tmp_chpl5)->shiftedData;
      call_tmp_chpl45 = (coerce_tmp_chpl6 + i_chpl);
      if (((int64_t)((*(call_tmp_chpl45) % INT64(5)))) == INT64(0)) {
        tmp_chpl18 = REAL32(0x1p+0);
      } else {
        tmp_chpl18 = REAL32(0x1.4p+4);
      }
      tmp_chpl17 = tmp_chpl18;
    } else {
      coerce_tmp_chpl7 = (&nums_chpl)->_instance;
      coerce_tmp_chpl8 = (coerce_tmp_chpl7)->shiftedData;
      call_tmp_chpl46 = (coerce_tmp_chpl8 + i_chpl);
      if (((int64_t)((*(call_tmp_chpl46) % INT64(4)))) == INT64(0)) {
        tmp_chpl19 = REAL32(0x1.99999ap+4);
      } else {
        tmp_chpl19 = REAL32(0x1.8p+4);
      }
      tmp_chpl17 = tmp_chpl19;
    }
    call_tmp_chpl47 = (((_real32)(REAL64(0x1p+0))) / tmp_chpl17);

So, note specifically that tmp_chpl17 now has two definition points rather than one. I don't think there's any inherent reason for this other than that we haven't focused as much in the compiler on clear translation of scalar code as we have on getting the distributed memory and parallel features correct and performing/scaling well.

It would definitely be nice to get to a point where we weren't losing scalar code optimization opportunities from the back-end like this, and I hope we will be able to do so under our current compiler refactoring / rewriting / rearchitecting effort.

milthorpe commented 1 year ago

Yes, we applied the optimization per your code above. In the full code, we need both the value x and its inverse, so we actually ended up with two conditional expressions of the same form.

e-kayrakli commented 5 months ago

@bradcray -- just a couple of clarifications:

As a guess as to why this is happening, I'd imagine it's because the Chapel compiler currently normalizes conditional expressions prior to code generation in a way that doesn't preserve their compact, expression-like behavior;

I agree with this. But the way we normalize conditional expressions is by turning them into statements, which means we'd also suffer the same performance hit in user-written conditional statements. It also means that if we were to write an optimization for constant folding/propagation across conditional statements, then it would address the hit for both expressions and statements. Does that align with what you're thinking? IOW, we shouldn't focus on conditional expressions for an optimization like this?

It would definitely be nice to get to a point where we weren't losing scalar code optimization opportunities from the back-end like this, and I hope we will be able to do so under our current compiler refactoring / rewriting / rearchitecting effort.

As always, I am a bit (naively?) more optimistic than this. If I were to guess, I would say that the array metadata is what's confusing the backend here. Hoisting that metadata out of the loop should make the backend's life easier when optimizing that conditional. We have some recent work on the GPU support side to strengthen LICM. If we can generalize that to non-GPU code and add support for the reduce intent on foreach loops, maybe something like this could be optimized sooner:

foreach i in 0..<N with (+ reduce sum) {
    const x: real(32) = 
        if nums[i] % 3 == 0 
        then (if nums[i] % 5 == 0 then A else B)
        else (if nums[i] % 4 == 0 then C else D);

    const rx: real(32) = 1.0 / x;
    sum += rx;
}

Note that the more aggressive LICM optimization fires only for order-independent loops, which is why this would have to be written as a foreach. A next-level LICM could probably detect that there are no array modifications in the loop body and hoist the metadata even for regular for loops, but that's probably further in the future.

bradcray commented 5 months ago

Hi Engin —

Does that align with what you're thinking? IOW, we shouldn't focus on conditional expressions for an optimization like this?

I hadn't considered that, but can definitely get on-board with it if you think it's tractable and the way to go.

jabraham17 commented 1 week ago

I have been looking into this a little bit and have some interesting data to share.

I have created https://github.com/chapel-lang/chapel/pull/25362, which is what I used to collect all of this performance information, and I plan to merge it as a nightly performance test.

I ran with both LLVM 15 and LLVM 17. With LLVM 15, I replicated the performance regression from this issue. Compiling with -smanualConstFold=false was much slower than the C version; compiling with -smanualConstFold=true brought them to parity. But I don't know that it's because the C version had the constant-fold optimization: when I inspect the LLVM IR of the C version, I see no evidence of the constant fold. And if I apply the manual constant fold to the C version, there is a speedup as well.

I then ran with LLVM 17 and saw the Chapel version meet/beat the C version, even with -smanualConstFold=false. There is a definite performance boost from using -smanualConstFold=true, but it's not required to meet/beat the C version.

While doing all of this, I noticed two additional performance peculiarities that I want to bring attention to.

  1. Running with --initArray=true causes a major slowdown in the Chapel version, while it has no effect on the C version. I attribute this to a difference in caching, which I confirmed by running with/without --initArray under perf stat -e cache-references,cache-misses,cycles. When the array is initialized before the kernel, there is a 1.2x increase in the number of cache misses. The C version does not have this issue; its cache-miss percentage is the same in either case. I would attribute this to Chapel's array implementation/indirection, but I can replicate the performance degradation using c_array, so I am unsure of the cause. I am not terribly worried about this, as the caching issue is going to be very machine-dependent and may not even be reproducible outside of this microkernel.

  2. There is a slight difference for this microkernel when compiling with --specialize vs --no-specialize. I noticed this on the AMD CPU I was running on and suspect it's machine-specific. It is reproducible in the C version, and I consider it an LLVM bug; I've opened https://github.com/llvm/llvm-project/issues/96701 for it.

In summary, with newer LLVM versions the Chapel and C versions seem to have parity (barring the caching and --specialize issues listed above). That said, the constant folding gives a slight edge and may warrant a compiler optimization as suggested above.

milthorpe commented 1 week ago

@jabraham17 thanks for creating this performance test! My observations are from Skylake (the platform on which I originally observed the problem), with Chapel 2.0 and LLVM 17.0.1, compiling with chpl --fast --no-ieee-float -sprintTime=true.

jabraham17 commented 1 week ago

However, I observe an orders-of-magnitude difference in branch-misses, from 0.01% with --initArray=false to 17.77% with --initArray=true. I guess that this is due to the kernel loop always following the same branch for all zero values vs. different branches for each initialized value.

I can also replicate this on my system (which also explains why c_array had no perf improvement). Interesting; I wonder why the C version does not have this issue.

With -smanualConstFold, Chapel is even worse at 43s!

This is surprising to me, though maybe not that surprising given that this is a small kernel that is going to be very microarchitecture-sensitive.

--no-specialize slows down Chapel (-smanualConstFold=false) to 63.4s

I think this confirms my theory that the difference I saw with --specialize vs --no-specialize is an AMD-specific issue. --specialize, when working properly, should give performance improvements.

With Chapel 2.0 + LLVM 17.0.1, miniBUDE no longer exhibits any improvement from manually applying the constant folding

Does this resolve the original ~75% slowdown?

milthorpe commented 1 week ago

Sorry - my last observation about miniBUDE was incorrect, as I had forgotten another CPU-vs-GPU optimization that is required. The default version of miniBUDE uses for param for the innermost loops ('poses per work-item'), as this performs best on GPU. However, for CPU, performance is best using foreach. When foreach is used, I still see the 75% performance improvement from manual constant folding. Here's the performance (GFLOP/s) of miniBUDE on Skylake for the last two Chapel versions:

| miniBUDE inner loop implementation | Chapel 2.0 + LLVM 17.0.1 | Chapel 1.33 + LLVM 15.0.7 |
|---|---|---|
| for param | 420 | 468 |
| for param + constant folding | 414 | 482 |
| foreach | 498 | 488 |
| foreach + constant folding | 878 | 823 |
milthorpe commented 1 week ago

Focusing on foreach with or without constant folding, here's miniBUDE performance in GFLOP/s on three different Intel CPUs:

| platform | foreach | foreach + constant folding |
|---|---|---|
| Skylake Xeon Gold 6134 | 498 | 878 |
| Cascade Lake Xeon Platinum 8268 | 1573 | 2887 |
| Sapphire Rapids Xeon Platinum 8470Q | 3756 | 6460 |