Open milthorpe opened 1 year ago
Just a thought, does it help to define those four constants as `param` instead of `const`?
It doesn't, but you're right that they should be `param` because they're known at compile time. I've updated the code snippet accordingly.
@milthorpe : Sorry for the lack of response to this sooner. December was a lost month for me, and I'm still digging out of the rubble.
To make sure I'm understanding, was the transformation that you manually applied to get the performance boost essentially to rewrite the loop as:
```chapel
for i in 0..<N {
  const rx: real(32) =
    if nums[i] % 3 == 0
      then (if nums[i] % 5 == 0 then 1.0 / A else 1.0 / B)
      else (if nums[i] % 4 == 0 then 1.0 / C else 1.0 / D);
  sum += rx;
}
```

?
As a guess as to why this is happening, I'd imagine it's because the Chapel compiler currently normalizes conditional expressions prior to code generation in a way that doesn't preserve their compact, expression-like behavior; and that this makes the code sufficiently less clear to LLVM that the optimization for the C code fails to get applied. For example, the generated C code for the conditional above contains conditional statements rather than C ternary expressions:
```c
if (((int64_t)((*(call_tmp_chpl44) % INT64(3)))) == INT64(0)) {
  coerce_tmp_chpl5 = (&nums_chpl)->_instance;
  coerce_tmp_chpl6 = (coerce_tmp_chpl5)->shiftedData;
  call_tmp_chpl45 = (coerce_tmp_chpl6 + i_chpl);
  if (((int64_t)((*(call_tmp_chpl45) % INT64(5)))) == INT64(0)) {
    tmp_chpl18 = REAL32(0x1p+0);
  } else {
    tmp_chpl18 = REAL32(0x1.4p+4);
  }
  tmp_chpl17 = tmp_chpl18;
} else {
  coerce_tmp_chpl7 = (&nums_chpl)->_instance;
  coerce_tmp_chpl8 = (coerce_tmp_chpl7)->shiftedData;
  call_tmp_chpl46 = (coerce_tmp_chpl8 + i_chpl);
  if (((int64_t)((*(call_tmp_chpl46) % INT64(4)))) == INT64(0)) {
    tmp_chpl19 = REAL32(0x1.99999ap+4);
  } else {
    tmp_chpl19 = REAL32(0x1.8p+4);
  }
  tmp_chpl17 = tmp_chpl19;
}
call_tmp_chpl47 = (((_real32)(REAL64(0x1p+0))) / tmp_chpl17);
```
So, note specifically that `tmp_chpl17` now has two definition points rather than one. I don't think there's any inherent reason for this other than that we haven't focused as much in the compiler on clear translation of scalar code as we have on getting the distributed-memory and parallel features correct and performing/scaling well.
It would definitely be nice to get to a point where we weren't losing scalar code optimization opportunities from the back-end like this, and I hope we will be able to do so under our current compiler refactoring / rewriting / rearchitecting effort.
Yes, we applied the optimization per your code above. In the full code, we need both the value `x` and its inverse, so we actually ended up with two conditional expressions of the same form.
@bradcray -- just a couple of clarifications:
> As a guess as to why this is happening, I'd imagine it's because the Chapel compiler currently normalizes conditional expressions prior to code generation in a way that doesn't preserve their compact, expression-like behavior;
I agree with this. But the way we normalize conditional expressions is by turning them into statements, which means we'd also suffer the same performance hit in user-written conditional statements. It also means that if we were to write an optimization for constant folding/propagation across conditional statements, then that would address the hit on both expressions and statements. Does that align with what you're thinking? IOW, we shouldn't focus on conditional expressions for an optimization like this?
> It would definitely be nice to get to a point where we weren't losing scalar code optimization opportunities from the back-end like this, and I hope we will be able to do so under our current compiler refactoring / rewriting / rearchitecting effort.
As always, I am a bit (naively?) more optimistic than this. If I were to guess, I would say that the array metadata is what's confusing the backend here. Hoisting that metadata out of the loop should make the backend's life easier when optimizing that conditional. We have some recent work on the GPU support side to strengthen LICM. If we can generalize that to non-GPU code and add support for `reduce` intents on `foreach` loops, maybe something like this could be optimized sooner:
```chapel
foreach i in 0..<N with (+ reduce sum) {
  const x: real(32) =
    if nums[i] % 3 == 0
      then (if nums[i] % 5 == 0 then A else B)
      else (if nums[i] % 4 == 0 then C else D);
  const rx: real(32) = 1.0 / x;
  sum += rx;
}
```
Note that the more aggressive LICM optimization fires only for order-independent loops, which is why this will have to be written as a `foreach`. A next-level LICM could probably detect that there are no array modifications in the loop body and hoist the metadata even with `for` loops, but that's probably further in the future.
Hi Engin —
> Does that align with what you're thinking? IOW, we shouldn't focus on conditional expressions for an optimization like this?
I hadn't considered that, but can definitely get on-board with it if you think it's tractable and the way to go.
I have been looking into this a little bit and have some interesting data to share.
I have created https://github.com/chapel-lang/chapel/pull/25362 which is what I used to collect all of this performance information, and I plan to merge this as a nightly performance test.
I ran with both LLVM 15 and LLVM 17. With LLVM 15, I replicated the performance regression from this PR: compiling with `-smanualConstFold=false` was much slower than the C version, while compiling with `-smanualConstFold=true` brought them to parity. But I don't know that it's because the C version had the constant fold optimization. When I inspect the LLVM IR of the C version, I see no evidence of the constant fold optimization, and if I do the manual constant fold to the C version, there is a speedup as well.
I then ran with LLVM 17, and saw the Chapel version meet/beat the C version, even with `-smanualConstFold=false`. There is a definite performance boost to using `-smanualConstFold=true`, but it's not required to meet/beat the C version.
While doing all of this, I noticed two extra performance peculiarities that I want to bring attention to.
Running with `--initArray=true` causes a major slowdown in the Chapel version, while it has no effect on the C version. I attribute this to a difference in caching, which I confirmed by running with/without `--initArray` under `perf stat -e cache-references,cache-misses,cycles`. When the array is initialized before the kernel, there is a 1.2x increase in the number of cache misses. The C version does not have this issue; its cache-miss percentage is the same in either case. I would attribute this to Chapel's array implementation/indirection, but I can replicate the performance degradation using `c_array`, so I am unsure of the cause. I am not terribly worried about this, as this caching issue is going to be very machine dependent and may not even be reproducible outside of this microkernel.
There is a slight difference for this microkernel when compiling with `--specialize` vs `--no-specialize`. I noticed this on the AMD CPU I was running on and suspect it's machine specific. It is reproducible in the C version and I consider it an LLVM bug; I've opened https://github.com/llvm/llvm-project/issues/96701 for it.
In summary, with newer LLVM versions the Chapel and C versions seem to have parity (barring the caching and `--specialize` issues listed above). That said, the constant folding gives a slight edge and may warrant a compiler optimization as suggested above.
@jabraham17 thanks for creating this performance test! My observations on Skylake (the platform on which I originally observed the problem) with Chapel 2.0 and LLVM 17.0.1:

- Compiling with `chpl --fast --no-ieee-float -sprintTime=true -sinitArray=false`, Chapel improves to 14s (with no change for C).
- There is no significant difference in cache-misses with or without array initialization. However, I observe an orders-of-magnitude difference in branch-misses, from 0.01% with `--initArray=false` to 17.77% with `--initArray=true`. I guess that this is due to the kernel loop always following the same branch for all zero values vs. different branches for each initialized value.
- With `-smanualConstFold`, Chapel is even worse at 43s!
- `--no-specialize` slows down Chapel (`-smanualConstFold=false`) to 63.4s.

> However, I observe an orders-of-magnitude difference in branch-misses, from 0.01% with `--initArray=false` to 17.77% with `--initArray=true`. I guess that this is due to the kernel loop always following the same branch for all zero values vs. different branches for each initialized value.

I can also replicate this on my system (which also explains why `c_array` had no perf improvement). Interesting, I wonder why the C version does not have this issue.
> With `-smanualConstFold`, Chapel is even worse at 43s!
This is surprising to me, but maybe not that surprising given that this is a small kernel and is going to be very microarchitecture sensitive.
> `--no-specialize` slows down Chapel (`-smanualConstFold=false`) to 63.4s

I think this confirms my theory that the difference I saw with `--specialize` vs `--no-specialize` is an AMD-specific issue. `--specialize`, when working properly, should give performance improvements.
> With Chapel 2.0 + LLVM 17.0.1, miniBUDE no longer exhibits any improvement from manually applying the constant folding
Does this resolve the original ~75% slowdown?
Sorry - my last observation about miniBUDE was incorrect, as I had forgotten another CPU-vs-GPU optimization that is required. The default version of miniBUDE uses `for param` for the innermost loops ('poses per work-item'), as this performs best on GPU. However, for CPU, performance is best using `foreach`. When `foreach` is used, I still see the 75% performance improvement from manual constant folding. Here's the performance (GFLOP/s) of miniBUDE on Skylake for the last two Chapel versions:

| miniBUDE inner loop implementation | Chapel 2.0 + LLVM 17.0.1 | Chapel 1.33 + LLVM 15.0.7 |
|---|---|---|
| `for param` | 420 | 468 |
| `for param` + constant folding | 414 | 482 |
| `foreach` | 498 | 488 |
| `foreach` + constant folding | 878 | 823 |
Focusing on `foreach` with or without constant folding, here's miniBUDE performance in GFLOP/s on three different Intel CPUs:

| platform | `foreach` | `foreach` + constant folding |
|---|---|---|
| Skylake Xeon Gold 6134 | 498 | 878 |
| Cascade Lake Xeon Platinum 8268 | 1573 | 2887 |
| Sapphire Rapids Xeon Platinum 8470Q | 3756 | 6460 |
Summary of Problem
The Chapel compiler fails to perform a constant folding operation to eliminate a high-strength (division) operation inside a loop, whereas clang performs this optimization on the equivalent C code.
Why does this matter?
The original example comes from the miniBUDE proxy application, where a four-way conditional expression determines which of a set of constants is assigned to a given variable, the reciprocal of which is then computed. As the reciprocal of a constant is itself a constant, clang performs compile-time constant folding to eliminate the division operation. For some reason, the LLVM backend to the Chapel compiler is not performing the same optimization, which results in a ~75% slowdown for miniBUDE on a typical Intel CPU. (In other words, if the constant folding optimization is performed by hand -- explicitly writing each of the reciprocals as a separate Chapel constant -- the run time for miniBUDE drops by ~40%.)
Steps to Reproduce
A simplified example:
Source Code:
Chapel version:
C version:
Compile command:
With these compiler options on an x86_64 system, the Chapel program contains machine instructions like:
whereas the C program contains no divisions, but only simple additions like:
Configuration Information

- `chpl --version`:
- `$CHPL_HOME/util/printchplenv --anonymize`:
- `clang --version`: