ronawho opened this issue 6 years ago
CCE 8.6.5 is still our default for the module build, but after the 1.18 release, we should explore bumping that up to 8.7 to get the standard atomics.
I'm pleased to see this!
For N-Body, what do you think the bottleneck is there, just as a guess? It doesn't contain a spin lock, does it?
Here's a smaller reproducer for the performance differences for array-view creation (which I believe is the source of overhead for MiniMD and Nbody).
It just does a slice of a 1D array 25 million times:
```chapel
use Time;

config const n = 100,
             numTrials = 25_000_000;

proc main() {
  var A : [1..n] int;
  var t : Timer; t.start();
  for 1..numTrials {
    ref s = A[1..n/2];
    if s.size > A.size then halt();
  }
  t.stop();
  writeln(t.elapsed());
}
```
| config | time |
|---|---|
| intrinsics | 1.9s |
| cstdlib | 3.0s |
Could it be the creation of an array of atomic booleans in `LocalRADCache`?
https://github.com/chapel-lang/chapel/blob/59eb20cd6bb909209c77e7dcb2577a8837b76ef1/modules/internal/DefaultRectangular.chpl#L885
https://github.com/chapel-lang/chapel/blob/72bc37a8230d8b08bc92be4701c86edf31b9023e/modules/internal/DefaultRectangular.chpl#L967
But that'd be weird since it's only used in distributions. I can't find any other atomics in the Chapel array implementation.
Array and array views are split across several files -- I'm still digging into this, but the RAD locks are not created/used for local (default rectangular) arrays.
I've identified the performance difference to be from locking in the base domain and distribution classes. The locking shouldn't be needed for array views, so we should be able to eliminate it. I created https://github.com/chapel-lang/chapel/issues/11619 to optimize out the locking. (edit: that's not quite true; the locking is needed after the slice is returned, just not while we're creating the slice.)
I also created https://github.com/chapel-lang/chapel/issues/11620 to track adding a more comprehensive atomic test suite so we can more easily see which operations will be slower/faster under cstdlib.
We should be able to eliminate most of the overhead by using acquire/release orderings on our base dom/dist locks (https://github.com/chapel-lang/chapel/issues/11851):
Re-running the same slice reproducer as above:
| config | time |
|---|---|
| intrinsics | 2.07s |
| cstdlib | 3.15s |
| cstdlib acq/rel | 2.13s |
gcc performance regressions were resolved in https://github.com/chapel-lang/chapel/issues/11851. We made cstdlib atomics the default for gcc in https://github.com/chapel-lang/chapel/pull/11963 and testing looks good.
We worked around bad headers under clang in https://github.com/chapel-lang/chapel/pull/12040. We made cstdlib atomics the default for clang in https://github.com/chapel-lang/chapel/pull/12042 and testing looks good.
We're not quite ready to make cstdlib atomics the default for intel (https://github.com/chapel-lang/chapel/issues/11954) or cce (https://github.com/chapel-lang/chapel/issues/11953) yet. Intel 18 and cce 8.7.7 have full support but we'll need our gen compiler to catch up to those versions before we can make cstdlib default for the module builds.
We're exploring making cstdlib the default for `--llvm` compiles too, and that's looking pretty good so far.
fyi @LouisJenkinsCS
Glad to hear, nice work @ronawho!
Just FYI: up-to-date status is now tracked at the end of the issue description.
Additional details can be found in this gist and the 1.19 release notes. Here are some nice graphs for a 24-core x86 machine:
- Serial reads are 20x faster and parallel reads are 800x faster under cstdlib. *(graph omitted)*
- Writes are different (slower serial, faster parallel). *(graph omitted)*
- Memory order optimizations are possible. *(graph omitted)*
The atomic read speedup is pretty insane (and we didn't know things were so slow before), so thanks @LouisJenkinsCS for bringing this issue to our attention in https://github.com/chapel-lang/chapel/issues/10996.
No problem @ronawho :)
We tried making `cstdlib` the default for gcc in https://github.com/chapel-lang/chapel/pull/4443 but reverted in https://github.com/chapel-lang/chapel/pull/4456 because of performance regressions. We tried making `cstdlib` the default for clang in https://github.com/chapel-lang/chapel/pull/4480 but reverted in https://github.com/chapel-lang/chapel/pull/4536 because of correctness issues (broken system headers).

That was ~2 years ago, so we should revisit this and see if things have changed.
Happily, it also looks like both Intel and Cray Compiler have added support for C11 atomics!
We did some recent performance testing (mid July 2018) with gcc 8.1 for intrinsics vs. cstdlib and still saw some performance regressions there.
edit: current status (March 2019):
We've made good progress with making cstdlib atomics the default:
Under x86, cstdlib atomics reads are significantly faster. This gist has additional details, but on a 24-core machine, using cstdlib results in a 20x serial speedup, and an 800x parallel speedup.