chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org

Try making CHPL_ATOMICS=cstdlib the default again #11029

Open ronawho opened 6 years ago

ronawho commented 6 years ago

We tried making cstdlib the default for gcc in https://github.com/chapel-lang/chapel/pull/4443, but reverted in https://github.com/chapel-lang/chapel/pull/4456 because of performance regressions. We tried making cstdlib the default for clang in https://github.com/chapel-lang/chapel/pull/4480, but reverted in https://github.com/chapel-lang/chapel/pull/4536 because of correctness issues (broken system headers).

That was ~2 years ago, so we should revisit this and see if things have changed.

Happily, it also looks like both the Intel and Cray compilers have added support for C11 atomics!

We did some recent performance testing (mid July 2018) with gcc 8.1 for intrinsics vs. cstdlib and still saw some performance regressions there.

edit current status (March 2019):

We've made good progress with making cstdlib atomics the default:

Under x86, cstdlib atomics reads are significantly faster. This gist has additional details, but on a 24-core machine, using cstdlib results in a 20x serial speedup, and an 800x parallel speedup.

dmk42 commented 6 years ago

CCE 8.6.5 is still our default for the module build, but after the 1.18 release, we should explore bumping that up to 8.7 to get the standard atomics.

LouisJenkinsCS commented 6 years ago

I'm pleased to see this!

For N-Body, what do you think the bottleneck is, just as a guess? It doesn't contain a spin lock, does it?

ronawho commented 5 years ago

Here's a smaller reproducer for the performance differences for array-view creation (which I believe is the source of overhead for MiniMD and Nbody).

It just does a slice of a 1D array 25 million times:

use Time;

config const n = 100,
             numTrials = 25_000_000;

proc main() {
  var A : [1..n] int;

  var t : Timer; t.start();
  for 1..numTrials {
    ref s = A[1..n/2];
    if s.size > A.size then halt();
  }
  t.stop();

  writeln(t.elapsed());
}
| config     | time |
| ---------- | ---- |
| intrinsics | 1.9s |
| cstdlib    | 3.0s |

LouisJenkinsCS commented 5 years ago

Could it be the creation of an array of atomic Booleans in LocalRADCache?

https://github.com/chapel-lang/chapel/blob/59eb20cd6bb909209c77e7dcb2577a8837b76ef1/modules/internal/DefaultRectangular.chpl#L885 https://github.com/chapel-lang/chapel/blob/72bc37a8230d8b08bc92be4701c86edf31b9023e/modules/internal/DefaultRectangular.chpl#L967

But that'd be weird since it's only used in distributions. I can't find any other atomics in the Chapel array implementation.

ronawho commented 5 years ago

Arrays and array views are split across several files -- I'm still digging into this, but the RadLocks are not created/used for local (default rectangular) arrays.

ronawho commented 5 years ago

I've tracked the performance difference down to locking in the base domain and distribution classes. I created https://github.com/chapel-lang/chapel/issues/11619 to optimize out the locking (edit: we can't eliminate it entirely; the locking is needed after the slice is returned, just not while we're creating the slice.)

I also created https://github.com/chapel-lang/chapel/issues/11620 to track adding a more comprehensive atomic test suite so we can more easily see which operations will be slower or faster under cstdlib.

ronawho commented 5 years ago

We should be able to eliminate most of the overhead by using acquire/release orderings on our base dom/dist locks (https://github.com/chapel-lang/chapel/issues/11851):

use Time;

config const n = 100,
             numTrials = 25_000_000;

proc main() {
  var A : [1..n] int;

  var t : Timer; t.start();
  for 1..numTrials {
    ref s = A[1..n/2];
    if s.size > A.size then halt();
  }
  t.stop();

  writeln(t.elapsed());
}
| config         | time  |
| -------------- | ----- |
| intrinsics     | 2.07s |
| cstdlib        | 3.15s |
| cstdlib acq/rel | 2.13s |

ronawho commented 5 years ago

gcc performance regressions were resolved in https://github.com/chapel-lang/chapel/issues/11851. We made cstdlib atomics the default for gcc in https://github.com/chapel-lang/chapel/pull/11963 and testing looks good.

We worked around bad headers under clang in https://github.com/chapel-lang/chapel/pull/12040. We made cstdlib atomics the default for clang in https://github.com/chapel-lang/chapel/pull/12042 and testing looks good.

We're not quite ready to make cstdlib atomics the default for intel (https://github.com/chapel-lang/chapel/issues/11954) or cce (https://github.com/chapel-lang/chapel/issues/11953) yet. Intel 18 and cce 8.7.7 have full support but we'll need our gen compiler to catch up to those versions before we can make cstdlib default for the module builds.

We're exploring making cstdlib the default for --llvm compiles too, and that's looking pretty good so far.

fyi @LouisJenkinsCS

LouisJenkinsCS commented 5 years ago

Glad to hear, nice work @ronawho!

ronawho commented 5 years ago

Just FYI: up-to-date status is now tracked at the end of the issue description.

Additional details can be found in this gist and the 1.19 release notes. Here are some nice graphs for a 24-core x86 machine:

The atomic read speedup is pretty insane (and we didn't know things were so slow before), so thanks @LouisJenkinsCS for bringing this issue to our attention in https://github.com/chapel-lang/chapel/issues/10996.

LouisJenkinsCS commented 5 years ago

No problem @ronawho :)