chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org

Improve performance of naive bale histogram benchmark #9782

Closed ronawho closed 6 years ago

ronawho commented 6 years ago

Bale is a collection of communication libraries and mini apps. The simplest code is the histogram benchmark. Today we're between 5x and 10x slower than the reference UPC version, primarily because we don't implement non-blocking atomics.

Here's the core kernel for the naive histogram in upc:

 for(i = 0; i < T; i++) {
#pragma pgas defer_sync
    counts[index[i]] -= v;
  }
  lgp_barrier();

where the #pragma pgas defer_sync allows the UPC compiler to use non-blocking atomics, deferring completion of the remote updates until the next synchronization point (the lgp_barrier here).

Ideally we'd want to write something like:

forall r in rindex {
  A[r].add(1, order=memory_order_relaxed);
}

but today we don't do anything with the relaxed memory order hint, so we end up doing blocking atomic operations.

For 2 million updates and 1000 table entries per thread/task, the reference version runs in roughly 1.2 seconds and we're around 5.4 seconds, so we're over 4x slower. If you remove the #pragma pgas defer_sync from the reference version it runs in ~4.2 seconds, so the main reason we're behind appears to be the lack of non-blocking atomics, though there is probably also some room to improve the speed of blocking atomics.
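Put in ratio terms: 5.4s / 1.2s is roughly 4.5x behind the non-blocking reference, while 5.4s / 4.2s is only about 1.3x behind the blocking reference, which is why the gap looks like it's mostly about non-blocking atomics.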

I think we need performance to be within 1.5x of the reference version to really be competitive here.

ronawho commented 6 years ago

Here's how I'm building/running the reference version:

git clone git@github.com:jdevinney/bale.git
cd bale

# set up the Cray compiler as the UPC compiler
module unload $(module list -t 2>&1 | grep PrgEnv-)
module load PrgEnv-cray
export UPC=cc
export UPCFLAGS="-hupc"

# configure scripts seem to be missing execute permission
find . -name configure | xargs chmod +x 

# -f forces autoconf remake (configure scripts don't work out of the box for me)
# -c just passes flags onto the configure scripts
export PLATFORM=cray
sh install.sh -f -c CC=cc

And to run the histogram code on 28-core (56-thread) Broadwell compute nodes:

export PLATFORM=cray
export CORES_PER_NODE=28
export NODES=2

srun --nodes=$NODES --ntasks=$(($NODES*CORES_PER_NODE)) --ntasks-per-node=$CORES_PER_NODE --constraint=BW28 ./build_$PLATFORM/bin/histo -M 1
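
For comparison, here's a minimal sketch of how the Chapel version (from the next comment) could be compiled and run on the same nodes; the histo.chpl filename is just a placeholder, while --fast and -nl are the standard compiler and launcher flags:

chpl --fast histo.chpl -o histo   # compile the Chapel version with optimizations
./histo -nl $NODES                # run across $NODES locales via the system launcher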
ronawho commented 6 years ago

Here's a roughly equivalent Chapel version of histo_agi:

use Time;
use CyclicDist;
use BlockDist;
use Random;

config const N=2000000 * here.maxTaskPar; // number of updates
config const M=1000 * here.maxTaskPar * numLocales; // size of table

// allocate main table and array of random ints
const Mspace = {0..M-1};
const D = Mspace dmapped Cyclic(startIdx=Mspace.low);
var A: [D] atomic int;

const Nspace = {0..(N*numLocales - 1)};
const D2 = Nspace dmapped Block(Nspace);
var rindex: [D2] int;

/* set up random indices into the table */
fillRandom(rindex, 208); // the 208 is a seed
forall r in rindex {
  r = mod(r, M);
}

var t: Timer;
t.start();
/* main loop */
forall r in rindex {
  A[r].add(1); // atomic add
}
t.stop();
writeln(t.elapsed());
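
For a rate-based comparison with the reference, a small sketch (not part of the code above) that turns the elapsed time into an update rate:

// hypothetical reporting line, using the timer and array declared above
writeln("updates/sec: ", rindex.size / t.elapsed());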
bradcray commented 6 years ago

I was thinking about the following Chapel code last night:

forall r in rindex {
  A[r].add(1);
}

and wondering whether a (sufficiently capable) Chapel compiler could automatically turn this .add() into a non-blocking operation, given that it appears within a forall loop with no additional memory-fence operations before the one implied by the end of the loop itself. That is, if the semantics of the forall indicate that all iterations can be run simultaneously given sufficient resources, and nothing else within the iteration relies on the result, it seems the compiler has the semantic information it needs to use a non-blocking add. Does that seem sound, or am I missing something?
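
For contrast, here's a quick sketch (reusing A and rindex from the code above) of a loop where each iteration does rely on the atomic's result, so this transformation wouldn't be legal:

forall r in rindex {
  const old = A[r].fetchAdd(1); // the result is consumed below,
  if old == 0 then              // so the add can't be fire-and-forget
    writeln("first update to entry ", r);
}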

(of course, teaching the current compiler architecture to recognize and optimize this case may have additional challenges...)

ronawho commented 6 years ago

It may be possible/legal, though I would need to think about that for a while.

That said, even if it is possible, these sorts of optimizations always seem fickle to me, and I don't think they're worthwhile in the near term. For example, if you wrap the operation in a function, the compiler may lose the ability to optimize. So something like:

proc myproc(A, r) { A[r].add(1); }
forall r in rindex  {
  myproc(A, r);
}

may not be optimized. As a concrete example, this type of "optimization" (via a pragma) in UPC only works in pretty direct cases:

// gets turned into NB
for(i = 0; i < T; i++) {
#pragma pgas defer_sync
  counts[index[i]] -= v;
}
// does _not_ get turned into NB
for(i = 0; i < T; i++) {
#pragma pgas defer_sync
  lgp_atomic_add(counts, index[i], -v);
}

That's not to say I think we should never pursue this type of optimization, but it's hard to implement and "unreliable" in the sense that small source-code changes can thwart it, so it'll probably only work in simple/direct cases. That makes me think it's not worth pursuing in the near term.

ronawho commented 6 years ago

The histo agi benchmark was added in https://github.com/chapel-lang/chapel/pull/9932 and can be seen here: https://github.com/chapel-lang/chapel/blob/master/test/studies/bale/histogram/histo-atomics.chpl

We optimized blocking network atomics as part of https://github.com/chapel-lang/chapel/issues/9835, and our performance is now on par with the blocking UPC version (the UPC code with #pragma pgas defer_sync removed).

We implemented buffered atomics in https://github.com/chapel-lang/chapel/pull/10702, which offer performance on par with the non-blocking UPC version. Currently this is written as:

use BufferedAtomics;
forall r in rindex do
  A[r].addBuff(1);
flushAtomicBuff();

where manual flushing is required at the end, and addBuff is used instead of add with a relaxed ordering. Longer term we hope to flush automatically at the end of task joins or on-stmts, but for now manual flushing was simplest to implement. Running on 16 nodes of a Cray XC with 28-core Broadwells, I see ~360 MB/s per node for the buffered Chapel version and ~355 MB/s per node for the non-blocking UPC version.

These buffered atomics are currently implemented using ugni chained transactions, which offer a significantly higher transaction rate. However, we know that some of our users are on a CLE version that doesn't support chained transactions, so we'll need to fall back to non-blocking atomics in that case. That work, as well as some other cleanups and improvements, is captured in https://github.com/chapel-lang/chapel/issues/10743

Taking a pretty graph from the PR -- buffered atomics also significantly improve the performance of RA:

[figure: RA performance graph from the PR]

I'm going to leave this open since there are some followups we want to do, but I think the performance aspect is done.

ronawho commented 6 years ago

The short term TODOs in #10743 are done.

Documentation for the current buffered atomics is available at https://chapel-lang.org/docs/1.18//modules/packages/BufferedAtomics.html (though I expect the interface to change, since routines like addBuff aren't very elegant and reveal too much about the implementation).

Still, I believe the performance aspect of this is done, so I'm going to close this issue; follow-up work for a better API will be captured in #10743 or other issues.
