Closed ronawho closed 5 years ago
I see you found it. :-)
I'm currently mulling: is it reasonable to expect programmers to to understand that this is what will happen and what the consequences are, or is this something (or an instance of a larger something) that we should be dealing with better somewhere in the Chapel software stack (module code or runtime)? I'm on the fence, but I think leaning toward a combination of the two. It seems like the module code that I assume implements that max reduction should be doing a 2-level nested loop (across nodes, then across cores within each node) for the reduction rather than what appears to be a single coforall across all the cores in the program. And if a programmer really wants to code a single-level coforall across all the cores in the program they can do so, but should be expected to understand the consequences.
I thought that we converted reduce expressions into loops with reduce intents? (@vasslitvinov?)
Currently on master the "loop with reduce intents" that a reduce expression is converted over is a lame one. It does not use the ForallStmt-based implementation that a forall loop uses. Switching reduce expressions to ForallStmt-based implementation is my WIP.
Simpler reproducer with with just comm diags (this should be easier to make into a test):
use BlockDist, CommDiagnostics;
config const m = 1000000;
const Space = {1..m} dmapped Block({1..m});
var A: [Space] real;
{ writeln("Max reduce:");
resetCommDiagnostics(); startCommDiagnostics();
var r = max reduce forall (a, a1) in zip(A, A) do 1;
stopCommDiagnostics(); printCommDiags(); }
{ writeln("\nReduce intent");
resetCommDiagnostics(); startCommDiagnostics();
var r: real;
forall (a, a1) in zip(A, A) with (max reduce r) do r = max(r, 1);
stopCommDiagnostics(); printCommDiags(); }
proc printCommDiags() {
const diags = getCommDiagnostics();
for (loc, diag) in zip (Locales, diags) do
writeln(loc, " initiated ", diag.execute_on, " AMs");
}
Max reduce:
LOCALE0 initiated 0 AMs
LOCALE1 initiated 28 AMs
LOCALE2 initiated 28 AMs
LOCALE3 initiated 28 AMs
Reduce intent
LOCALE0 initiated 0 AMs
LOCALE1 initiated 1 AMs
LOCALE2 initiated 1 AMs
LOCALE3 initiated 1 AMs
In the reduce intent case we'll still create numLocales tasks on locale 0, but that's a lot more scalable than numLocales*maxTaskPar tasks.
@vasslitvinov: That's what I was hoping you'd say, thanks!
Also, to mention something that wasn't clear to me in Elliot's description above: Even the better-behaved forall loop with reduce intent case creates a number of tasks equal to the number of locales (just not the number of locales * maxTaskPar). Something we'll need to address eventually...
[oops, he just clarified it too as I was typing this]
How many tasks do we get if we take the reduce intent out?
How many tasks do we get if we take the reduce intent out?
Remote locales will not creates any tasks on locale 0 if you remove the reduce intent.
For a reduction, each locale needs to send its partial result to its parent. So 1 extra AM per locale makes sense. Of course we can do better, ex. with tree-based remote task spawning.
[after-thought] or if we implement GPU-style aggregation of partial results.
For a reduction, each locale needs to send its partial result to its parent. So 1 extra AM per locale makes sense. Of course we can do better, ex. with tree-based remote task spawning.
Right, for the current implementation, 1 AM per locale makes sense and is slightly more scaleable than numLocales*masTaskPar in the reduce expr case. Longer term I agree we need tree-based reductions and/or hardware assisted reductions (i.e. aries supports a max atomic op, so if you did the final accumulation into an atomic we could use that and not create any tasks.)
Switching reduce expressions to ForallStmt-based implementation is my WIP.
What's the timeframe for completing that?
Switching reduce expressions to ForallStmt-based implementation is my WIP.
My optimistic forecast is by year-end.
This has been resolved by https://github.com/chapel-lang/chapel/pull/12131
The reproducer in https://github.com/chapel-lang/chapel/issues/11820#issuecomment-445353457 now reports:
Max reduce:
LOCALE0 initiated 0 AMs
LOCALE1 initiated 1 AMs
LOCALE2 initiated 1 AMs
LOCALE3 initiated 1 AMs
Reduce intent
LOCALE0 initiated 0 AMs
LOCALE1 initiated 1 AMs
LOCALE2 initiated 1 AMs
LOCALE3 initiated 1 AMs
A max reduction on a forall over a block distributed array ends up creating an insane number of tasks on locale 0. This causes OOM for stream-global at higher scales (and currently requires us to set a 128K stack sizes to avoid the OOM):
https://github.com/chapel-lang/chapel/blob/e567484902722e283cc45f86b03158f0b0c6106b/test/release/examples/benchmarks/hpcc/stream.chpl#L138
Here's a smaller reproducer that uses VisualDebug to track task creation:
The forall intent version does not seem to have this problem: