chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org
Other
1.8k stars 423 forks source link

Max reduction on BlockDist forall creates O(remoteTasks) tasks on locale 0 #11820

Closed ronawho closed 5 years ago

ronawho commented 5 years ago

A max reduction on a forall over a block distributed array ends up creating an insane number of tasks on locale 0. This causes OOM for stream-global at higher scales (and currently requires us to set a 128K stack sizes to avoid the OOM):

https://github.com/chapel-lang/chapel/blob/e567484902722e283cc45f86b03158f0b0c6106b/test/release/examples/benchmarks/hpcc/stream.chpl#L138


Here's a smaller reproducer that uses VisualDebug to track task creation:

use BlockDist, VisualDebug;

config const m = 1000000;

const Space = {1..m} dmapped Block({1..m});
var A: [Space] real;

startVdebug("max_reduce");
var r = max reduce forall (a, a1) in zip(A, A) do 1;
stopVdebug();
./max_reduce -nl 40 # 28 core nodes
grep Btask max_reduce/max_reduce-0 | wc -l # count task begin events
1129 # ~numLocales*here.maxTaskPar

The forall intent version does not seem to have this problem:

var r: real;
forall (a, a1) in zip(A, A) with (max reduce r) do r = max(r, 1);
./max_reduce -nl 40
grep Btask max_reduce/max_reduce-0 | wc -l
76
gbtitus commented 5 years ago

I see you found it. :-)

gbtitus commented 5 years ago

I'm currently mulling: is it reasonable to expect programmers to to understand that this is what will happen and what the consequences are, or is this something (or an instance of a larger something) that we should be dealing with better somewhere in the Chapel software stack (module code or runtime)? I'm on the fence, but I think leaning toward a combination of the two. It seems like the module code that I assume implements that max reduction should be doing a 2-level nested loop (across nodes, then across cores within each node) for the reduction rather than what appears to be a single coforall across all the cores in the program. And if a programmer really wants to code a single-level coforall across all the cores in the program they can do so, but should be expected to understand the consequences.

bradcray commented 5 years ago

I thought that we converted reduce expressions into loops with reduce intents? (@vasslitvinov?)

vasslitvinov commented 5 years ago

Currently on master the "loop with reduce intents" that a reduce expression is converted over is a lame one. It does not use the ForallStmt-based implementation that a forall loop uses. Switching reduce expressions to ForallStmt-based implementation is my WIP.

ronawho commented 5 years ago

Simpler reproducer with with just comm diags (this should be easier to make into a test):

use BlockDist, CommDiagnostics;

config const m = 1000000;

const Space = {1..m} dmapped Block({1..m});
var A: [Space] real;

{ writeln("Max reduce:");
resetCommDiagnostics(); startCommDiagnostics();
var r = max reduce forall (a, a1) in zip(A, A) do 1;
stopCommDiagnostics(); printCommDiags(); }

{ writeln("\nReduce intent");
resetCommDiagnostics(); startCommDiagnostics();
var r: real;
forall (a, a1) in zip(A, A) with (max reduce r) do r = max(r, 1);
stopCommDiagnostics(); printCommDiags(); }

proc printCommDiags() {
  const diags = getCommDiagnostics();
  for (loc, diag) in zip (Locales, diags) do
    writeln(loc, " initiated ", diag.execute_on, " AMs");
}
Max reduce:
LOCALE0 initiated 0 AMs
LOCALE1 initiated 28 AMs
LOCALE2 initiated 28 AMs
LOCALE3 initiated 28 AMs

Reduce intent
LOCALE0 initiated 0 AMs
LOCALE1 initiated 1 AMs
LOCALE2 initiated 1 AMs
LOCALE3 initiated 1 AMs

In the reduce intent case we'll still create numLocales tasks on locale 0, but that's a lot more scalable than numLocales*maxTaskPar tasks.

bradcray commented 5 years ago

@vasslitvinov: That's what I was hoping you'd say, thanks!

Also, to mention something that wasn't clear to me in Elliot's description above: Even the better-behaved forall loop with reduce intent case creates a number of tasks equal to the number of locales (just not the number of locales * maxTaskPar). Something we'll need to address eventually...

[oops, he just clarified it too as I was typing this]

vasslitvinov commented 5 years ago

How many tasks do we get if we take the reduce intent out?

ronawho commented 5 years ago

How many tasks do we get if we take the reduce intent out?

Remote locales will not creates any tasks on locale 0 if you remove the reduce intent.

vasslitvinov commented 5 years ago

For a reduction, each locale needs to send its partial result to its parent. So 1 extra AM per locale makes sense. Of course we can do better, ex. with tree-based remote task spawning.

[after-thought] or if we implement GPU-style aggregation of partial results.

ronawho commented 5 years ago

For a reduction, each locale needs to send its partial result to its parent. So 1 extra AM per locale makes sense. Of course we can do better, ex. with tree-based remote task spawning.

Right, for the current implementation, 1 AM per locale makes sense and is slightly more scaleable than numLocales*masTaskPar in the reduce expr case. Longer term I agree we need tree-based reductions and/or hardware assisted reductions (i.e. aries supports a max atomic op, so if you did the final accumulation into an atomic we could use that and not create any tasks.)

Switching reduce expressions to ForallStmt-based implementation is my WIP.

What's the timeframe for completing that?

vasslitvinov commented 5 years ago

Switching reduce expressions to ForallStmt-based implementation is my WIP.

My optimistic forecast is by year-end.

ronawho commented 5 years ago

This has been resolved by https://github.com/chapel-lang/chapel/pull/12131

The reproducer in https://github.com/chapel-lang/chapel/issues/11820#issuecomment-445353457 now reports:

Max reduce:
LOCALE0 initiated 0 AMs
LOCALE1 initiated 1 AMs
LOCALE2 initiated 1 AMs
LOCALE3 initiated 1 AMs

Reduce intent
LOCALE0 initiated 0 AMs
LOCALE1 initiated 1 AMs
LOCALE2 initiated 1 AMs
LOCALE3 initiated 1 AMs