chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org
Other
1.79k stars 421 forks source link

NPB: EP performance regression from #13824 #13922

Open ronawho opened 5 years ago

ronawho commented 5 years ago

There was a performance regression from "Min/max with NaNs and imag inequalities" (#13824) for NPB EP.

Here's some graphs:

image

ronawho commented 5 years ago

Offline we had wondered if https://github.com/chapel-lang/chapel/pull/13824 might have performance implications. We didn't see any in some microbenchmarks we made (shown below) but there appears to be a non-trivial one for NPB EP.

Memory-bandwidth bound microbenchmark that didn't seem to be hurt by this change:

use BlockDist, Random, Time;

config const randomFill = false;
config const n = 50_000_000_000;
const space = {1..n};
const D = space dmapped Block(space);
var A: [D] real;

if randomFill then
  fillRandom(A, 314159265);
else
  forall i in D do A.localAccess[i] = i;

var t: Timer; t.start();
const min = min reduce A;
writeln((min, t.elapsed()));

(That took ~0.75 seconds for -nl4 on jupiter before and after that PR)

I think Vass had tried to make a compute-bound benchmark too, but I don't think that regressed either.

vasslitvinov commented 5 years ago

For the compute-bound benchmark below, the minimal timings from 30 runs:

t1:  before 1.116  after 1.096  (-1.8%)
t2:  before 1.145  after 1.095  (-4.6%)

Most of the 30 runs have timings close to these minima.

Setup

jupiter, PrgEnv-gnu, chapel/1.20.0.20190822

before: compiled with --fast after: --prepend-internal-module-dir int --fast execution: -nl1

int/ChapelBase.chpl differs from the one in $CHL_HOME like so:

<   inline proc min(x: real(?w), y: real(w)) return if x < y then x else y;
>   inline proc min(x: real(?w), y: real(w)) { compilerWarning("vass min"); return if (x < y) | isnan(x) then x else y; }

I also had int/ChapelReduce.chpl, that shouldn't matter.

The code

use Random, Time;

config const n = 100_000_000;
config const numnan = -1;
config const fac = -1.0;
config const init = 1.0;

var a = init;
var b = 0.0;

// warm up
for i in 1..n {
  a *= fac;
  b = min(b,a);
}

var t1: Timer; t1.start();
for i in 1..n {
  a *= fac;
  b = min(b,a);
  a *= fac;
  b = min(a,b);
}
t1.stop();
const b1 = b;

var t2: Timer; t2.start();
for i in 1..n {
  a *= if i == numnan then NAN else fac;
  b = min(b,a);
  a *= fac;
  b = min(a,b);
}
t2.stop();
const b2 = b;

writeln(t1.elapsed(), "  ", t2.elapsed());
if numnan > 0 then writeln(b1,b2);

[added:] Brad suggested that possibly the C compiler is seeing that I am filtering an exceptional case out and so using faster-path code as a result.

vasslitvinov commented 5 years ago

npb/ep/mcahir/ep.chpl contains this:

var l = max(abs(tx), abs(ty)):int+1;

With #13824 "min/max with NaNs", the generated code went from:

if (call_tmp_chpl82 > call_tmp_chpl83) {

to

call_tmp_chpl84 = (call_tmp_chpl82 > call_tmp_chpl83);
call_tmp_chpl85 = chpl_macro_double_isnan(call_tmp_chpl82);
if (call_tmp_chpl84 | (call_tmp_chpl85 != INT64(0))) {

causing a slowdown of 12%.

If instead of the library max() we change ep.chpl to invoke a custom max that does not cater to NaNs, the performance goes back.

Interestingly, if we switch from "max on reals then cast to int" to "cast each real to int then max", the performance recovers -- almost. It goes to the small, although detectable, 0.3% above the pre-#13824 performance.

vasslitvinov commented 5 years ago

One suggestion is to have a flag along the lines of --no-nan-comparisons that causes these comparisons to be implemented in the old, not-taking-NaN-into-consideration way. The user would opt into if they have a case like this that suffers. We could then create two variations of EP, one that throws the flag and one that doesn't.

This is discussed in #14000.

ronawho commented 5 years ago

if (call_tmp_chpl84 | (call_tmp_chpl85 != INT64(0))) {

Is that supposed to be a "bitwise or", or should it be a "logical or"?

vasslitvinov commented 5 years ago

Is that supposed to be a "bitwise or", or should it be a "logical or"?

I saw Julia or somebody else use a "bitwise or" in this case. I thought "clever them, compute the two tests in parallel, rather than one after another". So I gave it a shot.

What do you think about can switching to "logical or" on master and watching performance impact?

ronawho commented 5 years ago

It doesn't impact performance, but it seemed odd to me.

ronawho commented 5 years ago

FWIW throwing --no-ieee-float seems to restore performance. We map that to -ffast-math under gcc which among other things enables -ffinite-math-only, which I think disables NaN checks.

vasslitvinov commented 9 months ago

It [using "bitwise or", not "logical or"] doesn't impact performance, but it seemed odd to me.

Turns out yes, it impacts performance, alas not to the better: https://github.com/chapel-lang/chapel/issues/14000#issuecomment-1911690560