Open ronawho opened 5 years ago
Offline we had wondered if https://github.com/chapel-lang/chapel/pull/13824 might have performance implications. We didn't see any in some microbenchmarks we made (shown below) but there appears to be a non-trivial one for NPB EP.
Memory-bandwidth bound microbenchmark that didn't seem to be hurt by this change:
use BlockDist, Random, Time;
config const randomFill = false;
config const n = 50_000_000_000;
const space = {1..n};
const D = space dmapped Block(space);
var A: [D] real;
if randomFill then
fillRandom(A, 314159265);
else
forall i in D do A.localAccess[i] = i;
var t: Timer; t.start();
const min = min reduce A;
writeln((min, t.elapsed()));
(That took ~0.75 seconds for -nl4
on jupiter before and after that PR)
I think Vass had tried to make a compute-bound benchmark too, but I don't think that regressed either.
For the compute-bound benchmark below, the minimal timings from 30 runs:
t1: before 1.116 after 1.096 (-1.8%)
t2: before 1.145 after 1.095 (-4.6%)
Most of the 30 runs have timings close to these minima.
jupiter, PrgEnv-gnu, chapel/1.20.0.20190822
before: compiled with --fast after: --prepend-internal-module-dir int --fast execution: -nl1
int/ChapelBase.chpl differs from the one in $CHL_HOME like so:
< inline proc min(x: real(?w), y: real(w)) return if x < y then x else y;
> inline proc min(x: real(?w), y: real(w)) { compilerWarning("vass min"); return if (x < y) | isnan(x) then x else y; }
I also had int/ChapelReduce.chpl, that shouldn't matter.
use Random, Time;
config const n = 100_000_000;
config const numnan = -1;
config const fac = -1.0;
config const init = 1.0;
var a = init;
var b = 0.0;
// warm up
for i in 1..n {
a *= fac;
b = min(b,a);
}
var t1: Timer; t1.start();
for i in 1..n {
a *= fac;
b = min(b,a);
a *= fac;
b = min(a,b);
}
t1.stop();
const b1 = b;
var t2: Timer; t2.start();
for i in 1..n {
a *= if i == numnan then NAN else fac;
b = min(b,a);
a *= fac;
b = min(a,b);
}
t2.stop();
const b2 = b;
writeln(t1.elapsed(), " ", t2.elapsed());
if numnan > 0 then writeln(b1,b2);
[added:] Brad suggested that possibly the C compiler is seeing that I am filtering an exceptional case out and so using faster-path code as a result.
npb/ep/mcahir/ep.chpl contains this:
var l = max(abs(tx), abs(ty)):int+1;
With #13824 "min/max with NaNs", the generated code went from:
if (call_tmp_chpl82 > call_tmp_chpl83) {
to
call_tmp_chpl84 = (call_tmp_chpl82 > call_tmp_chpl83);
call_tmp_chpl85 = chpl_macro_double_isnan(call_tmp_chpl82);
if (call_tmp_chpl84 | (call_tmp_chpl85 != INT64(0))) {
causing a slowdown of 12%.
If instead of the library max() we change ep.chpl to invoke a custom max that does not cater to NaNs, the performance goes back.
Interestingly, if we switch from "max on reals then cast to int" to "cast each real to int then max", the performance recovers -- almost. It goes to the small, although detectable, 0.3% above the pre-#13824 performance.
One suggestion is to have a flag along the lines of --no-nan-comparisons
that causes these comparisons to be implemented in the old, not-taking-NaN-into-consideration way. The user would opt into if they have a case like this that suffers. We could then create two variations of EP, one that throws the flag and one that doesn't.
This is discussed in #14000.
if (call_tmp_chpl84 | (call_tmp_chpl85 != INT64(0))) {
Is that supposed to be a "bitwise or", or should it be a "logical or"?
Is that supposed to be a "bitwise or", or should it be a "logical or"?
I saw Julia or somebody else use a "bitwise or" in this case. I thought "clever them, compute the two tests in parallel, rather than one after another". So I gave it a shot.
What do you think about can switching to "logical or" on master and watching performance impact?
It doesn't impact performance, but it seemed odd to me.
FWIW throwing --no-ieee-float
seems to restore performance. We map that to -ffast-math
under gcc which among other things enables -ffinite-math-only
, which I think disables NaN checks.
It [using "bitwise or", not "logical or"] doesn't impact performance, but it seemed odd to me.
Turns out yes, it impacts performance, alas not to the better: https://github.com/chapel-lang/chapel/issues/14000#issuecomment-1911690560
There was a performance regression from "Min/max with NaNs and imag inequalities" (#13824) for NPB EP.
Here's some graphs: