
At what complexity in a floating-point dominated 'forall' does parallelism speed-up occur #11936

damianmoz opened 5 years ago

damianmoz commented 5 years ago

Feel free to tell me how to go fishing for this answer if it already exists on GitHub; I certainly cannot find it with my searches.

There may also be a smarter heading for this issue.

Algebraically, applying a Givens rotation to two vectors involves 4 multiplications and 2 additions as you loop through the consecutive elements of the vectors.

    // given c, s being real(w), and aci[1..m], acj[1..m] also being real(w),
    // most likely 'ref's to the i'th and j'th columns of some matrix 'a'

    forall k in 1..m {
        const y = acj[k];
        const z = aci[k];

        acj[k] = y * c + z * s;
        aci[k] = z * c - y * s;
    }

Is the following, which exploits complex arithmetic and gives Chapel's optimizer only a single algebraic multiplication to recognize and process, simpler or better? The real question is whether it should, or would, produce faster code at some stage during Chapel's evolution (not necessarily today).
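
For reference, the single complex multiplication in the loop below encodes both updates, since

    (z + i*y) * (c + i*s) = (z*c - y*s) + i*(y*c + z*s)

with z = aci[k] and y = acj[k], matching the two assignments in the first loop.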

    const v = (c, s):complex(w);

    forall k in 1..m {
        const u = (aci[k], acj[k]):complex(w);
        const t = u * v;

        aci[k] = t.re;
        acj[k] = t.im;
    }

Or is the zippered solution faster and cleaner, even if it is a bit more abstract?

    forall (aj, ai) in zip(acj, aci) {
        const t = v * (ai, aj):complex(w);

        aj = t.im;
        ai = t.re;
    }

@bradcray, I am warming to the 'zip' but only if it buys me performance. I really do not understand enough of what it does and how it does it but it seems to work, especially in the floating point context.

Are these loop internals so trivial that the overhead of a 'forall' for, say, a length of 10, 40, 100, 400, or 1000 is so high that it is faster to run serially (and maybe unrolled)? There are likely five answers here. Feel free to answer that question in terms of 'what is planned' rather than what today's reality is. I do not necessarily want an answer in terms of what exists today (because I can compute that myself), although if what exists today is regarded as an acceptable eventuality, that works for me.

Does introducing complex into what are real computations pull in a whole lot of unwanted overhead?

Does Chapel handle its own complex number arithmetic or does it rely on something inherent in the underlying compiler technology, i.e. LLVM/Clang or gcc?

@mppf, I realise that the above will yield different results depending on whether it is computed with or without fused multiply-add (FMA) instructions, although that is not really part of this particular issue.

bradcray commented 5 years ago

Hi @damianmoz —

I'm always reluctant to give guesses as to what performance users should "expect" from code since it can vary a lot depending on target hardware, back-end compiler, choice of tasking layer, etc. But off the cuff, I wouldn't think twice about using a forall for a loop like this with a trip count of 1000 or 400, probably would still use one for 100 and 40, though I'm less confident in those cases, and tend to think 10 may be too small to be worthwhile. But the best way to determine this would be to run the trials yourself using the hardware and settings you care about (making sure to use the --fast flag).
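
If it helps, a minimal harness sketch for such trials might look like the following (names and constants are placeholders; 'stopwatch' is the Time module's timer type in current Chapel, spelled 'Timer' in older releases). Compile with --fast:

    use Time;

    config const n = 100, trials = 100_000;

    var aci, acj : [1..n] real = 1.0;
    const (c, s) = (0.8, 0.6);

    var t : stopwatch;
    t.start();
    for 1..trials {
      forall k in 1..n {
        const y = acj[k], z = aci[k];
        acj[k] = y * c + z * s;
        aci[k] = z * c - y * s;
      }
    }
    t.stop();
    writeln("seconds per rotation: ", t.elapsed() / trials);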

With respect to projecting into the future, the main thing I'd say is that the basics of loops like this are unlikely to change dramatically in the non-zippered case. One likely improvement is that we'll hopefully do a better and better job of vectorizing such loops in upcoming versions of the compiler (which would be motivation for leaving them as forall loops), and may do a better job of heuristically squashing parallelism that's overkill.

All that said, note that there are other ways to throttle parallelism beyond rewriting a forall loop into a for loop, such as using the dataParMinGranularity knob or the serial statement (which conditionally squashes parallelism based on an optional boolean expression that follows it). So personally, I'd probably take the approach of expressing parallelism wherever it seems likely to be appropriate and then tuning it away only as it became problematic.
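
For instance (a sketch only; 'parThreshold' is a made-up knob of mine, not a Chapel builtin):

    config const parThreshold = 100;

    proc rotate(c : real, s : real, ref aci : [?D] real, ref acj : [D] real) {
      // run the forall's body serially whenever the vectors are short
      serial D.size < parThreshold {
        forall k in D {
          const y = acj[k];
          const z = aci[k];
          acj[k] = y * c + z * s;
          aci[k] = z * c - y * s;
        }
      }
    }

Alternatively, executing the program with --dataParMinGranularity=100 asks the data-parallel iterators not to create extra tasks unless each task would get at least that many iterations.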

W.r.t. your musing about random access vs. zippering: ultimately, the hope is that the two forms you've written will essentially perform equivalently in all cases. Today, if I were playing the safe bet from a performance perspective, I'd go with the non-zippered approach (particularly if you also like it from a stylistic perspective). There are some cases today where we regrettably lose some performance in zippered cases...

damianmoz commented 5 years ago

Thanks @bradcray. A lot of the time, I deal with lengths of less than 100. I will investigate, and I will look into that knob. I prefer non-zippered except for low-level stuff because it is far more readable.

In the meantime @mppf, would your optimization handle the first form (4 multiplications to discover) or the second (1 multiplication to discover, even if it still entails 4 floating-point multiply instructions to implement)? I will send you a more extreme case shortly.

damianmoz commented 5 years ago

I should have been more clear. There are 3 aspects to the above, which probably got muddled with far too much Christmas cheer. Beyond the raw loop performance, on which @bradcray gave me very thorough details, I am also asking how best to write such an algorithm to achieve an optimal IEEE 754 implementation, whatever optimal means. I am also asking whether the IEEE 754 aspects of complex arithmetic are on @mppf's radar. And finally, I am asking what is the best candidate code (or coding approach) for @mppf and the back-end code generator to optimize. I assume that minimizing statement complexity is a good thing, although complexity is sometimes hard to quantify.

@mppf, do you just hand complex number arithmetic over to some LLVM complex-arithmetic routine to process, which gives you a fighting chance to achieve the higher accuracy, or does Chapel do the complex arithmetic itself and pass some basic parse tree to the code generator? That might not be exactly the right question, but it is certainly close. Are the IEEE 754 aspects something you want on the radar?

To put you in the picture on the accuracy issues: if you look at the first loop, the real and imaginary components of each element of both arrays can be computed to 3*Eps accuracy at best, where Eps is the machine precision. But there are ways that the second loop can compute this to 1*Eps accuracy (for the case where the vectors are real). It relies on the fact that it is a single complex multiplication, for which there are known algorithms to achieve such accuracy, with or without an FMA instruction. Whether C++ (gcc or clang) does this today I do not know. But it will come to be needed during Chapel's lifetime.
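
For illustration, one such known approach builds on Kahan's fma-based evaluation of a*b - c*d (a sketch only; I borrow C's fma() via an extern proc, and the helper names are my own):

    extern proc fma(x : real, y : real, z : real) : real;   // C99 libm fma

    // a*b - c*d with the rounding error of c*d compensated (Kahan)
    inline proc kahanDiff(a : real, b : real, c : real, d : real) : real {
      const p = c * d;
      const e = fma(c, d, -p);   // rounding error of p, computed exactly
      const f = fma(a, b, -p);   // a*b - p, with one rounding
      return f - e;
    }

    // a*b + c*d, same idea
    inline proc kahanSum(a : real, b : real, c : real, d : real) : real {
      const p = c * d;
      const e = fma(c, d, -p);
      const f = fma(a, b, p);
      return f + e;
    }

    // an accurate complex multiply assembled from the two helpers
    inline proc accurateMul(u : complex, v : complex) : complex {
      return (kahanDiff(u.re, v.re, u.im, v.im),
              kahanSum(u.re, v.im, u.im, v.re)):complex;
    }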

A more extreme example of the first loop is where aci and acj are now complex vectors while c and s are still real. Compiling the first loop means identifying those four algebraic multiplications, each of which will need two multiply instructions because the elements of the vectors are now complex numbers (@bradcray, that was the reason for my earlier question, which I really should have put here - sorry). The best accuracy I could expect is 3*Eps in the real or imaginary components of the resulting numbers.

As the second loop now makes no sense because the vector elements are complex, I give you a replacement which again stuffs c and s into a complex cs as before; the algorithm can be rewritten so that there are only two algebraic complex multiplications for the compiler to recognize and optimize:

    proc crot(cs : ?C, ref xv : [?xD] C, ref yv : [?yD] C)
    {
        forall (x, y) in zip(xv, yv) {
            const g = cs * (y.re, x.re):C;
            const h = cs * (y.im, x.im):C;

            (x, y) = ((g.im, h.im):C, (g.re, h.re):C);
        }
    }

Each of the algebraic multiplications above will have four multiply instructions (and more logic if you want to attain that 1*Eps accuracy in each component of both complex numbers).

For this more complex case, which form is the better base from which to optimize, @mppf?

I do not think that striving today for the best IEEE 754 implementation is a high priority, by the way. I am just flagging it while IEEE 754 is on the radar/horizon for other reasons.

mppf commented 5 years ago

@damianmoz - yes, we are using C99 complex numbers.

To answer your questions I created a little standalone program to experiment with:

    config const n = 100;
    config const m = 10000000;

    proc kernel() {
      var A:[1..n] complex = 1;

      for i in 1..m {
        for j in vectorizeOnly(1..n) {
          A[j] = A[j] * A[j];
        }
      }
    }

    kernel();

The C backend with GCC generates a call to __muldc3 unless we compile it with --no-ieee-float (which turns into -ffast-math).

The --llvm version appears to have an inlined version of a similar function, which involves vmulpd, vpermilpd, .... vucomisd. With --no-ieee-float it becomes a vectorized sequence involving vunpcklpd ... vmulpd ... vblendpd.

mppf commented 5 years ago

@damianmoz - I've poked at the question a little bit more and I think I now understand something I didn't before.

For complex multiplication, C has a lot of code to do the multiplication so that infinities and NaNs are handled in a mathematically appropriate way. (I found this article explaining it, though you probably don't need to read it if the previous fact is obvious to you: https://medium.com/@smcallis_71148/complex-arithmetic-is-complicated-873ec0c69fc5 ). Anyway, the result is that in IEEE 754 compliant mode, a C program running complex multiplication with the builtin operator is slower than one writing out the multiplication itself, unless certain flags are used to disable the additional logic. Note that this complex multiplication with NaN and Inf business comes from an "informative" (vs. "normative") part of the C standard and, as far as I can tell, is not part of IEEE 754 at all.

I think the result is that we should consider carefully what to do with Chapel:

a) just always use the "usual mathematical formula" in Chapel complex multiplication;

b) do what the C Annex G says (i.e. more instructions to handle infinities);

c) add a Chapel-level flag to avoid the inf/nan handling for complex multiplication etc. and instead use the "usual mathematical formula" (as the C standard puts it), so that a user could toggle between (a) and (b).
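
For illustration, option (c) could look something like this sketch ('strictComplexMul' and 'cmul' are invented names, not anything that exists today):

    config param strictComplexMul = true;

    inline proc cmul(u : complex, v : complex) : complex {
      if strictComplexMul then
        return u * v;                        // today's builtin (C99) multiply
      else
        // the "usual mathematical formula", no special inf/nan handling
        return (u.re*v.re - u.im*v.im,
                u.re*v.im + u.im*v.re):complex;
    }

A user would then flip the behavior at compile time with -sstrictComplexMul=false.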

This S.O. post indicates that in Fortran, it's up to the implementation. https://stackoverflow.com/questions/39318880/complex-multiplication-divison-with-infinity-and-nan

(It took me a minute to figure out what the Smith 1962 method referred to - it's a strategy for complex division, included below in case we need it later.)


    Algorithm 116
    Complex division
    Robert L. Smith
    Stanford University, Stanford, Calif.

    procedure complexdiv (a, b, c, d) results: (e, f);
    value a, b, c, d; real a, b, c, d;
    comment complexdiv yields the complex quotient of a+ib divided by c+id.
      The method used here tends to avoid arithmetic overflow or underflow.
      Such spills could otherwise occur when squaring the component parts of
      the denominator if the usual method were used;
    begin real r, den;
      if abs(c) >= abs(d) then
      begin r := d/c;
        den := c + r * d;
        e := (a + b * r) / den;
        f := (b - a * r) / den;
      end
      else
      begin r := c/d;
        den := d + r * c;
        e := (a * r + b) / den;
        f := (b * r - a) / den;
      end
    end complexdiv
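
In Chapel, a transliteration might look like this (a sketch; 'smithDiv' is an invented name):

    inline proc smithDiv(u : complex, v : complex) : complex {
      const (a, b, c, d) = (u.re, u.im, v.re, v.im);
      if abs(c) >= abs(d) {
        const r = d / c;
        const den = c + r * d;
        return ((a + b * r) / den, (b - a * r) / den):complex;
      } else {
        const r = c / d;
        const den = d + r * c;
        return ((a * r + b) / den, (b * r - a) / den):complex;
      }
    }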
bradcray commented 5 years ago

> Add a Chapel-level flag

Just a reminder w.r.t. this comment (and perhaps issue #11335 as well) that we have a --no-ieee-float compiler flag today that we could use for this purpose, though I believe that today it only results in passing hints to the back-end C compiler and doesn't affect the algorithms used by Chapel at all (though it could / arguably should).

That said, I also find myself wondering whether we should add more parameterization to the real/imag/complex types such that one could say something like real(64, strictIEEE=true) or real(64, false) to opt into the strictness of operations on the per-element level rather than using one global switch (where an obvious question would be whether operators default to strict or non-strict if they're passed a mix of different strictnesses).
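
A rough sketch of what I'm imagining, written as a user-level wrapper record rather than a language feature (hypothetical; 'sreal' is invented, and mixed operands default to strict here, which is one possible answer to that question):

    record sreal {
      param strict : bool = true;   // compile-time part of the type
      var val : real;
    }

    operator *(a : sreal(?sa), b : sreal(?sb)) {
      return new sreal(sa || sb, a.val * b.val);   // 'strict' wins on a mix
    }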

damianmoz commented 5 years ago

@mppf, so much homework for me to do! When I see what the C99 Annex says, I finally understand what you mean by needing to handle all of this Inf and NaN stuff. Sorry for thinking you were just rabbiting on. And yes, complex number handling is not part of the IEEE 754 standard, although I think that C99 Annex follows the spirit of IEEE 754 elementary function handling.

I like/need option (c) - or am I making too much work for you? Mind you, if I have Inf's or NaN's in my code, I have bigger problems!

@bradcray, having strictIEEE on a per-element level is significantly added complexity, and probably a source of misunderstandings/bugs that would take ages to find. A mixture could make a formal error analysis meaningless.

mppf commented 5 years ago

from @bradcray:

> > Add a Chapel-level flag
>
> Just a reminder w.r.t. this comment (and perhaps issue #11335 as well) that we have a --no-ieee-float compiler flag today that we could use for this purpose, though I believe that today it only results in passing hints to the back-end C compiler and doesn't affect the algorithms used by Chapel at all (though it could / arguably should).

I think the concern with attaching the don't-worry-about-inf-or-nan behavior to --no-ieee-float is that --no-ieee-float also allows approximations and optimizations that change evaluation order. It seems that these would mess up the consistency with formal error analysis that @damianmoz is looking for. So I think we'd need more detailed flags that control individual components.

I think the challenge here is how to create a useful set of more detailed flags without creating a whole lot of them.

from @damianmoz:

> I like/need option (c) or am I making too much work for you. Mind you, if I have Inf's or NaN's in my code, I have bigger problems!

Flags are nice for flexibility and everything, but if any Inf/NaN in your code is an error, isn't (a) good enough for you?

damianmoz commented 5 years ago

@mppf, you say:

> Flags are nice for flexibility and everything, but if any Inf/NaN in your code is an error, isn't (a) good enough for you?

The answer is a conditional yes, but my usage is not broad enough. An Inf/NaN coming out of my code is a problem, not necessarily an error, or even my own error. It may be an indication that I need to rescale a matrix (which I avoided on a first pass for reasons of speed, because most of the time I get away with it), or that I am looking at some soil that is softening (which makes a determinant go to zero), indicating I need to use a different algorithm.

I chased two others I know from your neck of the woods for their opinion. They know way more than I do, but they are busy with other stuff at the moment. I suggest we eventually put a working paper together; then, if I run it past them, I might get their attention. One @bradcray knows, one he probably does not.