Why is mandelbrot-complex slower with --llvm?

mppf commented 6 years ago

The LLVM backend is ~40% slower for the mandelbrot-complex benchmark:

https://chapel-lang.org/perf/chapcs/llvm/?startdate=2018/07/12&graphs=mandelbrotall,mandelbrotvariations

it takes about 25 s vs 18 s with the C backend.

This spike is to investigate the reason for this performance difference.

PR #7070 might be related.

mppf commented 5 years ago

On my system, C backend+GCC without --no-ieee-float -> 72s; with --no-ieee-float -> 15s. The LLVM version is about 17s.

The customized version of cabs from PR #7070 does not appear to be necessary anymore, at least not with LLVM 6 or 8.

mppf commented 5 years ago

I think this might be related: https://medium.com/@smcallis_71148/complex-arithmetic-is-complicated-873ec0c69fc5

With -ffast-math, complex multiplies and abs calls with GCC can use the naive algorithm, but with clang, the complex multiply includes conditionals.
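To make the distinction concrete, here is a minimal sketch (not code from Chapel or from either compiler) of the naive "FOIL" multiply and of an input where it produces NaN in both parts. The helper names `make_complex`, `naive_mul`, and `naive_loses_infinity` are hypothetical. The conditionals mentioned above exist to recover an infinite result in cases like this one.

```c
#include <assert.h>
#include <complex.h>
#include <math.h>
#include <string.h>

/* Build a complex value from its parts without writing `re + im * I`,
   which can itself turn an infinite part into NaN. C99 specifies that a
   double complex has the same layout as double[2] (real part first). */
static double complex make_complex(double re, double im) {
  double parts[2] = { re, im };
  double complex z;
  assert(sizeof z == sizeof parts);
  memcpy(&z, parts, sizeof z);
  return z;
}

/* Naive product: (a + bi)(c + di) = (ac - bd) + (ad + bc)i, no special cases. */
static double complex naive_mul(double complex x, double complex y) {
  double a = creal(x), b = cimag(x);
  double c = creal(y), d = cimag(y);
  return make_complex(a * c - b * d, a * d + b * c);
}

/* Returns 1 if the naive product of (inf + inf*i) and (1 + 0i) has NaN in
   both parts: inf * 0 is NaN, and that NaN then spreads through the sums. */
static int naive_loses_infinity(void) {
  double complex x = make_complex(INFINITY, INFINITY);
  double complex y = make_complex(1.0, 0.0);
  double complex p = naive_mul(x, y);
  return isnan(creal(p)) && isnan(cimag(p));
}
```

Under IEEE arithmetic (that is, compiled without -ffast-math) `naive_loses_infinity()` returns 1, even though one operand is infinite and the other is finite and nonzero.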

bradcray commented 5 years ago

This makes me wonder: For cases where we've supplied our own real * complex or imag * complex overloads (or other similar operations), as @damianmoz was asking about over on issue #11941, are we being overly cavalier and should we be doing more work to handle NaN / infinity cases more properly? (similar to the "full of conditionals" gcc code in the article Michael referenced).

bradcray commented 5 years ago

Looks like Michael may be wondering the same thing over here.

mppf commented 5 years ago

@bradcray - I don't think that there's any challenge to real * complex or imag * complex in this setting, because the nan-ness / inf-ness will propagate. In fact the C standard defines these operations (in the informative Annex G) and does not suggest conditionals like those in the complex * complex case.
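A small sketch of what component-wise propagation means for real * complex, assuming the Annex G formulation in which the real operand simply scales each part. The helper names `cplx_from_parts` and `real_times_complex` are hypothetical and this is not Chapel's module code.

```c
#include <assert.h>
#include <complex.h>
#include <math.h>
#include <string.h>

/* Assemble a complex value from its parts via the double[2] layout that
   C99 specifies, avoiding any arithmetic on the parts themselves. */
static double complex cplx_from_parts(double re, double im) {
  double parts[2] = { re, im };
  double complex z;
  assert(sizeof z == sizeof parts);
  memcpy(&z, parts, sizeof z);
  return z;
}

/* real * complex as two independent scalar products:
   r * (x + iy) = (r * x) + i(r * y). There are no cross terms, so a NaN
   or infinity in one part cannot leak into the other. */
static double complex real_times_complex(double r, double complex z) {
  return cplx_from_parts(r * creal(z), r * cimag(z));
}
```

For example, `real_times_complex(0.0, cplx_from_parts(INFINITY, 2.0))` has a NaN real part (0 * inf) while the imaginary part is an ordinary 0.0.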

bradcray commented 5 years ago

I was hoping that might be the case, but was worried that I was being naive. E.g., I wasn't certain whether, if the real op complex.real portion became NaN (say), that should semantically also be expected to affect the real op complex.imag component when that component did not itself become NaN.

Or maybe put another way: In the same way that the article's author said s/he'd just used the FOIL method from school, that's the way I approached any operations on complexes that I implemented in module code, so was curious whether I'd introduced other naive assumptions.

I also want to emphasize the "or other similar operations" in my question in case despite multiplication being easy some other case is not (like division or exponentiation, say, though with a quick glance it looks like we don't support mixed complex / float exponentiation overloads...).

mppf commented 5 years ago

Note, I just tried -fcx-limited-range with GCC on mandelbrot-complex and it's much slower than with -ffast-math. I think it'd be worth understanding what about -ffast-math is providing the performance here.

damianmoz commented 5 years ago

@mppf, at the risk of oversimplifying things, the -fcx-limited-range I think does appropriate (whatever that means) scaling of the numerator by (I would guess) something like an exact power of 2 that keeps the denominator in a nice range when summing the squares of the real and imaginary parts of the denominator. The conditionals to extract the scale factor and the scaling will add significant overhead. I do not use that option because I would hope that I pay careful enough attention to when I need a complex division that I handle this scenario myself. And if I am (regularly?) stupid enough to forget, it is my own fault. But then you need such a capability/option to protect people from themselves and stay strict. But if you are using C++ complex numbers, isn't something that gets handled by the back end unless I am mistaken about how the Chapel compiler works (which is always possible).

mppf commented 5 years ago

@damianmoz - the benchmark is doing complex multiplication, not division. We're using C99 complex numbers which are different from C++ complex numbers (but they definitely have some similarities).

> But if you are using C++ complex numbers, isn't something that gets handled by the back end unless I am mistaken about how the Chapel compiler works (which is always possible).

Even if GCC or LLVM optimizations handle it, I need to understand what is key to the performance here, since it's not always working optimally. In particular it seems to require --no-ieee-float / -ffast-math, which is a bit unfortunate, and we still have the problem that --llvm is slower than with the C backend (and we don't know why).

damianmoz commented 5 years ago

I meant C99 complex numbers. Sorry. It's Friday here and my brain must be in weekend mode.

When I looked at the mandelbrot-complex benchmark in the jacobnelson directory, I could not see any division either. So I figured I was not answering your question.

It is not a simple problem obviously. What if you recode this in C99? Do you see similar issues or am I totally missing the point?
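As a starting point for such an experiment, a C99 version of the inner loop might look like the sketch below. It assumes the kernel is the usual z = z*z + c iteration with an escape test at magnitude 2. The function name `mandel_iters` is hypothetical, and this is not a translation of the actual Chapel benchmark source. The escape test compares the squared magnitude against 4.0 to stay free of libm; replacing it with `cabs(z) > 2.0` would exercise the abs call discussed earlier in the thread.

```c
#include <assert.h>
#include <complex.h>

/* Number of iterations of z = z*z + c (starting from z = 0) completed
   before |z| exceeds 2, capped at max_iter. The complex multiply z * z
   is the operation whose code generation differs with -ffast-math. */
static int mandel_iters(double complex c, int max_iter) {
  assert(max_iter >= 0);
  double complex z = 0.0;
  int i;
  for (i = 0; i < max_iter; i++) {
    z = z * z + c;
    double re = creal(z), im = cimag(z);
    if (re * re + im * im > 4.0) {
      break;  /* escaped: |z| > 2 */
    }
  }
  return i;
}
```

Timing this loop over a grid of points, built with and without -ffast-math under both GCC and clang, would show whether the gap also appears outside of Chapel.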

chapel-lang / chapel

Why is mandelbrot-complex slower with --llvm? #11156