Barracuda72 opened this issue 4 years ago
Why are you using `goto` instead of `do {} while()` or even a `for(;;)` loop, since you are using a variable for it? Something like this:
```cpp
const int batch = 8;
// assumes _mask1 starts out all-true, otherwise the loop never runs
for (int n{0}; (n < iterations) && (_mm256_movemask_pd(_mask1) > 0); n += batch)
{
    for (int q{0}; q < batch; ++q)
    {
        _zr2 = _mm256_mul_pd(_zr, _zr);
        _zi2 = _mm256_mul_pd(_zi, _zi);
        _a = _mm256_sub_pd(_zr2, _zi2);
        _a = _mm256_add_pd(_a, _cr);          // a = zr*zr - zi*zi + cr
        _b = _mm256_mul_pd(_zr, _zi);
        _b = _mm256_fmadd_pd(_b, _two, _ci);  // b = 2*zr*zi + ci
        _zr = _a;
        _zi = _b;
        _a = _mm256_add_pd(_zr2, _zi2);       // |z|^2
        _mask1 = _mm256_cmp_pd(_a, _four, _CMP_LT_OQ);
        _n = _mm256_sub_epi64(_n, _mm256_castpd_si256(_mask1));
    }
}
```
First and foremost, I want to thank @OneLoneCoder for his great video series!
Next, I'd like to propose the following optimizations to the Mandelbrot fractal generation code. I'll be focusing on the internal `repeat ... goto` loop of `olcFractalExplorer::CreateFractalIntrinsics`, because that's where the most interesting things happen.
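For reference, the loop in question looks approximately like this (reconstructed from the video's code, so details may differ slightly; `_zr`, `_zi`, `_cr`, `_ci`, `_n`, `_one`, `_iterations` and the rest are the AVX registers it already uses):

```cpp
repeat:
    _zr2 = _mm256_mul_pd(_zr, _zr);
    _zi2 = _mm256_mul_pd(_zi, _zi);
    _a = _mm256_sub_pd(_zr2, _zi2);
    _a = _mm256_add_pd(_a, _cr);          // a = zr*zr - zi*zi + cr
    _b = _mm256_mul_pd(_zr, _zi);
    _b = _mm256_fmadd_pd(_b, _two, _ci);  // b = 2*zr*zi + ci
    _zr = _a;
    _zi = _b;
    _a = _mm256_add_pd(_zr2, _zi2);       // |z|^2
    _mask1 = _mm256_cmp_pd(_a, _four, _CMP_LT_OQ);  // lanes still inside |z|^2 < 4
    _mask2 = _mm256_cmpgt_epi64(_iterations, _n);   // lanes with n < iterations
    _mask2 = _mm256_and_si256(_mask2, _mm256_castpd_si256(_mask1));
    _c = _mm256_and_si256(_one, _mask2);  // 1 for lanes that keep going, 0 otherwise
    _n = _mm256_add_epi64(_n, _c);        // per-lane n++
    if (_mm256_movemask_pd(_mm256_castsi256_pd(_mask2)) > 0)
        goto repeat;
```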
Let's introduce an additional variable called `n` that will count the number of iterations we've gone through. Just your usual integer counter, nothing more, nothing less. We will check the value of that variable in the `if ()` block.
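A minimal sketch of that change (`iterations` here is the scalar maximum iteration count that `_iterations` was built from; only the tail of the loop differs):

```cpp
int n = 0;  // scalar trip counter
repeat:
    // ... loop body as before ...
    n++;
    // every live lane increments once per trip, so the scalar n bounds them all
    if ((n < iterations) && (_mm256_movemask_pd(_mm256_castsi256_pd(_mask2)) > 0))
        goto repeat;
```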
But that's not for nothing's sake. It turns out we can now remove `_mask2` and its computation from the program! `_mask1` contains everything we need (and even more than that, but let's not rush): it's a group of flags that tells us whether we should continue the iterative process (i.e. increment `_n`) or not. Let's remove `_mask2`.
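A sketch of the resulting loop tail (the per-lane count check that `_mask2` provided is now covered by the scalar `n`):

```cpp
repeat:
    // ... z updates as before ...
    _a = _mm256_add_pd(_zr2, _zi2);
    _mask1 = _mm256_cmp_pd(_a, _four, _CMP_LT_OQ);
    // the two _mask2 instructions are gone; _mask1 alone marks the live lanes
    _c = _mm256_and_si256(_one, _mm256_castpd_si256(_mask1));
    _n = _mm256_add_epi64(_n, _c);
    n++;
    if ((n < iterations) && (_mm256_movemask_pd(_mask1) > 0))
        goto repeat;
```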
OK. We've traded two AVX instructions (albeit not very computationally expensive ones) for three basic integer and logic ones. Does it impact performance? Well, yes! This version is already 5-10% faster than the original (on my Skylake machine, at least).
Look at the poor `_c` variable. Its only purpose in life is to hold the result of the `_one & _mask1` expression. But `_mask1` already contains the necessary info about which `_n` vector elements we should increment, in the form of boolean values. Sadly, true in AVX is `0xFFFFFFFF` (the width depends on the element size), not just `1` as we would like. But if we remember informatics (the school curriculum should cover it, IIRC), we can recognize this value: it's `-1` in two's complement! So instead of masking with `_one` and adding `_c`, we could simply subtract `_mask1` from `_n` and save another AVX operation! Let's do it.
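In sketch form (the cast is needed because `_mask1` is a `__m256d` while `_n` is a `__m256i`):

```cpp
    // before, two instructions:
    //   _c = _mm256_and_si256(_one, _mm256_castpd_si256(_mask1));
    //   _n = _mm256_add_epi64(_n, _c);
    // after, one instruction: live lanes of _mask1 are -1 (all ones),
    // so subtracting the mask increments exactly those lanes
    _n = _mm256_sub_epi64(_n, _mm256_castpd_si256(_mask1));
```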
How's the performance? Well, not that big of a difference, but we gained some percent for sure. I'm too lazy to measure precisely, but probably around 2-3%. And we simplified our code, which is always a good thing.
To keep the processor busy and effective we need to fill its pipeline with plenty of instructions and data, without any interruptions. Of course, we could just copy the above code several times and that would be it. But that's not a very clean solution; there's a better one. Remember the loop unrolling thing from the video? That's what we will use.
And that's quite simple, actually. Just wrap almost everything between `repeat` and the `if ()` into a `for` loop with a small `batch` size. You can play with this parameter and watch how the speedup changes; only remember that `batch` should be specified at compile time, or else the compiler won't be able to unroll the loop (well, actually, there is a way, but let's not dive too deep into that dark magic). And it should be small enough: IIRC, GCC has an unroll threshold of around 20 loop iterations by default (and that behaviour can be tweaked through command-line parameters). Clang has more than that, somewhere in the dozens, if not hundreds. Can't say anything about MSVC.
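Here's a sketch of the code (assuming the simplified body from the previous steps; the `goto` structure is kept, and `batch` is a compile-time constant):

```cpp
const int batch = 8;
int n = 0;
repeat:
    for (int q = 0; q < batch; ++q)
    {
        _zr2 = _mm256_mul_pd(_zr, _zr);
        _zi2 = _mm256_mul_pd(_zi, _zi);
        _a = _mm256_sub_pd(_zr2, _zi2);
        _a = _mm256_add_pd(_a, _cr);
        _b = _mm256_mul_pd(_zr, _zi);
        _b = _mm256_fmadd_pd(_b, _two, _ci);
        _zr = _a;
        _zi = _b;
        _a = _mm256_add_pd(_zr2, _zi2);
        _mask1 = _mm256_cmp_pd(_a, _four, _CMP_LT_OQ);
        _n = _mm256_sub_epi64(_n, _mm256_castpd_si256(_mask1));
    }
    n += batch;  // escape and count checks happen only at batch boundaries
    if ((n < iterations) && (_mm256_movemask_pd(_mask1) > 0))
        goto repeat;
```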
This works quite fine. The only problem is that computation won't stop in the middle of a batch, so even if every lane has escaped on the first or second pass, all the remaining iterations of the batch will still run. So in the best-case scenario this code will perform worse than the original one. We can pan away from the fractal into the empty space and see it with our own eyes. But in absolute numbers this degradation is quite low, and I suppose we are experimenting with fractals to look at the fractals, not at the empty space around them, right?
All these optimizations combined give me from 5% to 20% increase in performance, depending on the compiler (GCC vs Clang), batch size, maximum number of iterations, and the phase of the moon. Probably there's something else that I'm missing; I would be glad to hear about it. (At the same time, I purposely left out most explanations of advanced CS stuff like numerical stability and convergence; I think that's not as interesting as the real programming.)