OneLoneCoder / olcPixelGameEngine

The official distribution of olcPixelGameEngine, a tool used in javidx9's YouTube videos and projects

Further AVX optimizations #129

Open Barracuda72 opened 4 years ago

Barracuda72 commented 4 years ago

First and foremost, I want to thank @OneLoneCoder for his great video series!

Next, I'd like to propose the following optimizations to the Mandelbrot fractal generation code. I'll be focusing on the internal repeat...goto loop of olcFractalExplorer::CreateFractalIntrinsics, because that's where the most interesting things happen:

repeat:
        _zr2 = _mm256_mul_pd(_zr, _zr);
        _zi2 = _mm256_mul_pd(_zi, _zi);
        _a = _mm256_sub_pd(_zr2, _zi2);
        _a = _mm256_add_pd(_a, _cr);
        _b = _mm256_mul_pd(_zr, _zi);
        _b = _mm256_fmadd_pd(_b, _two, _ci);
        _zr = _a;
        _zi = _b;
        _a = _mm256_add_pd(_zr2, _zi2);
        _mask1 = _mm256_cmp_pd(_a, _four, _CMP_LT_OQ);
        _mask2 = _mm256_cmpgt_epi64(_iterations, _n);
        _mask2 = _mm256_and_si256(_mask2, _mm256_castpd_si256(_mask1));
        _c = _mm256_and_si256(_one, _mask2);
        _n = _mm256_add_epi64(_n, _c);
        if (_mm256_movemask_pd(_mm256_castsi256_pd(_mask2)) > 0)
                goto repeat;
  1. The first idea is: why do we check the maximum iteration count separately for each element of the AVX vector, when in fact a single check would do? All lanes share the same upper limit, so we only have to make sure we never exceed it. In other words, we will terminate our loop in two cases: either every lane has escaped (the comparison mask is all zeros), or we have reached the maximum number of iterations.

Let's introduce an additional variable called n that counts the number of iterations we've gone through: just your usual scalar integer counter, nothing more, nothing less. We will check its value in the if () block.

But that's not for nothing's sake. It turns out we can now remove _mask2 and its computation from the program entirely, because _mask1 already contains everything we need (and even more than that, but let's not get ahead of ourselves): it's a group of flags telling us whether we should continue the iterative process (i.e. increment _n) or not. Let's remove _mask2:

        int n = 0;
repeat:
        _zr2 = _mm256_mul_pd(_zr, _zr);
        _zi2 = _mm256_mul_pd(_zi, _zi);
        _a = _mm256_sub_pd(_zr2, _zi2);
        _a = _mm256_add_pd(_a, _cr);
        _b = _mm256_mul_pd(_zr, _zi);
        _b = _mm256_fmadd_pd(_b, _two, _ci);
        _zr = _a;
        _zi = _b;
        _a = _mm256_add_pd(_zr2, _zi2);
        _mask1 = _mm256_cmp_pd(_a, _four, _CMP_LT_OQ);
        _c = _mm256_and_si256(_one, _mm256_castpd_si256(_mask1));
        _n = _mm256_add_epi64(_n, _c);
        n++;
        if ((_mm256_movemask_pd(_mask1) > 0) && (n < iterations))
                goto repeat;

OK. We've traded two AVX instructions (albeit not very computationally expensive ones) for three basic integer and logic ones. Does it impact performance? Well, yes! This version is already 5-10% faster than the original (on my Skylake machine, at least).

  2. Can we do better? Yes, we can!

Look at the poor _c variable. Its only purpose in life is to hold the result of the _one & _mask1 expression. But _mask1 already contains the necessary information about which elements of _n we should increment, in the form of boolean values. Sadly, "true" in AVX is all ones (0xFFFFFFFFFFFFFFFF for our 64-bit lanes), not just 1 as we would like. But if we remember our computer science basics (the school curriculum should cover it, IIRC), we can recognize this pattern: an all-ones value is -1 in two's complement! So, instead of

_c =  (_one & _mask1)
_n = _n + _c

we could write

_n = _n - _mask1

and save another AVX operation! Let's do it:

        int n = 0;
repeat:
        _zr2 = _mm256_mul_pd(_zr, _zr);
        _zi2 = _mm256_mul_pd(_zi, _zi);
        _a = _mm256_sub_pd(_zr2, _zi2);
        _a = _mm256_add_pd(_a, _cr);
        _b = _mm256_mul_pd(_zr, _zi);
        _b = _mm256_fmadd_pd(_b, _two, _ci);
        _zr = _a;
        _zi = _b;
        _a = _mm256_add_pd(_zr2, _zi2);
        _mask1 = _mm256_cmp_pd(_a, _four, _CMP_LT_OQ);
        _n = _mm256_sub_epi64(_n, _mm256_castpd_si256(_mask1));
        n++;
        if ((_mm256_movemask_pd(_mask1) > 0) && (n < iterations))
                goto repeat;

How's the performance? Not that big of a difference, but we gained a few percent for sure. I'm too lazy to measure precisely, but probably around 2-3%. And we simplified our code, which is always a good thing.
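As a small aside, here's a scalar sanity check of the two's complement argument. This is a standalone illustration, not part of the fractal code: a comparison lane is either all zeros or all ones, and an all-ones 64-bit value reinterpreted as a signed integer is -1, so subtracting the mask increments exactly the lanes that are still iterating.

#include <cstdint>
#include <iostream>

int main()
{
        std::int64_t n = 5;
        std::int64_t mask_active = ~std::int64_t(0); // lane still iterating: all ones, i.e. -1
        std::int64_t mask_done   = 0;                // lane that has already escaped

        std::cout << (n - mask_active) << "\n";      // prints 6: the active lane gets incremented
        std::cout << (n - mask_done) << "\n";        // prints 5: the finished lane stays unchanged
}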

  3. But that's not all. If the previous modification wasn't tricky enough, this one will be.

To keep the processor busy and efficient, we need to fill its pipeline with plenty of instructions and data without interruptions. Of course, we could just copy the code above several times and that would be it, but that's not a very clean solution; there's a better one. Remember the loop unrolling trick from the video? That's what we will use.

And it's quite simple, actually: just wrap almost everything between repeat and the if () into a for loop with a small batch size. You can play with this parameter and see how the speedup changes; just remember that the batch size must be known at compile time, otherwise the compiler won't be able to unroll the loop (well, actually, there is a way, but let's not dive too deep into the dark magic), and it should be small enough. IIRC, GCC's default unroll threshold is around 20 loop iterations (and that behaviour can be tweaked through command-line parameters). Clang's is higher, somewhere in the dozens, if not hundreds. I can't say anything about MSVC.
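(As an aside, the "way" hinted at above is compiler-specific unroll hints. A minimal sketch, assuming GCC 8+ or Clang; whether they actually help here is something you'd have to measure:)

// GCC (8 and newer): ask for the following loop to be unrolled 8 times
#pragma GCC unroll 8
for (int q = 0; q < batch; q++) { /* loop body */ }

// Clang equivalent
#pragma clang loop unroll_count(8)
for (int q = 0; q < batch; q++) { /* loop body */ }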

Here's the code:

        int n = 0;
        const int batch = 8;
repeat:
        for (int q = 0; q < batch; q++) {
                _zr2 = _mm256_mul_pd(_zr, _zr);
                _zi2 = _mm256_mul_pd(_zi, _zi);
                _a = _mm256_sub_pd(_zr2, _zi2);
                _a = _mm256_add_pd(_a, _cr);
                _b = _mm256_mul_pd(_zr, _zi);
                _b = _mm256_fmadd_pd(_b, _two, _ci);
                _zr = _a;
                _zi = _b;
                _a = _mm256_add_pd(_zr2, _zi2);
                _mask1 = _mm256_cmp_pd(_a, _four, _CMP_LT_OQ);
                _n = _mm256_sub_epi64(_n, _mm256_castpd_si256(_mask1));
        }
        n += batch;
        if ((_mm256_movemask_pd(_mask1) > 0) && (n < iterations))
                goto repeat;

This works quite well. The only problem is that the computation of a batch won't stop in the middle, so even if every lane finished on the first or second pass, all the remaining passes will still run. So in the best-case scenario this code will perform worse than the original one; you can pan away from the fractal into the empty space and see it with your own eyes. But in absolute numbers this degradation is quite low, and I suppose we are experimenting with fractals to look at the fractals, not at the empty space around them, right?

All these optimizations combined give me a 5% to 20% increase in performance, depending on the compiler (GCC vs Clang), batch size, maximum number of iterations and the phase of the moon. There's probably something else I'm missing; I'd be glad to hear about it. (At the same time, I purposely left out most explanations of advanced CS topics like numerical stability and convergence; I think those are not as interesting as the actual programming.)

cjwijtmans commented 3 years ago

Why are you using goto instead of do {} while()? Or even a for loop, since you are using a counter variable for it anyway.

const int batch = 8;
// note: _mask1 must compare "true" before the first pass, since this loop tests its condition at the top
for (int n{0}; (n < iterations) && (_mm256_movemask_pd(_mask1) > 0); n += batch)
        for (int q{0}; q < batch; ++q) {
                _zr2 = _mm256_mul_pd(_zr, _zr);
                _zi2 = _mm256_mul_pd(_zi, _zi);
                _a = _mm256_sub_pd(_zr2, _zi2);
                _a = _mm256_add_pd(_a, _cr);
                _b = _mm256_mul_pd(_zr, _zi);
                _b = _mm256_fmadd_pd(_b, _two, _ci);
                _zr = _a;
                _zi = _b;
                _a = _mm256_add_pd(_zr2, _zi2);
                _mask1 = _mm256_cmp_pd(_a, _four, _CMP_LT_OQ);
                _n = _mm256_sub_epi64(_n, _mm256_castpd_si256(_mask1));
        }
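For completeness, a do {} while () version keeps the bottom-of-loop test of the original goto code, so _mask1 does not have to be seeded before the first pass. A sketch, reusing the same variables as above:

int n = 0;
do {
        for (int q = 0; q < batch; q++) {
                _zr2 = _mm256_mul_pd(_zr, _zr);
                _zi2 = _mm256_mul_pd(_zi, _zi);
                _a = _mm256_sub_pd(_zr2, _zi2);
                _a = _mm256_add_pd(_a, _cr);
                _b = _mm256_mul_pd(_zr, _zi);
                _b = _mm256_fmadd_pd(_b, _two, _ci);
                _zr = _a;
                _zi = _b;
                _a = _mm256_add_pd(_zr2, _zi2);
                _mask1 = _mm256_cmp_pd(_a, _four, _CMP_LT_OQ);
                _n = _mm256_sub_epi64(_n, _mm256_castpd_si256(_mask1));
        }
        n += batch;
} while ((_mm256_movemask_pd(_mask1) > 0) && (n < iterations));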