Weird performance characteristics on PPC

So, after a compiler upgrade, I noticed that the AltiVec implementation has weird performance characteristics depending on which GCC version compiles it. For some reason, when I compile the code with GCC 8, the program is almost twice as fast as the version compiled with GCC 7 or GCC 9.

Also, for completeness' sake, I modified the handler to use plain #defines instead of struct/loop wrappers (see the altivec-unwrapped branch). With this implementation, all the compilers I tested give roughly the same performance.
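To make the "wrapped" versus "unwrapped" distinction concrete, here is a rough, portable sketch of the two styles. It is not the actual handler code: Vec4, WrappedVec, kRegs, wrapped_madd, vec4_madd and UNWRAPPED_MADD are all hypothetical names, and a plain float[4] stands in for an AltiVec vector register so the snippet builds on any machine.

```cpp
#include <cstddef>

// Hypothetical stand-in for one 4-lane AltiVec register (vector float).
struct Vec4 {
    float lane[4];
};

// Primitive multiply-add on one register: r = a * b + c, lane by lane.
inline Vec4 vec4_madd(const Vec4& a, const Vec4& b, const Vec4& c) {
    Vec4 r;
    for (std::size_t j = 0; j < 4; ++j)
        r.lane[j] = a.lane[j] * b.lane[j] + c.lane[j];
    return r;
}

// "Wrapped" style (assumed shape): several registers are grouped in a
// struct, and every operation loops over the group.
constexpr std::size_t kRegs = 2;  // hypothetical group size

struct WrappedVec {
    Vec4 v[kRegs];
};

inline WrappedVec wrapped_madd(const WrappedVec& a,
                               const WrappedVec& b,
                               const WrappedVec& c) {
    WrappedVec r;
    for (std::size_t i = 0; i < kRegs; ++i)
        r.v[i] = vec4_madd(a.v[i], b.v[i], c.v[i]);
    return r;
}

// "Unwrapped" style (assumed shape): a plain #define that expands
// straight to the primitive, with no struct or loop in between.
#define UNWRAPPED_MADD(a, b, c) vec4_madd((a), (b), (c))
```

Both styles compute the same values. The difference is how much the compiler has to see through (struct copies, the loop over kRegs) before it can keep everything in registers, and that is the kind of thing that may be optimized differently from one GCC release to the next.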

Below is the result of my testing:

GCC version	altivec-unwrapped GFLOPS	master ("wrapped") GFLOPS
7.4.0	4.11	4.19
8.3.0	4.36	8.53
9.2.1	4.46	4.43

(All numbers are taken from the GFLOPS given in the final message.)

All the resulting files have the same hash, so I don't think that the compiler broke anything during optimization.

06d0386092bbedc945327c13cf872bfa72b95458ae3e802ce45fa14c0cac4722  gcc7-unwrapped.png
06d0386092bbedc945327c13cf872bfa72b95458ae3e802ce45fa14c0cac4722  gcc7-wrapped.png
06d0386092bbedc945327c13cf872bfa72b95458ae3e802ce45fa14c0cac4722  gcc8-unwrapped.png
06d0386092bbedc945327c13cf872bfa72b95458ae3e802ce45fa14c0cac4722  gcc8-wrapped.png
06d0386092bbedc945327c13cf872bfa72b95458ae3e802ce45fa14c0cac4722  gcc9-unwrapped.png
06d0386092bbedc945327c13cf872bfa72b95458ae3e802ce45fa14c0cac4722  gcc9-wrapped.png

What's happening here? What can I do to make the code generated by the other GCC versions just as fast? Any help/direction from someone knowledgeable in C++ and/or GCC internals would be much appreciated.

Note:

The system is a 2 GHz PPC970MP, running Debian unstable with Linux version 5.2.0-2-powerpc64. The output of gcc -v for each version is attached.
The image I use is taken from here. I've also attached it so others could reproduce it easily.

Files:

GCC 7:

GCC 8:

GCC 9:

The source image:

51982698_p37.jpg

DeadSix27 / waifu2x-converter-cpp

Weird performance characteristics on PPC #201