So, after a compiler upgrade, I just noticed that the AltiVec implementation has weird performance characteristics when compiled with different GCC versions.
For some reason, if I compile the code with GCC 8, the program is almost twice as fast as the version compiled with GCC 7 or 9.
Also, for completeness' sake, I modified the handler to use plain #defines instead of struct/loop wrappers (see the altivec-unwrapped branch).
With this implementation, all the compilers I tested give roughly the same performance.
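To illustrate the difference between the two styles, here is a minimal portable sketch. It uses plain floats rather than AltiVec intrinsics, it is not the actual handler code, and the names `Lanes`, `MULTIPLY_ADD4`, `dot_wrapped` and `dot_unwrapped` are made up for this example. The "wrapped" form relies on the compiler to inline the member function and unroll the lane loop; the "unwrapped" form spells every lane out through a macro.

```cpp
#include <cstddef>

// Hypothetical lane count; four floats mirror one 128-bit AltiVec register.
constexpr std::size_t kLanes = 4;

// "Wrapped" style: a small struct whose operations loop over the lanes.
// The compiler has to inline the call and unroll the loop to get flat code.
struct Lanes {
    float v[kLanes];

    void multiply_add(const float* a, const float* b) {
        for (std::size_t i = 0; i < kLanes; ++i)
            v[i] += a[i] * b[i];
    }

    float sum() const {
        float s = 0.0f;
        for (std::size_t i = 0; i < kLanes; ++i)
            s += v[i];
        return s;
    }
};

// "Unwrapped" style: a macro that writes out every lane by hand,
// so nothing depends on the inliner or the loop unroller.
#define MULTIPLY_ADD4(acc, a, b)         \
    do {                                 \
        (acc)[0] += (a)[0] * (b)[0];     \
        (acc)[1] += (a)[1] * (b)[1];     \
        (acc)[2] += (a)[2] * (b)[2];     \
        (acc)[3] += (a)[3] * (b)[3];     \
    } while (0)

// Dot product over whole groups of kLanes elements (any tail is ignored).
float dot_wrapped(const float* a, const float* b, std::size_t n) {
    Lanes acc = {{0.0f, 0.0f, 0.0f, 0.0f}};
    for (std::size_t i = 0; i + kLanes <= n; i += kLanes)
        acc.multiply_add(a + i, b + i);
    return acc.sum();
}

// Same computation, written with the macro instead of the struct.
float dot_unwrapped(const float* a, const float* b, std::size_t n) {
    float acc[kLanes] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (std::size_t i = 0; i + kLanes <= n; i += kLanes)
        MULTIPLY_ADD4(acc, a + i, b + i);
    return acc[0] + acc[1] + acc[2] + acc[3];
}
```

Both functions compute the same value; only the amount of work left to the optimizer differs, which is presumably why the wrapped variant is more sensitive to the compiler version and flags.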
Below are the results of my testing:
| GCC version | altivec-unwrapped (GFLOPS) | master, "wrapped" (GFLOPS) |
|---|---|---|
| 7.4.0 | 4.11 | 4.19 |
| 8.3.0 | 4.36 | 8.53 |
| 9.2.1 | 4.46 | 4.43 |
(All numbers are taken from the GFLOPS given in the final message.)
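As a quick check of the "almost twice as fast" figure, the ratios can be computed directly from the numbers above. This is just a small sketch; the `Row` struct and the helper names are made up for illustration.

```cpp
#include <cstdio>

// One row of the results above (GFLOPS as reported by the program).
struct Row {
    const char* gcc;
    double unwrapped;  // altivec-unwrapped branch
    double wrapped;    // master branch
};

constexpr Row kRows[] = {
    {"7.4.0", 4.11, 4.19},
    {"8.3.0", 4.36, 8.53},
    {"9.2.1", 4.46, 4.43},
};

// Ratio of wrapped to unwrapped throughput for one compiler.
constexpr double wrapped_over_unwrapped(const Row& r) {
    return r.wrapped / r.unwrapped;
}

// Print the ratio for every tested compiler.
void print_ratios() {
    for (const Row& r : kRows)
        std::printf("GCC %s: wrapped/unwrapped = %.2f\n",
                    r.gcc, wrapped_over_unwrapped(r));
}
```

On GCC 8 the wrapped build comes out at roughly 1.96 times the unwrapped one, while on GCC 7 and 9 the two builds are within about 2% of each other.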
All the resulting files have the same hash, so I don't think that the compiler broke anything during optimization.
What's happening here? What can I do to make the output of the other GCC versions fast?
Any help or direction from someone knowledgeable in C++ and/or GCC internals would be much appreciated.
Note:
The system is a 2 GHz PPC970MP, running Debian unstable with Linux version 5.2.0-2-powerpc64. The output of gcc -v for each version is attached.
The image I use is taken from here. I've also attached it so others could reproduce it easily.
So, it looks like the aggressive optimization flags and unroll settings I use cause GCC 9 to generate worse code. Changing them to something less aggressive seems to fix the issue.