Closed sharkautarch closed 3 weeks ago
Not sure it's that worth it to be honest, we are talking like a few cycles here. Don't think the microoptimization is really worth it.
Not sure it's that worth it to be honest, we are talking like a few cycles here. Don't think the microoptimization is really worth it.
Ok, I've simplified the PR: somehow a slight refactor of GetGeometry()
that leverages __builtin_unreachable()
allows the compiler to optimize the calls to GetGeometry()
Note that __builtin_unreachable()
acts similar to an assert fail when doing a debug build, but is treated as an unreachable branch when building release builds (https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#index-funreachable-traps)
EDIT: it seems like __builtin_unreachable()
wasn't entirely necessary, seems like simply reducing GetGeometry()
to just having a single return is sufficient for the compiler to optimize stuff better.
Are you building with optimizations? These two should really be equivelant...
Are you building with optimizations? These two should really be equivelant...
Yes, I've tested both versions w/ both compiler optimization and LTO on.
And yes, it is really weird that this small change actually makes a difference
Tho I've only tested what the disassembly looks like when compiling both versions w/ gcc (gcc version 14.1.1 20240522) so it could be possible that this is a missed optimization opportunity on gcc's end...
My speculation for why this makes a difference is either:
rect
(the if branches do assignments to the same rect
object). For bigger objects, doing assignments instead of constructions is less efficient, but Rect
is a trivial object and can be moved with a single vector instruction. I will say that I had originally thought that adding __attribute__((pure))
to GetGeometry()
would be the simplest way to get the compiler to optimize stuff better, but when I tested adding that attribute, there wasn't any difference in the compiler's output
@Joshua-Ashton
I've now realized that the reason why the refactor allowed gcc to optimize more, was that it simply caused gcc to decide to inline GetGeometry()
more, and whenever gcc inlines the function, it optimizes out all of the redundant loads/stores. (It was easier to see this when I looked at the gimple IR code that gets dumped when adding -fdump-tree-optimized
to the command line options for gcc)
closing this PR since this is entirely due to the behavior of gcc
but note that you can force GetGeometry()
to be inlined by adding __attribute__((always_inline))
to it (Rect __attribute__((always_inline)) GetGeometry() const { ... }
)
Note that this attribute is supported by both gcc and clang
Followup Pr to commit c5a5ba00b655caa1c10656dfe3ae09e281ec0544 From checking how
GetGeometry()
is compiled w/ gcc, it seems that the compiler does not know it is safe to automatically re-use the 128bitRect
returned fromGetGeometry()
So unfortunately, doing
Would result in fetching the 128bit rect, getting the width from it And then re-fetching the 128bit rect again to get the height from it