LIJI32 / SameBoy

Game Boy and Game Boy Color emulator written in C
https://sameboy.github.io/

Changed CFLAG from -O3 to -O2 #543

Closed MatteoRaso closed 1 year ago

MatteoRaso commented 1 year ago

According to the Gentoo Wiki, using -O3 is unlikely to give any performance boost and can actually degrade performance. From testing the code, it seems that using -O3 causes a longer installation time and increases the build size by ~100 MB.

LIJI32 commented 1 year ago

Where does this 100MB figure come from? An SDL build folder (tested on macOS with Clang), before applying this change, takes about 2MB. Can you share the output of du -ha build | sort -h before and after this change?

As for speed, I indeed see a 3.7% speed boost when "downgrading" from O3 to O2, but I'd like to investigate it further and pinpoint the specific optimization that causes it.

MatteoRaso commented 1 year ago

> Where does this 100MB figure come from?

It came from me playing around with the code on Linux, using the release CONF flag.

> Can you share the output of du -ha build | sort -h before and after this change?

Okay, I've added the full output as a file. The TL;DR is that I got 1.7 MB before, and 1.6 MB after changing the CFLAG to O2. output.txt

LIJI32 commented 1 year ago

Oh, that's 100KB, not 100MB. That makes much more sense now. I still want to investigate which specific optimization causes the slowdown, as speed is a much higher priority than a 6% size increase or build times.

MatteoRaso commented 1 year ago

> Oh, that's 100KB, not 100MB.

You're right, no idea how I made that mistake. Sorry about that.

orbea commented 1 year ago

> I still want to investigate which specific optimization causes the slowdown, as speed is a much higher priority than a 6% size increase or build times.

I suspect the performance is going to be platform, architecture and/or compiler dependent in this case, although I think that -O2 is generally a safer default than -O3 and certainly a more common one.

LIJI32 commented 1 year ago

I took a better look at this. A few points:

  1. The majority of the size increase introduced by O3 is caused by aggressive loop unrolling. In some cases it makes sense and improves speed without a notable size increase; in other cases the code compiles into an awful mess of nested ifs in non-timing-critical code, e.g.:

    [image]
  2. The speed differences are mostly caused by slightly slower warm-up periods introduced by the size increase, but on the longer runs the O3 builds were still faster than the O2 ones.

I just pushed the following changes to the code that, for the most part, will get the best of both worlds:

  1. The slow and uncommon paths of frequently called functions (such as GB_debugger_run and GB_apply_cheat) are no longer inlined, reducing code size and better utilizing cache.
  2. Not directly related, but -ffast-math was enabled, which greatly improved speed despite the very minor use of floating-point numbers in the core.
  3. Cases where the compiler unrolls loops too aggressively were specifically marked with a pragma to forbid loop unrolling. This made the O3 binary mostly comparable to the O2 file in size.
  4. Core and frontend sources are now given different compilation flags – O3 for the core and Oz for the rest. Since the frontend code is not timing critical, aggressively optimizing for size greatly reduces the file size and allows better cache utilization, which improves speed.
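
To illustrate points 1 and 3, here is a minimal sketch of the general technique, not the actual SameBoy code. The names `machine_t`, `apply_cheats_slow`, `read_byte`, and the `NOINLINE`, `UNLIKELY` and `NO_UNROLL` macros are all hypothetical and made up for this example. It assumes GCC 8+ or Clang; on other compilers the macros fall back to no-ops.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper macros: GCC and Clang accept these attributes and
   builtins; other compilers get harmless fallbacks. */
#if defined(__GNUC__) || defined(__clang__)
#define NOINLINE __attribute__((noinline))
#define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define NOINLINE
#define UNLIKELY(x) (x)
#endif

/* The pragma that forbids unrolling is compiler specific. Clang also
   defines __GNUC__, so it has to be checked first. */
#if defined(__clang__)
#define NO_UNROLL _Pragma("nounroll")
#elif defined(__GNUC__) && __GNUC__ >= 8
#define NO_UNROLL _Pragma("GCC unroll 1")
#else
#define NO_UNROLL
#endif

#define MAX_CHEATS 4

/* Hypothetical, heavily simplified machine state. */
typedef struct {
    uint8_t memory[16];
    size_t cheat_count; /* Zero in the common case */
    struct {
        uint8_t address;
        uint8_t value;
    } cheats[MAX_CHEATS];
} machine_t;

/* Slow, uncommon path: kept out of line so every caller of the hot
   function stays small, and not unrolled because it is not timing
   critical. */
static NOINLINE void apply_cheats_slow(const machine_t *m, uint8_t address, uint8_t *value)
{
    assert(m->cheat_count <= MAX_CHEATS);
    NO_UNROLL
    for (size_t i = 0; i < m->cheat_count; i++) {
        if (m->cheats[i].address == address) {
            *value = m->cheats[i].value;
        }
    }
}

/* Hot path: a single, predictable branch guards the call to the slow path. */
static inline uint8_t read_byte(const machine_t *m, uint8_t address)
{
    uint8_t value = m->memory[address % sizeof(m->memory)];
    if (UNLIKELY(m->cheat_count != 0)) {
        apply_cheats_slow(m, address, &value);
    }
    return value;
}
```

With no cheats registered, `read_byte` only pays for one well-predicted branch; the loop lives in a separate function that is emitted once, rather than being inlined (and possibly unrolled) at every call site.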

Overall, the file size was reduced by over 10% (a greater improvement than switching to O2), while speed actually improved slightly, by roughly 2%. Thanks for pointing this out and making me dig deeper!