bsnes-emu / bsnes

bsnes is a Super Nintendo (SNES) emulator focused on performance, features, and ease of use.
Other
1.64k stars 154 forks source link

Simplify sa1 division #263

Closed Alcaro closed 1 year ago

Alcaro commented 1 year ago

Removes the branch and the double modulo

I have verified that it gives the same results for all 2^32 pairs of inputs

Screwtapello commented 1 year ago

Thanks!

orbea commented 1 year ago

@Alcaro Is this about performance or accuracy? It seems slightly slower in Kirby's Dream Land 3 and no noticeable difference in Super Mario RPG.

Alcaro commented 1 year ago

Neither, it's about code simplicity. It has no accuracy impact, and should be too rarely called to affect performance (but if it does, it should be slightly faster). Whatever you're noticing is probably just noise; I've heard performance varies 2% by shuffling random functions' local variables.

orbea commented 1 year ago

Whatever you're noticing is probably just noise; I've heard performance varies 2% by shuffling random functions' local variables.

Worth considering, but upon repeating the test 3 times both with and without the change where the results seem consistent. This involves running the first 6000 frames in Kirby's Dream Land 3. I additionally closed all of my browsers to avoid background processes.

with:

called 5 times, best result: 99.0468s, avg: 100.0367s, total: 501.0837s
called 5 times, best result: 99.0599s, avg: 100.0134s, total: 500.0672s
called 5 times, best result: 99.0415s, avg:  99.0900s, total: 499.0503s

without:

called 5 times, best result: 97.0700s, avg:  98.0848s, total: 494.0242s
called 5 times, best result: 98.0230s, avg:  99.0886s, total: 499.0432s
called 5 times, best result: 98.0000s, avg:  98.0581s, total: 492.0909s
Alcaro commented 1 year ago

If I test the division functions in isolation, with no emulator nearby, the old function takes 21.5 seconds to run 2^32 times, and the new one takes 8.6 seconds.

My guess would be it's some strange aliasing based on which functions happen to share a handful of address bits. libco absolutely wrecks the branch predictor, it's not surprising that performance is erratic.

I expect that if you try multiple operating systems, compilers, optimization levels, and/or compiler versions, you'll find that it's a speed boost as often as a slowdown. If not, I'll agree that this patch is a slowdown; performance is more important than code cleanliness, even if there's no rational reason this patch should slow anything down.

(If I had infinite free time and motivation, I'd rewrite bsnes to use c++20 coroutines instead of libco and see how performance turns out.)