bsnes-emu / bsnes

bsnes is a Super Nintendo (SNES) emulator focused on performance, features, and ease of use.

bsnes burns 50% of most CPU cores doing nothing (OpenMP worker thread spin-wait) #254

Open nyanpasu64 opened 2 years ago

nyanpasu64 commented 2 years ago

When I run the AUR bsnes package (the window appearance is quite buggy, with a gray-on-black menubar and a transparent window background on XShm) and open any SNES game, bsnes loads my 12-thread CPU to nearly 40%. This is because OpenMP creates 11 worker threads (the CPU has 6 cores and 12 hardware threads), and when they don't have work, they run gomp_barrier_wait_end -> do_wait -> do_spin (according to profilers), burning nearly 50% of a CPU core each.

Interestingly, when I interrupt bsnes in gdb, do_wait actually calls futex_wait, which does not burn CPU. My theory is that each thread is woken up very often to perform a very small amount of computation, so it spends most of its time spin-waiting rather than either doing useful work or sleeping on a futex (not burning CPU).

I found a few workarounds. You can avoid initializing and using OpenMP by disabling Fast PPU (https://github.com/bsnes-emu/bsnes/blob/master/bsnes/sfc/ppu-fast/line.cpp#L7) and switching to a video backend other than XShm (https://github.com/bsnes-emu/bsnes/blob/master/ruby/video/xshm.cpp#L129) (OpenGL 2.0 works, though that seems to burn CPU and throttle to 60fps, and 3.0 works but I haven't tested much). Alternatively the OMP_WAIT_POLICY=passive environment variable makes OpenMP not spin-wait, in which case the OpenMP threads eat under 10% of a CPU core each. I have not tested if this increases the maximum FPS (by allowing the main thread to clock higher) or reduces it (by increasing thread wakeup delays).
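A minimal sketch of the OMP_WAIT_POLICY workaround as a small launcher, assuming a POSIX system. OpenMP runtimes typically read this variable when they initialize, so setting it from inside the already-running emulator may be too late; the sketch therefore sets it in a wrapper process and then replaces that process with the emulator. The names `prefer_passive_wait` and `launch_emulator` are hypothetical and not part of bsnes.

```cpp
#include <cstdlib>   // std::getenv
#include <stdlib.h>  // setenv (POSIX)
#include <unistd.h>  // execvp (POSIX)

// Ask the OpenMP runtime to sleep instead of spin while idle.
// overwrite = 0 keeps any OMP_WAIT_POLICY value the user already exported.
// Returns true if the environment call succeeded.
bool prefer_passive_wait() {
    return setenv("OMP_WAIT_POLICY", "passive", 0) == 0;
}

// Hypothetical launcher step: argv[0] is the emulator binary, followed by
// its arguments and a terminating null pointer.
// Returns only if execvp fails to start the program.
int launch_emulator(char* const argv[]) {
    prefer_passive_wait();
    execvp(argv[0], argv);
    return 127;  // conventional "command could not be run" exit status
}
```

A wrapper's `main` could simply return `launch_emulator(argv + 1)`, so that `./wrapper bsnes game.sfc` starts bsnes with the passive policy unless the user chose another one.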

(This also occurs in RetroArch's bsnes core.) Oddly, although ares's xshm driver also uses OpenMP (https://github.com/ares-emulator/ares/blob/3ca1f9ebb4ae3f472f7fba661746058f84126536/ruby/video/xshm.cpp#L129), ares running with the xshm driver does not exhibit anomalously high CPU usage, remaining at around 105% of a single core.

I do suspect creating one thread per CPU thread is counterproductive on modern 12-thread CPUs, and synchronization/memory contention costs mean that going from ~4 to 12 threads reduces performance (reducing maximum achievable FPS and/or increasing power draw at a given FPS target). I did not test this theory though.
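To illustrate the idea of capping the worker count, here is a rough sketch. It is not how bsnes sizes its thread pool; the cap of 4, `capped_worker_count`, and `render_lines` are hypothetical, and the loop body is a stand-in for per-scanline work. Whether a cap like this actually helps is exactly the untested part of the theory above.

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// Hypothetical cap; ~4 is only the untested guess from the discussion.
constexpr unsigned kMaxWorkers = 4;

// Use at most `cap` workers, however many hardware threads exist.
unsigned capped_worker_count(unsigned hardware_threads,
                             unsigned cap = kMaxWorkers) {
    // hardware_concurrency() may report 0 when the count is unknown.
    if (hardware_threads == 0) return 1;
    return std::min(hardware_threads, cap);
}

// Stand-in for rendering: each iteration writes only its own element.
std::vector<int> render_lines(int line_count) {
    std::vector<int> out(line_count);
    const int workers = static_cast<int>(
        capped_worker_count(std::thread::hardware_concurrency()));
    // If OpenMP is not enabled at compile time, the pragma is ignored
    // and the loop runs serially with the same result.
    #pragma omp parallel for num_threads(workers)
    for (int y = 0; y < line_count; ++y) {
        out[y] = y * 2;
    }
    return out;
}
```

The `num_threads` clause limits only this parallel region; the `OMP_NUM_THREADS` environment variable is the standard way to try the same experiment without recompiling.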

I'm not sure about the best approach for a solution (use futexes rather than spin-waiting, or use fewer cores in Fast PPU), or how to achieve it (by reconfiguring or moving away from OpenMP). Does bsnes properly separate out memory accessed by different threads into different cache lines, to avoid cache contention? I haven't checked, but if not, fixing that may increase performance as well.
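For reference, separating per-thread data into different cache lines usually looks something like the sketch below. This is a generic illustration, not bsnes code: `WorkerSlot` and `slot_stride` are hypothetical, and the 64-byte line size is an assumption that holds on typical x86-64 CPUs.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed cache line size; 64 bytes is typical on x86-64.
constexpr std::size_t kCacheLine = 64;

// Hypothetical per-worker state. alignas pads and aligns each slot so that
// two workers updating neighbouring slots never touch the same cache line.
struct alignas(kCacheLine) WorkerSlot {
    std::uint64_t pixels_written = 0;
};

static_assert(sizeof(WorkerSlot) == kCacheLine,
              "each slot should occupy exactly one cache line");

// Distance in bytes between two neighbouring slots in an array.
std::size_t slot_stride(const WorkerSlot* slots) {
    return static_cast<std::size_t>(
        reinterpret_cast<const char*>(&slots[1]) -
        reinterpret_cast<const char*>(&slots[0]));
}
```

Without the `alignas`, eight of these counters would fit in one 64-byte line, and every write by one thread would invalidate that line for the others (false sharing).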

Operating System: Arch Linux
KDE Plasma Version: 5.25.1
KDE Frameworks Version: 5.95.0
Qt Version: 5.15.5
Kernel Version: 5.18.5-zen1-1-zen (64-bit)
Graphics Platform: X11
Processors: 12 × AMD Ryzen 5 5600X 6-Core Processor
Memory: 15.5 GiB of RAM
Graphics Processor: NVIDIA GeForce GT 730/PCIe/SSE2
Manufacturer: Gigabyte Technology Co., Ltd.
Product Name: B550M DS3H

nyanpasu64 commented 2 years ago

Similar: https://randomascii.wordpress.com/2022/07/11/slower-memory-zeroing-through-parallelism/.

Sunspark-007 commented 1 year ago

This likely explains why on the Steam Deck, which is a 4-core/8-thread AMD APU, bsnes occupies 60% of the system CPU (OpenGL). Snes9x does not have this issue; on the same system it uses 10-15%. I am aware one is cycle-accurate and the other is not, but this isn't the reason for the usage difference.

On a 2-core PC I have running Windows, bsnes uses 30% CPU/20% GPU in OpenGL mode and 54% CPU/11% GPU in D3D mode. Ironically, Snes9x is the opposite on that system: it is more efficient in D3D mode (though the difference is not as big as it is with bsnes). So it's not only the number of cores the system has; the rendering path matters as well.

With both systems in OpenGL mode, the older one has half the number of cores and uses half the CPU resources as a percentage.

On a battery-powered device like a laptop or the Steam Deck, this drains the battery.

Perhaps a quick band-aid solution is to check how many cores a system has and, if it's more than 2, restrict the program to running on 2 cores only? It would be interesting to compare the usage as a percentage on a single-core system. Virtual machines let you specify the number of CPUs the VM has.

endrift commented 1 year ago

I noticed this last night. It's a ridiculously high level of CPU usage. When a game is paused, this is still running, consuming most of my CPUs, doing literally nothing. It's beyond wasteful.