After writing my GBA emulator, GBAC-, found here, I wanted to write a new one, but faster. One extra challenge I wanted to add was writing a hardware renderer.
To build, you need SDL, and optionally capstone[arm]. ImGui and glad are included in the project. You need a GPU that supports at least OpenGL 3.3 to run this project.
So I did exactly that. I rewrote it in C++, with a lot of optimizations. Some examples:
I spent quite a bit of time using the Intel VTune profiler to see which parts actually took a lot of time, and optimized those.
I used OpenGL (3.3 core) for this. Writing the hardware renderer was a lot of fun, and I could copy a lot of code from my old GBA emulator's PPU code. An extra special feature is affine background/sprite upscaling. Basically, instead of rendering at the GBA's resolution (240x160), I render at 4 times the pixel count (480x320, so twice the resolution in each dimension), and handle affine transforms on a sub-pixel level. This allows for much crisper affine sprites.
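To illustrate what "sub-pixel" means here, below is a minimal CPU-side sketch, not the actual shader code. It assumes the usual GBA affine setup (8.8 fixed-point matrix entries `pa`..`pd` and a reference point with 8 fractional bits); the names `AffineParams`, `SampleAffine` and `kScale` are made up for this example.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names. Matrix entries are 8.8 fixed point and the reference
// point has 8 fractional bits, as in the GBA's affine registers.
struct AffineParams {
    int16_t pa, pb, pc, pd;
    int32_t ref_x, ref_y;
};

struct Texel { int64_t x, y; };

constexpr int kScale = 2;  // 240x160 -> 480x320

// Division that rounds toward negative infinity (divisor must be positive).
inline int64_t FloorDiv(int64_t a, int64_t b) {
    const int64_t q = a / b;
    return (a % b < 0) ? q - 1 : q;
}

// Texel hit by the upscaled screen pixel (ux, uy).
inline Texel SampleAffine(const AffineParams& p, int ux, int uy) {
    assert(ux >= 0 && ux < 240 * kScale && uy >= 0 && uy < 160 * kScale);
    // Everything is kept in units of 1 / (256 * kScale) texels, so one
    // upscaled pixel advances the texture coordinate by pa / kScale texels
    // instead of a full pa step.
    const int64_t tx = int64_t(p.ref_x) * kScale + int64_t(p.pa) * ux + int64_t(p.pb) * uy;
    const int64_t ty = int64_t(p.ref_y) * kScale + int64_t(p.pc) * ux + int64_t(p.pd) * uy;
    return { FloorDiv(tx, 256 * kScale), FloorDiv(ty, 256 * kScale) };
}
```

With a shrinking transform (`pa = 512`), native rendering steps from texel 0 straight to texel 2, while the upscaled pixel in between lands on texel 1. Those recovered texels are where the extra crispness comes from.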
One of the biggest challenges was adding alpha blending. The way the GBA handles alpha blending (with different top/bottom layers that only blend with each other) does not map well to modern GPUs. What I did to solve this was basically render everything twice (only when blending is actually enabled, though). In one layer, I render everything; in the other, I render only the non-top layers. Then, in an extra pass, I blend those 2 layers together.
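The final blend pass can be sketched on the CPU as follows. This is a simplified illustration, not the actual shader: it uses the GBA's blend formula with `eva`/`evb` coefficients in 1/16 units (clamped at 16) and 5-bit color channels. The `is_top` flag is an assumption about how a pixel from a "top" layer might be marked, and the check that the lower pixel is a valid blend target is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

struct Pixel {
    uint8_t r, g, b;  // 5 bits per channel (0..31), as on the GBA
    bool is_top;      // assumed flag: pixel was drawn by a blend "top" layer
};

// min(31, (top * eva + bottom * evb) / 16), with coefficients clamped at 16.
inline uint8_t BlendChannel(uint8_t top, uint8_t bottom, unsigned eva, unsigned evb) {
    assert(top <= 31 && bottom <= 31);
    eva = std::min(eva, 16u);
    evb = std::min(evb, 16u);
    return static_cast<uint8_t>(std::min(31u, (top * eva + bottom * evb) >> 4));
}

// `full` comes from the pass that rendered everything, `below` from the pass
// that rendered only the non-top layers.
inline Pixel Compose(const Pixel& full, const Pixel& below, unsigned eva, unsigned evb) {
    if (!full.is_top) {
        return full;  // nothing to blend: the visible pixel is not a top layer
    }
    return {
        BlendChannel(full.r, below.r, eva, evb),
        BlendChannel(full.g, below.g, eva, evb),
        BlendChannel(full.b, below.b, eva, evb),
        false,
    };
}
```

Rendering the non-top layers separately is what makes the "second" color available at every pixel, which a single pass with standard GPU blending would not give you.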
There are a bunch of shaders involved:
This took a bit more GPU power, but I think most of the GPU power consumption comes from buffering the data.
In the hardware renderer I also added a lot of optimizations:
[OPTIONAL] Frameskipping. Frameskipping didn't affect emulator performance much; it mostly saves a lot of GPU usage.
The hardware renderer saves a lot of CPU usage, but these extra optimizations also saved a lot of CPU time, and gained me some extra performance in the process.
I recently decided that I wanted to try and write a cached interpreter. I spent some time thinking and writing the code, and after I got it working, I gained another 10-20% boost in most games.
The basic idea for the cached interpreter is that you want to skip the expensive memory reads and instruction decodes when you run code. Especially in the ROM and BIOS regions this is extremely useful, since those regions cannot be written to, and the code will always be the same.
In iWRAM this is a bit different. It's very common for games to run code in iWRAM. What I did was have an "instruction cache page table". Basically, the instruction caches are limited to 256 bytes, aligned to 256-byte page boundaries. Whenever a write to an iWRAM location happens, I look in the page table to see which addresses in that page have a cache block (just a vector with those addresses), and clear those blocks. Usually, I would expect there to be at most 1 or 2 blocks in that region (perhaps some more if there are a lot of short branches). Since the stack is also in iWRAM, and I don't want to unnecessarily check for blocks on every push/pop, I shrink the cacheable iWRAM region by a few hundred bytes. The number I chose here was pretty arbitrary, and I did not test different values of it. With the cache page tables, clearing the blocks did not take long anyway.
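A minimal sketch of such a page table could look like the code below. All names (`IWramCache`, `AddBlock`, `OnWrite`, ...) are hypothetical, and `kStackReserve` is an assumed value standing in for the "few hundred bytes" mentioned above. Offsets are relative to the start of the 32 KiB iWRAM region, and the block contents are left as a placeholder.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

constexpr uint32_t kIWramSize    = 0x8000;  // 32 KiB of iWRAM
constexpr uint32_t kPageSize     = 256;     // blocks never cross a page boundary
constexpr uint32_t kStackReserve = 0x200;   // assumed value, not from the emulator
constexpr uint32_t kCacheableEnd = kIWramSize - kStackReserve;

struct Block { std::vector<uint32_t> instructions; };  // placeholder contents

class IWramCache {
public:
    static bool Cacheable(uint32_t offset) { return offset < kCacheableEnd; }

    // Registers a block starting at `start`; refused in the stack area.
    bool AddBlock(uint32_t start) {
        if (!Cacheable(start)) return false;
        if (blocks_.emplace(start, Block{}).second) {
            pages_[start / kPageSize].push_back(start);
        }
        return true;
    }

    bool HasBlock(uint32_t start) const { return blocks_.count(start) != 0; }

    // Called on every iWRAM write: drop all blocks in the written page.
    void OnWrite(uint32_t offset) {
        assert(offset < kIWramSize);
        if (!Cacheable(offset)) return;  // stack pushes/pops exit right here
        std::vector<uint32_t>& page = pages_[offset / kPageSize];
        for (uint32_t start : page) blocks_.erase(start);
        page.clear();
    }

private:
    std::unordered_map<uint32_t, Block> blocks_;
    std::array<std::vector<uint32_t>, kIWramSize / kPageSize> pages_;
};
```

Because a block can never cross a page boundary, a write only has to look at the single page it lands in, and for most writes that page's vector is empty.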
Basically, the run loop is no longer `fetch -> decode -> execute`, but it's now:
pc
`fetch -> decode` the next instruction. Store it in the current cache block (pointer to call and instruction).
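As a rough illustration of the cached run loop, here is a toy version with a made-up three-instruction ISA. It is not the emulator's actual code. Every name (`Cpu`, `Step`, `CachedInstr`, the opcodes) is an assumption for this example. Only the idea of storing a pointer to the handler together with the instruction, and replaying those entries on a cache hit, comes from the description above.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

struct Cpu;
using Handler = void (*)(Cpu&, uint32_t);

struct CachedInstr { Handler call; uint32_t instruction; };

struct Cpu {
    std::vector<uint32_t> rom;  // word-addressed toy memory
    uint32_t pc = 0;
    uint32_t acc = 0;
    bool halted = false;
    unsigned decodes = 0;       // counts how often we had to decode
    std::unordered_map<uint32_t, std::vector<CachedInstr>> cache;
};

// Toy ISA, opcode in the top byte: 0 = add immediate, 1 = branch to
// bits 8..23 while acc < bits 0..7, anything else = halt.
void OpAdd(Cpu& c, uint32_t i) { c.acc += i & 0xFFFFFF; c.pc += 1; }
void OpBranchLess(Cpu& c, uint32_t i) {
    c.pc = (c.acc < (i & 0xFF)) ? ((i >> 8) & 0xFFFF) : c.pc + 1;
}
void OpHalt(Cpu& c, uint32_t) { c.halted = true; }

Handler Decode(Cpu& c, uint32_t instr) {
    ++c.decodes;
    switch (instr >> 24) {
        case 0:  return OpAdd;
        case 1:  return OpBranchLess;
        default: return OpHalt;
    }
}

void Step(Cpu& c) {
    assert(!c.halted);
    auto it = c.cache.find(c.pc);
    if (it != c.cache.end()) {
        // Cache hit: no fetch, no decode, just call the stored handlers.
        for (const CachedInstr& ci : it->second) ci.call(c, ci.instruction);
        return;
    }
    // Cache miss: interpret normally, recording each decoded instruction
    // until something that can change the control flow ends the block.
    const uint32_t start = c.pc;
    std::vector<CachedInstr> block;
    for (;;) {
        const uint32_t instr = c.rom.at(c.pc);  // fetch
        const Handler call = Decode(c, instr);  // decode
        block.push_back({call, instr});
        call(c, instr);                         // execute
        if (call != OpAdd) break;
    }
    c.cache.emplace(start, std::move(block));
}
```

For a small loop such as `add 1; add 2; branch to 0 while acc < 10; halt`, this decodes each instruction once and replays the cached block on every later iteration, instead of decoding all three loop instructions on every pass.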
I have tried a bunch of games, most worked fine, some with a few graphical glitches. As for accuracy: I pass all AGS tests, except the ones requiring very accurate timings (those are: the last 3 memory tests, testing waitstates and the prefetch buffer, and the timer tests, requiring cycle accuracy).
Default controls are:
GBA | Keyboard | Gamepad |
---|---|---|
A | Z/C | A / X |
B | X/V | B / Y |
Up | Up | Dpad Up |
Down | Down | Dpad Down |
Left | Left | Dpad Left |
Right | Right | Dpad Right |
Start | A | Start |
Select | S | Select |
L | Q | L |
R | R | R |
If you don't like the controls/want to change them, you can edit the `input.map` file.
I am really proud of the performance of my newly rewritten emulator. On my fairly old system (Intel i7 2600 and a GTX 670), Pokemon Emerald (notoriously slow for using waitloops) ran at about 1000 fps in the intro sequence and about 900 fps in game (as a reference: this is more than mGBA!). Some games gave me insane performance though: the GTA menu screen spiked over 10k fps and the Zelda menu gave me framerates of about 5k fps. The highest framerate I've seen is in the Doom menu screen, at 17k fps.
In the above screenshot, you can see the emulator at its full potential, with alpha blending and everything.
The UI and the debugger are written in ImGui. I tried to keep them as generic as possible, so that I can re-use them for other emulator projects I might do. On release builds, not all the console commands work, but the memory viewer still should, and so should the overlay and the register viewer. The decompiler needs `capstone.dll`. If you build without capstone installed in your package manager, it won't try to link it, and should just say "Decompiling unavailable" in the window.
I have included the replacement BIOS that Fleroviux and I made for the GBA (repo is here). If you want to use a different file, you will have to build the project yourself, and either uncomment the `gba->LoadBIOS("path");` call and add the right path to your BIOS file, or change the `BIOS_FILE` macro and uncomment that call. The replacement BIOS should have all the functionality of the official one that is used in games.