ITotalJustice closed 2 years ago
I have implemented the above:
On dmg/gbc, I render to a u32[160], then basically memcpy the pixels back over.
On gba, this proved tricky because there's blending, and I cannot blend pixels in an unknown format. Because of this, I render everything to a u16[240] like normal, then after merge(), I call the colour callback 240 times, once for each colour (very slow!).
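A minimal sketch of what the per-pixel approach on gba amounts to. The names `ColourCallback`, `bgr555_to_xrgb8888` and `convert_scanline_per_pixel` are made up for illustration and are not the emulator's actual API; the only things taken from above are the u16[240] scanline and one callback call per pixel.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical frontend hook: converts one BGR555 colour to the frontend's format.
using ColourCallback = std::uint32_t (*)(void* user, std::uint16_t bgr555);

// Example conversion a frontend might supply: BGR555 -> XRGB8888.
inline std::uint32_t bgr555_to_xrgb8888(void* /*user*/, std::uint16_t c)
{
    const std::uint32_t r = c & 0x1F;
    const std::uint32_t g = (c >> 5) & 0x1F;
    const std::uint32_t b = (c >> 10) & 0x1F;
    // Expand 5 bits to 8 by replicating the top bits into the low bits.
    const std::uint32_t r8 = (r << 3) | (r >> 2);
    const std::uint32_t g8 = (g << 3) | (g >> 2);
    const std::uint32_t b8 = (b << 3) | (b >> 2);
    return (r8 << 16) | (g8 << 8) | b8;
}

// Blending has already happened in BGR555; every pixel then goes through
// the callback individually, so this costs 240 indirect calls per scanline.
inline void convert_scanline_per_pixel(const std::array<std::uint16_t, 240>& line,
                                       std::uint32_t* out,
                                       ColourCallback cb, void* user)
{
    assert(out != nullptr && cb != nullptr);
    for (std::size_t x = 0; x < line.size(); x++) {
        out[x] = cb(user, line[x]);
    }
}
```

The indirect call inside the loop is the part that hurts, since the compiler generally cannot inline it.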
The gba side is very inefficient. There are a few ways to improve this:
1. Pass the entire pixel array to the callback, so only 1 function is called. Still slow if the colour transformation being done is expensive.
2. Always work on u32 pixels and support a select few pixel formats.
Not a fan of the second option: it means writing even more code in the core that shouldn't be there. The core should just be emulating the gba, not doing fancy colour stuff.
First option seems better.
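A rough sketch of the first option, where the core hands the whole blended scanline to the frontend in a single call. Everything here (`ScanlineCallback`, `FrameBuffer`, `on_scanline`, `present_scanline`, and the choice of RGB565 as the target) is hypothetical and only illustrates the shape of the interface.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical hook: the frontend receives one full BGR555 scanline per call.
using ScanlineCallback = void (*)(void* user, const std::uint16_t* bgr555,
                                  std::size_t count, std::size_t y);

// Hypothetical frontend-owned target buffer; pitch is measured in pixels.
struct FrameBuffer {
    std::uint16_t* pixels;
    std::size_t pitch;
};

inline std::uint16_t bgr555_to_rgb565(std::uint16_t c)
{
    const unsigned r = c & 0x1F;
    const unsigned g = (c >> 5) & 0x1F;
    const unsigned b = (c >> 10) & 0x1F;
    const unsigned g6 = (g << 1) | (g >> 4); // widen green from 5 to 6 bits
    return static_cast<std::uint16_t>((r << 11) | (g6 << 5) | b);
}

// Frontend side: the conversion loop lives here, so it can be inlined
// and optimised as a whole instead of being called once per pixel.
inline void on_scanline(void* user, const std::uint16_t* line,
                        std::size_t count, std::size_t y)
{
    auto* fb = static_cast<FrameBuffer*>(user);
    assert(fb != nullptr && count <= fb->pitch);
    std::uint16_t* dst = fb->pixels + y * fb->pitch;
    for (std::size_t i = 0; i < count; i++) {
        dst[i] = bgr555_to_rgb565(line[i]);
    }
}

// Core side: one callback per scanline instead of 240.
inline void present_scanline(const std::uint16_t (&line)[240], std::size_t y,
                             ScanlineCallback cb, void* user)
{
    assert(cb != nullptr);
    cb(user, line, 240, y);
}
```

The core stays format-agnostic this way; the cost of an expensive colour transformation is still paid per pixel, just without the per-pixel call overhead.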
This occurred to me when I merged my gb emulator with this gba one. CPU usage was about 12%, whereas the C version used 7-8%. That's roughly a 50% slowdown in GB games!
In the end I found the culprit: it was the SDL texture. Currently I create the texture with the bgr555 format. SDL seems to do some software-side conversion of the format when locking / updating the texture; rendering in itself is fine.
In the C version, I created the texture based on the pixel format of the window.
What I should do is have the frontend set the pixel buffer and choose one of the following bpp: 8, 16 or 32. Then have a callback that is called before rendering to convert pram to the correct format. This also allows the frontend to make any changes to the colours.
On the GB side, I can easily do pram caching to reduce the number of calls to the callback.
This can be done on the gba side too, but may be slower as there will be slight overhead in writes. Also, mode3 doesn't use pram, it uses vram directly: https://github.com/ITotalJustice/notorious_beeg/blob/7092c332687d0d1864dbc3c4ffccb0366b86766b/src/core/ppu/render.cpp#L808
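One way the caching could look, as a sketch only: palette writes mark an entry dirty, and the callback runs only for dirty entries before rendering. `PaletteCache` and its members are hypothetical names, and the 64 entries assume the GBC layout of 8 background plus 8 object palettes with 4 colours each.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical frontend hook: converts one BGR555 colour to the frontend's format.
using ColourCallback = std::uint32_t (*)(void* user, std::uint16_t bgr555);

struct PaletteCache {
    static constexpr std::size_t kEntries = 64; // assumed GBC palette size

    std::array<std::uint16_t, kEntries> raw{};       // colours as the game wrote them
    std::array<std::uint32_t, kEntries> converted{}; // colours in the frontend's format
    std::array<bool, kEntries> dirty{};
    bool any_dirty = false;

    // Called on every palette write; only marks the entry if the value changed.
    void write(std::size_t index, std::uint16_t colour)
    {
        assert(index < kEntries);
        if (raw[index] != colour) {
            raw[index] = colour;
            dirty[index] = true;
            any_dirty = true;
        }
    }

    // Force a full rebuild, e.g. at startup or when the callback changes.
    void invalidate_all()
    {
        dirty.fill(true);
        any_dirty = true;
    }

    // Called before rendering; returns how many callback calls were made.
    std::size_t refresh(ColourCallback cb, void* user)
    {
        assert(cb != nullptr);
        if (!any_dirty) {
            return 0;
        }
        std::size_t calls = 0;
        for (std::size_t i = 0; i < kEntries; i++) {
            if (dirty[i]) {
                converted[i] = cb(user, raw[i]);
                dirty[i] = false;
                calls++;
            }
        }
        any_dirty = false;
        return calls;
    }
};
```

The renderer would then read from `converted` directly, so a frame with no palette writes makes no callback calls at all.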
The implementation for rendering will now use templates, which accept either u8, u16 or u32 based on the bpp. This will bloat the binary size quite a bit (especially on the gba side of things).
Maybe a better approach is to render to a 32-bit array first, then loop over the array, copying the pixels to the actual pixel array. (This is what I currently do in my gb emulator.)
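To illustrate where the bloat comes from, here is a simplified sketch (C++17) of a renderer templated on the pixel type, with a runtime switch on the bpp. `render_scanline` and `render_scanline_bpp` are hypothetical and far simpler than the real render functions; the point is that every templated function gets compiled three times.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Simplified stand-in for a render function: looks up palette indices and
// writes pixels of type Pixel. The palette is assumed to be pre-converted.
template <typename Pixel>
void render_scanline(const std::uint8_t* indices, const Pixel* palette,
                     Pixel* out, std::size_t width)
{
    static_assert(std::is_same_v<Pixel, std::uint8_t> ||
                  std::is_same_v<Pixel, std::uint16_t> ||
                  std::is_same_v<Pixel, std::uint32_t>,
                  "bpp must be 8, 16 or 32");
    assert(indices != nullptr && palette != nullptr && out != nullptr);
    for (std::size_t x = 0; x < width; x++) {
        out[x] = palette[indices[x]];
    }
}

// Runtime dispatch on the bpp chosen by the frontend. Each case instantiates
// a separate copy of the renderer, which is what grows the binary.
inline bool render_scanline_bpp(int bpp, const std::uint8_t* indices,
                                const void* palette, void* out, std::size_t width)
{
    switch (bpp) {
    case 8:
        render_scanline(indices, static_cast<const std::uint8_t*>(palette),
                        static_cast<std::uint8_t*>(out), width);
        return true;
    case 16:
        render_scanline(indices, static_cast<const std::uint16_t*>(palette),
                        static_cast<std::uint16_t*>(out), width);
        return true;
    case 32:
        render_scanline(indices, static_cast<const std::uint32_t*>(palette),
                        static_cast<std::uint32_t*>(out), width);
        return true;
    default:
        return false; // unsupported bpp
    }
}
```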
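A sketch of that alternative, assuming the u32 scanline already holds colours in the frontend's format (just widened to 32 bits), so that narrowing to 8 or 16 bpp is a plain truncation. `copy_scanline` is a hypothetical helper, not existing code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Copies one rendered scanline from the intermediate u32 array into the
// frontend's pixel buffer. Only this small function depends on the bpp,
// so the render code itself does not need to be templated.
inline bool copy_scanline(const std::uint32_t* src, void* dst,
                          std::size_t width, int bpp)
{
    assert(src != nullptr && dst != nullptr);
    switch (bpp) {
    case 32:
        std::memcpy(dst, src, width * sizeof(std::uint32_t));
        return true;
    case 16: {
        auto* d = static_cast<std::uint16_t*>(dst);
        for (std::size_t x = 0; x < width; x++) {
            d[x] = static_cast<std::uint16_t>(src[x]); // keep the low 16 bits
        }
        return true;
    }
    case 8: {
        auto* d = static_cast<std::uint8_t*>(dst);
        for (std::size_t x = 0; x < width; x++) {
            d[x] = static_cast<std::uint8_t>(src[x]); // keep the low 8 bits
        }
        return true;
    }
    default:
        return false; // unsupported bpp
    }
}
```

This trades one extra pass over each scanline for a single copy of the render code.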