ITotalJustice / notorious_beeg

gba emulator written in c++23
https://notorious-beeg.netlify.app/
GNU General Public License v3.0

optimise ppu rendering #46

Open ITotalJustice opened 2 years ago

ITotalJustice commented 2 years ago

in emerald: without rendering, 1k fps; with rendering, 450-460 fps

that's just over half of my fps gone, just from rendering 160 times a frame (once per visible scanline).

  1. only calculate window tables on window values changing
  2. template rendering to skip blending / windowing checks if disabled
  3. cache decoded screen entries (likely not worth it)
  4. cache decoded oam entries

1: this should give some speed up, but not by much (see the sketch after this list).

2: while this will speed up scenes that don't use windowing and blending, it still isn't ideal, because many scenes do use both, and those scenes will still be super slow.

3: decoding is very fast already.

4: this will be a decent speed up.
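
to make 1 concrete, here's a minimal sketch of the kind of dirty-flag approach this could use; none of these names (Ppu, win_table, write_window_reg, update_win_table) are from the actual emulator, it's just an illustration:

```cpp
#include <array>
#include <cstdint>

// hypothetical sketch: rebuild the per-pixel window table only when a window
// register write actually changes a value, instead of recalculating it on
// every scanline.
struct Ppu
{
    std::array<std::uint8_t, 240> win_table{}; // which window region each x falls in
    bool win_dirty{true};

    // called from the io write handler for WIN0H / WIN1H / WININ / WINOUT
    void write_window_reg(std::uint16_t& reg, std::uint16_t value)
    {
        if (reg != value)
        {
            reg = value;
            win_dirty = true; // only mark dirty on a real change
        }
    }

    void render_scanline()
    {
        if (win_dirty)
        {
            update_win_table(); // rebuilt once, instead of 160 times per frame
            win_dirty = false;
        }
        // note: the vertical bounds (WIN0V / WIN1V) still need a cheap per-line
        // check to see whether each window is active on this scanline.
        // ... render bg / obj for this line using win_table
    }

    void update_win_table()
    {
        // fill win_table from the horizontal window registers (omitted)
    }
};
```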

ITotalJustice commented 2 years ago

another big speedup can come from rendering the bgs from highest priority to lowest.

[image]

as seen here, all 4 bgs are rendered in their entirety. some pixels in bg2 are transparent so that bg3 shows through them, yet every single tile of bg3 still has to be fetched and then checked for transparency.

i could instead render bg2 first, then when rendering bg3, check at the top of the loop whether pixel[x] != transparent, and if so, continue.

another example ([image]): about half of the bg0 and bg1 rendering can be skipped.
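
a minimal sketch of that priority-first loop, assuming a hypothetical per-scanline Pixel buffer (every name here is invented for illustration, not taken from the emulator):

```cpp
#include <array>
#include <cstdint>

// render the enabled bgs from highest priority to lowest, and skip any pixel
// that a higher-priority bg has already filled with an opaque colour.
struct Pixel
{
    std::uint16_t colour{}; // bgr555
    bool opaque{};
};

void render_bg(std::array<Pixel, 240>& line, [[maybe_unused]] int bg)
{
    for (int x = 0; x < 240; x++)
    {
        if (line[x].opaque)
        {
            continue; // already covered, so the tile fetch + decode is skipped entirely
        }
        // ... fetch and decode the tile for (bg, x), write line[x] if not transparent
    }
}

void render_scanline(std::array<Pixel, 240>& line, const std::array<int, 4>& bg_order)
{
    // bg_order holds the enabled bg indices sorted by priority, highest first
    for (const auto bg : bg_order)
    {
        render_bg(line, bg);
    }
}
```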

this introduces a problem for blending, however: what happens if a pixel with a higher priority wants to blend with the pixel below it? well, that's simple:

if (pixel[x].opaque()) {
  // an opaque higher-priority pixel normally means this layer can be skipped,
  // unless that pixel is set to blend with this layer, in which case this
  // layer's pixel is still needed as the blend target.
  if (!pixel[x].can_blend_with_layer(layer_num)) {
    continue;
  }
}

of course, a layer can be enabled to blend with multiple layers, although it can only blend with 1 at a time, so an extra check is needed to see if that pixel has already been blended.
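
a minimal sketch of what that extra check could look like, assuming the written pixel carries some blend state (again, every name here is invented for illustration):

```cpp
#include <cstdint>

// each written pixel remembers which lower layers it may blend with and
// whether it has already been blended, so only the first eligible lower
// layer gets to blend with it.
struct Pixel
{
    std::uint16_t colour{};       // bgr555
    bool opaque{};
    std::uint8_t blend_targets{}; // bitmask of lower layers this pixel is allowed to blend with
    bool already_blended{};       // a pixel can only be blended once
};

// decide, at the top of the lower-priority layer's loop, whether this pixel
// can be skipped outright.
constexpr bool skip_pixel(const Pixel& p, int layer_num)
{
    if (!p.opaque)
    {
        return false; // nothing written yet, this layer has to render the pixel
    }
    if ((p.blend_targets >> layer_num) & 1)
    {
        return p.already_blended; // only the first matching lower layer blends in
    }
    return true; // fully covered by an opaque pixel that doesn't blend with this layer
}
```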


by doing all of this, i get rid of that merge() function i have (which is very slow), along with lots of needless tile fetching and decoding. it also means i can work on 1 pixel buffer, rather than 5 (1 obj, 4 bg) that then have to be merged. and i can do the blending within the render function itself, as one of:

  1. the layer doesn't blend
  2. the layer blends to white / black (self contained)
  3. the layer blends with the pixel above it

the good thing is all of these can be templated like so

enum class Blend
{
    None, // no blending
    Alpha, // blend 2 layers
    White, // fade to white
    Black, // fade to black
};

i would need to test if templating is worth it for this, but i predict that it would be.
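
a rough sketch of how the templated blend could look with if constexpr, repeating the Blend enum above so the snippet stands alone; the exact coefficient handling and rounding here are not verified against hardware:

```cpp
#include <algorithm>
#include <cstdint>

enum class Blend
{
    None,  // no blending
    Alpha, // blend 2 layers
    White, // fade to white
    Black, // fade to black
};

// blend one 5-bit colour channel. eva/evb/evy are the usual 0..16 coefficients
// from BLDALPHA / BLDY. each Blend value gets its own instantiation, so the
// None path compiles down to a plain copy with no runtime branch.
template<Blend blend>
constexpr std::uint8_t blend_channel(std::uint8_t top, std::uint8_t bottom,
                                     int eva, int evb, int evy)
{
    if constexpr (blend == Blend::None)
    {
        return top;
    }
    else if constexpr (blend == Blend::Alpha)
    {
        return static_cast<std::uint8_t>(std::min(31, (top * eva + bottom * evb) / 16));
    }
    else if constexpr (blend == Blend::White)
    {
        return static_cast<std::uint8_t>(top + ((31 - top) * evy) / 16);
    }
    else // Blend::Black
    {
        return static_cast<std::uint8_t>(top - (top * evy) / 16);
    }
}
```

the render loop would then be instantiated once per Blend value, with the mode picked once outside the per-pixel loop.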


i would like to optimise for very common cases like these examples: [image] [image]

where every bg is enabled, but bgX (bg0 in example 1, bg1 in example 2) is entirely empty, yet i still have to fetch and decode 240 tiles! i think the only way to solve this is tile caching.
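
a very rough sketch of what a decoded-tile cache could look like (4bpp only, all names hypothetical), invalidated from the vram write handler:

```cpp
#include <array>
#include <bitset>
#include <cstdint>

// cache of 8x8 tiles expanded to one palette index per pixel, keyed by the
// tile's offset into bg char vram. an entry is invalidated whenever that part
// of vram is written, and re-decoded lazily on the next fetch.
struct TileCache
{
    static constexpr std::size_t MAX_TILES = 64 * 1024 / 32; // 64KiB of char data, 32 bytes per 4bpp tile

    std::array<std::array<std::uint8_t, 64>, MAX_TILES> decoded{};
    std::bitset<MAX_TILES> valid{};

    // call this from the vram write handler
    void invalidate(std::uint32_t vram_offset)
    {
        if (const auto idx = vram_offset / 32; idx < MAX_TILES)
        {
            valid.reset(idx);
        }
    }

    const std::array<std::uint8_t, 64>& get(std::size_t tile_index, const std::uint8_t* vram)
    {
        if (!valid.test(tile_index))
        {
            decode_4bpp(vram + tile_index * 32, decoded[tile_index].data());
            valid.set(tile_index);
        }
        return decoded[tile_index];
    }

    static void decode_4bpp(const std::uint8_t* src, std::uint8_t* dst)
    {
        for (int i = 0; i < 32; i++)
        {
            dst[i * 2 + 0] = src[i] & 0xF; // low nibble = left pixel
            dst[i * 2 + 1] = src[i] >> 4;  // high nibble = right pixel
        }
    }
};
```

this wouldn't remove the screen entry fetches for an empty bg, but it would avoid re-decoding the same tiles line after line.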