Baekalfen / PyBoy

Game Boy emulator written in Python

Hardware-accelerated window scaling in SDL #4

Closed. Baekalfen closed this issue 5 years ago.

Baekalfen commented 7 years ago

The current implementation of Window_SDL2.py does not work well with scaling, as it has a huge impact on framerate.

krs013 commented 5 years ago

Hi, I've enjoyed looking at this project and reading your report, and I've looked through some of the code that would be involved in fixing this issue. A real solution would probably involve sdl2.SDL_BLITSCALED or a better renderer via sdl2.ext.Renderer, but in case I don't get around to doing that, you can make the ScaledFrameBuffer usable by replacing its update(self) method with the following line:

self._array[:,:] = np.repeat(np.repeat(self._cache, self._scaleFactor, 1), self._scaleFactor, 0)

This gives me playable (close to real time, if not a bit faster) frames at 4x scaling, but obviously it's super wasteful because it allocates and copies two intermediate ndarrays every frame. I thought I'd just leave that here rather than opening a PR, but if I get a chance to dive into SDL and do it properly, I'll do things right.
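For reference, here is the same nearest-neighbour trick as a self-contained snippet outside of PyBoy (the 144x160 buffer shape and the names are placeholders, not the actual ScaledFrameBuffer attributes):

import numpy as np

SCALE = 4
# Hypothetical 160x144 Game Boy frame stored as (height, width) = (144, 160)
frame = np.random.randint(0, 4, size=(144, 160)).astype(np.uint8)

# Nearest-neighbour upscale: duplicate every pixel SCALE times along the
# columns (axis 1), then along the rows (axis 0)
scaled = np.repeat(np.repeat(frame, SCALE, axis=1), SCALE, axis=0)
assert scaled.shape == (144 * SCALE, 160 * SCALE)

As noted above, this allocates and copies two intermediate arrays per frame, so it is a stop-gap rather than a proper fix.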

Baekalfen commented 5 years ago

Thanks for your interest in the project! Do tell if you need help with something.

I can see your code works substantially faster than the current one, so I'll push it right away.

krs013 commented 5 years ago

Thanks! I've been looking through the code for the last couple of days and I'm just getting to the point where I think I understand things. I'm trying out an implementation based on sdl2.ext.Renderer, which maintains a double buffer that can be drawn to during the scanline operation instead of requiring it all to be done during renderScreen, and it then handles the scaling very smoothly in hardware. It's not working yet, but depending on whether it's faster or not, I hope to have something for a PR in the next few days!

If it's not better, then it may be necessary to use SDL's built-in sprite and texture management, which would perhaps be a less faithful emulation but should be very fast. Either way, I've had a lot of fun and learned a good bit about the Game Boy's architecture, and that's been great!

Baekalfen commented 5 years ago

That sounds great. With the scanline and renderScreen, just do it the way you feel comfortable with, and then we can have a look at it afterwards. The reason for recording scanline parameters and doing it all in renderScreen was to isolate the rendering, to make a GPU or multithreaded version easier in the future.
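Roughly, the idea looks like this (a hedged sketch with made-up names, not the actual PyBoy classes):

class LCDStub:
    def __init__(self):
        # one (scx, scy, wx, wy) tuple per visible scanline
        self.scanline_parameters = [(0, 0, 0, 0)] * 144

    def scanline(self, y, scx, scy, wx, wy):
        # during emulation we only record the registers for line y
        self.scanline_parameters[y] = (scx, scy, wx, wy)

    def render_screen(self, draw_line):
        # all the pixel work happens here, isolated from the CPU loop,
        # which is what should make a GPU or multithreaded version easier
        for y, (scx, scy, wx, wy) in enumerate(self.scanline_parameters):
            draw_line(y, scx, scy, wx, wy)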

No matter if it's fast or not, I would like to see what you come up with :)

The sprite and texture management sounds interesting. If it's fast enough, I'm sure we can get it to be pixel-perfect, as the Game Boy doesn't have a lot of complicated graphics.

krs013 commented 5 years ago

Well, I put it together to the point that I have the tile background (though it still needs some debugging) but no sprites (and the window is misaligned!). It's shockingly slow; it gets about 3 frames per second, haha. So that's not going to work, at least not naively. I didn't bother with any caching or anything though, so there's still some potential, but it's not super promising. I made a fork and pushed it if you want to look, though :P (Oh, and you can resize it, if you're patient enough to wait for the window to respond!)

Also, I realized that SDL's sprite and texture methods probably won't be as handy as I thought, as they would have to deal with single rows of pixels to accommodate the changes that might happen to the viewport between lines. So while it may still be possible to write, it won't be as good as I thought!

Baekalfen commented 5 years ago

I took a look at it. You've changed a lot of things, so it is hard to point at the exact issue. But I'm concerned by the number of calls to draw_point. There is a rather large penalty for calling these functions, which leave Python to perform some C API work. That might be why it doesn't run that well. And also because you removed all of the pretty essential caching ;)

Ah, yes. You are right about the viewport thing.

My plan for a while has been to make an OpenGL (maybe OpenCL) version, which would do all the rendering. But I haven't really found any good examples of pixel-perfect rendering of 2D surfaces.

Baekalfen commented 5 years ago

Or maybe it's not so much that draw_point leaves Python, but that it might be doing all sorts of things that we do not need. We essentially need direct access to a buffer. Does draw_point, for example, interpolate if given a float?

krs013 commented 5 years ago

Yeah, from just reading it I didn't really understand everything that was going on, so I tried to make it work based on the parts I already knew and what I got from the report and the Pan Docs. Along the way I figured out how the tile cache worked and how it was getting refreshed, and I see the need for that and would probably use something similar on the second round. If I were doing it from scratch, I would maybe build the cache immediately as the RAM is written to, since that may be as close as software can get to how the hardware actually behaves (though I could be wrong about that too -- is the VRAM just a section of memory, or is it part of the LCD controller?).

draw_point spends some time in Python and then gets to C, and I'm not really sure how much goes on behind the scenes here, but there is a good chance that it's just not an effective way to draw. I wrote this to see how fast it would be, and was hoping for faster, but I can't say I'm too surprised. I do think it would be a bit faster if I got the alignment and access right (iterating through 8-pixel chunks of x instead of looping and recalculating the tile index for each bit), but that may not be enough to save it, even with a tile cache.
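For context, decoding one 8-pixel row of a tile only needs its two bytes. This is a plain-Python sketch of the standard 2bpp format, not code from the repo:

def decode_tile_row(low_byte, high_byte):
    # Game Boy 2bpp: for pixel x (0 = leftmost), bit (7 - x) of the first
    # byte is the low bit of the colour index and the same bit of the
    # second byte is the high bit
    return [((high_byte >> (7 - x)) & 1) << 1 | ((low_byte >> (7 - x)) & 1)
            for x in range(8)]

print(decode_tile_row(0x3C, 0x7E))  # [0, 2, 3, 3, 3, 3, 2, 0]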

Anyway, the Renderer has a logical size and it expects operations on it to be with integers, and then I believe it immediately maps that operation to its target buffer. This is really nice when you have a fixed-size window to render but want the output to be any size, but in this case there is no physical memory that represents the logical-sized screen it's rendering. It is probably possible to manually create a Texture object that represents the Game Boy screen and is then copied/blitted to the window with scaling, and that may be our best option. Another possibility is to represent the tile data table as a Texture and render chunks of it line by line, but it can't be done in 2D since the alignment might change between lines. I like this latter approach because it is more faithful to how the Game Boy works, though, and it might be fast enough.
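Something along these lines is what I have in mind for the Texture option (a rough sketch only; the window size, the frame_argb buffer, and the present() helper are made up for illustration):

import ctypes
import sdl2

sdl2.SDL_Init(sdl2.SDL_INIT_VIDEO)
window = sdl2.SDL_CreateWindow(b"PyBoy (sketch)",
                               sdl2.SDL_WINDOWPOS_CENTERED, sdl2.SDL_WINDOWPOS_CENTERED,
                               160 * 4, 144 * 4, sdl2.SDL_WINDOW_RESIZABLE)
renderer = sdl2.SDL_CreateRenderer(window, -1, sdl2.SDL_RENDERER_ACCELERATED)
# One 160x144 streaming texture representing the Game Boy screen
texture = sdl2.SDL_CreateTexture(renderer, sdl2.SDL_PIXELFORMAT_ARGB8888,
                                 sdl2.SDL_TEXTUREACCESS_STREAMING, 160, 144)

frame_argb = (ctypes.c_uint32 * (160 * 144))()  # filled by the emulator each frame

def present():
    # Upload the native-resolution frame once, then let the renderer scale
    # it to whatever size the window currently has
    pixels = ctypes.cast(frame_argb, ctypes.c_void_p)
    sdl2.SDL_UpdateTexture(texture, None, pixels, 160 * 4)  # pitch = 160 px * 4 bytes
    sdl2.SDL_RenderCopy(renderer, texture, None, None)
    sdl2.SDL_RenderPresent(renderer)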

Is there much of a difference between SDL2 and OpenGL? I thought SDL2 was implemented on top of OpenGL on Mac/Linux anyway (fyi, I'm on a Mac most of the time). There might be lower-level methods that let you do buffer things, though.

I think I'm going to try to tweak the version I wrote and see how fast it gets if I align the byte access and maybe skip some of the sdl2.ext components and go directly for the sdl2 calls, and then reintroduce some of the caching. I'm still using the original GameWindow_SDL2 to check my math (ok, I'm mostly just copying the math) but I'm trying to avoid NumPy arrays altogether, so it'll be pretty different. Still, if I can push it to the point where everything is being calculated exactly once, we should have a decent idea of what the upper bound is for speed with this method.

krs013 commented 5 years ago

Actually, it's definitely the calls to draw_point. I changed it to the underlying C functions instead of the ones in sdl2.ext and got a >2x speedup, but then I also short-circuited calls that draw white on a white background, and that caused the frame rate to vary from 6 to 150 fps. So... this strategy is out. Rendering individual points is just too slow!

Baekalfen commented 5 years ago

The cache can (and should) be built as late as possible, as it is only needed right before rendering a frame. It doesn't affect correctness of the emulation. If we construct it prematurely, we risk running into redundant cache updates. Building it late will likely also avoid the risk of spaghetti code.
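To illustrate what I mean by building it late (hypothetical names, not the current code): VRAM writes would only mark a tile dirty, and renderScreen would refresh the cache right before drawing.

class LazyTileCache:
    def __init__(self, vram, decode_tile):
        self.vram = vram                  # 0x2000 bytes of video RAM
        self.decode_tile = decode_tile    # (vram, offset) -> decoded 8x8 tile
        self.tiles = [None] * 384
        self.dirty = [True] * 384

    def on_vram_write(self, address):
        # called from the memory handler; just mark, don't decode yet
        # (ignoring writes to the tile map region for brevity)
        self.dirty[(address & 0x1FFF) // 16] = True

    def refresh(self):
        # called once per frame, right before rendering
        for n, stale in enumerate(self.dirty):
            if stale:
                self.tiles[n] = self.decode_tile(self.vram, n * 16)
                self.dirty[n] = False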

I don't know where VRAM resides, but in general, a detail only needs to be emulated if its behavior is observable by the Game Boy. We can do all the tricks we want when the Game Boy isn't looking.

The Texture ideas are definitely options. The last option, with a tile texture, might be feasible if we can render each tile with an 8x1 pixel mask for the designated line of the tile. Remember that we can save the viewport and window parameters for a whole frame and render it later.

I'm thinking OpenGL with shaders that take the raw tiles, the background and window layout (which tile index goes where), sprite data, and viewport/window parameters, and render the display. Easily doable, if I can find a working example of 2D rendering in OpenGL, preferably within Python. I have no experience in OpenGL, but plenty in OpenCL. They have some interoperability, so in the worst case, if what I want isn't possible in shaders, I know it is easy in OpenCL. And I'm also on a Mac.

Sounds good. I would also prefer a pure Python solution over one where C code does the heavy lifting.

krs013 commented 5 years ago

So I've been working on the pure SDL2 line-by-line version just for fun, and it runs at about 100% real time (so it's still slower than the NumPy implementation, but that's with no caching so it only gets faster). I based it on this tutorial, if you're interested: http://gigi.nullneuron.net/gigilabs/sdl2-pixel-drawing/

I think we have roughly the same vision of where this could go, though. I imagine a system where each tile in video memory gets cached (and yeah, the lazier the better) in a separate buffer that maybe remembers tiles even after they're removed, so they can be re-accessed by a hash of the 16 bytes that represent them (instead of reconstructing the tile). Then we have a map of tile map indices that point into the cache, and we render the entire background and window layers into a buffer that could be shown for debugging and fun. These layers can then be copied into the final window, with alpha for the sprites, and saving the scanline parameters would probably be more effective than copying line by line during the render, since if they all match, the whole viewport can be copied in one shot. (Is that an example of C doing the heavy lifting, though?)
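A minimal sketch of the hash-keyed part of that idea (names are mine, and decode_tile stands in for whatever actually builds an 8x8 tile from 16 bytes):

class HashedTileCache:
    def __init__(self, decode_tile):
        self.decode_tile = decode_tile   # 16 raw bytes -> decoded 8x8 tile
        self.by_data = {}                # bytes -> decoded tile, never evicted

    def get(self, vram, tile_index):
        raw = bytes(vram[tile_index * 16:(tile_index + 1) * 16])
        tile = self.by_data.get(raw)     # the immutable bytes object is the key
        if tile is None:
            tile = self.decode_tile(raw)
            self.by_data[raw] = tile
        return tile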

I was concerned that there might be changes during scanning other than the viewport and window position, like possibly palettes. I think I'm wrong about that, but rendering lines in real time seemed like a safer and more logical option while I was experimenting. I like what I've written and I think it's simpler and easier to understand when looking at the code, but it could definitely be made faster, as I'm still calculating each pixel on each line for each frame in Python.

I've done a little with OpenGL for 3D stuff but not in Python. I imagine direct pixel access in OpenGL is about as awkward as in SDL2, and probably not much faster either. SDL2 does have support for internal color palettes and indexed texture buffers, as I imagine OpenGL does as well, but I gave up on that and went back to 32-bit pixels for the simplicity. Not sure if it would be faster anyway.

Do you have an idea of what kind of speed you're hoping to achieve? If it were up to me I'd prefer to see if SDL2 is fast enough on its own before trying OpenGL, but I'm very stingy with dependencies (e.g. trying to remove NumPy). That said, I don't want to push this in a direction you're not happy with, and I'm okay with leaving some of the stuff I write as just fun experiments for myself.

Baekalfen commented 5 years ago

That sounds quite alright, without caching and such.

Yes, I think we agree on all accounts. What I said about C doing the heavy lifting might have been misplaced. It was in regard to having OpenGL shaders, written in a C-like language, doing all the work, which is kind of cheating when we call this a Python implementation. But I can live with it, as long as there is a more "pure" Python alternative...

About the extended caching: it's a balance. If the hashing and reindexing take too long, it might not be worth the saved recalculation of a tile. Most games will keep the tile data static for quite a long time (hundreds or thousands of frames), which means the difference between a simple and an advanced cache might be small. Pokemon changes a few tiles every few frames, but it doesn't affect performance that much.

Maybe it would be more interesting if the tiles were more expensive to create. For example, if we needed to instantiate SDL Sprite objects, or move the data to an OpenGL object explicitly on the GPU.

There might be other parameters. I haven't checked what happens if a palette is changed mid-frame. But my assumption is that a lot of these registers cannot (will not?) change during a frame — but I might be wrong! Maybe #30 has something to do with this.

A software version in OpenGL, I agree, would be about as good as SDL2. But a hardware-accelerated version will enable things I don't see in SDL2. Having yet another GameWindow for testing doesn't hurt, either.

I don't have any exact speed I'm going for. But you can try the dummy window and see how much time is spent on graphics. It just feels like too large a portion of time is spent on it at the moment. Performance-wise, a great goal would be to get all of PyBoy efficient enough for a Raspberry Pi (#35).

The dependencies don't bother me yet. SDL2 and NumPy are pretty lean. I agree that imageio should be made optional.

I'm happy so far with the direction you are taking. And as long as it's isolated to a GameWindow, you can make it any way you want. If your version of the GameWindow is different from the existing ones, we can add it, let the best one stay as the default, and keep the rest as optional.

Baekalfen commented 5 years ago

About the speed: faster rendering is always welcome, but my wish is mostly to get scaling sorted out, and in the best case, to get it at no penalty (hardware accelerated).

Baekalfen commented 5 years ago

I've made a small breakthrough. I've hacked together a very simple OpenGL version from my SDL2 version (still software rendering). It scales to any size for free, but it is still a bit rough. It doesn't take keyboard input yet, but I will look into that. It also displays colors in RGBA instead of ARGB, so it looks a bit weird.
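For the curious, the free scaling boils down to something like this (a stripped-down sketch with PyOpenGL and GLUT, not the actual GameWindow_OpenGL code; the frame buffer here is just a placeholder):

import numpy as np
from OpenGL.GL import (GL_COLOR_BUFFER_BIT, GL_RGBA, GL_UNSIGNED_BYTE,
                       glClear, glDrawPixels, glFlush, glPixelZoom, glRasterPos2f)
from OpenGL.GLUT import (GLUT_RGBA, glutCreateWindow, glutDisplayFunc,
                         glutInit, glutInitDisplayMode, glutInitWindowSize,
                         glutMainLoop)

SCALE = 4
frame = np.zeros((144, 160, 4), dtype=np.uint8)  # filled by the emulator (RGBA)

def display():
    glClear(GL_COLOR_BUFFER_BIT)
    glRasterPos2f(-1, 1)            # start drawing from the top-left corner
    glPixelZoom(SCALE, -SCALE)      # negative y because glDrawPixels is bottom-up
    glDrawPixels(160, 144, GL_RGBA, GL_UNSIGNED_BYTE, frame)
    glFlush()

glutInit()
glutInitDisplayMode(GLUT_RGBA)
glutInitWindowSize(160 * SCALE, 144 * SCALE)
glutCreateWindow(b"PyBoy (sketch)")
glutDisplayFunc(display)
glutMainLoop()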

Performance is no different from my SDL2 version.

But please do continue your version, as I really want to see where it could go. I promise we can include it, at the very least as an option.

[Screenshot of the OpenGL window, 2019-02-28 17:28]

Baekalfen commented 5 years ago

The OpenGL version is actually not bad at all. Everything is the same as SDL2, but it now has scaling. Have a look at it.

krs013 commented 5 years ago

That's great! I can't seem to run the OpenGL version, though. I'm assuming you installed PyOpenGL and PyOpenGL_accelerate via pip_pypy, but PyOpenGL_accelerate errors on compile for me. What packages did you install for this?

Also, now that I understand SDL2 a bit more, I might be able to add scaling to the SDL2 GameWindow as well. I'll give that a shot.

Btw, I was curious about PyPy and ran some tests, and my scanline version of GameWindow is actually pretty close to being playable under normal CPython! The speedup between CPython and PyPy is less than 3x for my code, but I think it's closer to 40x for the SDL2 GameWindow.

krs013 commented 5 years ago

Yep! The SDL2 GameWindow has scaling now, with little or no loss in speed as far as I can tell. I'll have to fast-forward my fork and then I'll open a PR.

Baekalfen commented 5 years ago

I don't think I have PyOpenGL_accelerate installed. I've done some work with this before, so I actually had all the dependencies already. I'll try to figure out which are needed.

Are you sure about the speed-up of Python vs. PyPy? I get something in the range of a 15x speed-up in the boot ROM and Pokemon Gold. Which games are you running?

Baekalfen commented 5 years ago

Ah, I get what you meant. Your scanline version is actually performing quite well in Python, while the SDL2 version is performing well on PyPy, but not the other way around. That's weird...

krs013 commented 5 years ago

Yeah, I'm not sure what causes that either, but I was a bit surprised! I did go out of my way to write the most efficient Python code I could, which often involves using builtins wherever possible, and I also tried to make it as algorithmically efficient as possible (within the scanline-cacheless paradigm), but I'm not sure that could account for the difference under CPython. I wonder if it's the NumPy calls, but that would surprise me, since NumPy is internally very quick, and there shouldn't be too much slow code on the Python side of it. Either way, I'll keep tweaking the scanline GameWindow in the future (or perhaps make an implementation more like the default SDL2 version that doesn't rely on NumPy to see if that helps). If possible, I think it would be really cool to get this running under normal Python, or perhaps get a version that can run in Python 3/pypy3 (although Python 3 is a good bit slower than Python 2 due to how it handles ints).

krs013 commented 5 years ago

So with just the normal PyOpenGL installed from PyPI, I can run other modules and the imports are okay, but if I try to run the OpenGL window I get this error:

Traceback (most recent call last):
  File "main.py", line 86, in <module>
    while not pyboy.tick():
  File "/Users/Kristian/Documents/GitHub/Baekalfen-PyBoy/Source/PyBoy/__init__.py", line 91, in tick
    self.window.updateDisplay()
  File "/Users/Kristian/Documents/GitHub/Baekalfen-PyBoy/Source/PyBoy/GameWindow/GameWindow_OpenGL.py", line 168, in updateDisplay
    OpenGL.GLUT.freeglut.glutMainLoopEvent()
  File "/usr/local/Cellar/pypy/7.0.0/libexec/site-packages/OpenGL/platform/baseplatform.py", line 407, in __call__
    self.__name__, self.__name__,
NullFunctionError: Attempt to call an undefined function glutMainLoopEvent, check for bool(glutMainLoopEvent) before calling

It did seem like FreeGLUT was something that was installed separately, but I wasn't sure where to get it or whether a special PyPy version is needed, so I'm just going to ignore it for now and work on other stuff.

Baekalfen commented 5 years ago

I think you'll find that NumPy works best in CPython, and that calls to C libraries in PyPy are heavily penalized. I haven't looked at the two versions, but that is my suspicion.

It would be really nice to get it working in CPython, but I wouldn't get my hopes up. Emulating the CPU alone takes a huge toll. I think the way to go is to make all or some of the parts in Cython.

I don't think I did anything special with FreeGLUT. Simply do a brew install freeglut. You can easily ignore it. It is not much different from the SDL2 version. But it might be interesting for RPi, as SDL2 didn't work the last time I tried.