ProtoArt / spritec

The Sprite Compiler turns your 3D models into 2D spritesheets
https://protoart.me/
Mozilla Public License 2.0

Port to OpenGL? #100

Closed sunjay closed 4 years ago

sunjay commented 4 years ago

Context

Originally posted by @sunjay in https://github.com/ProtoArt/spritec/issues/73#issuecomment-539800874

Now that I've had some time to step away from this for a while, I'm far less attached to the way it is. I'm totally fine with switching completely to OpenGL as long as we are willing to accept the liabilities that come with that. At least Mara and I will need to take on the learning curve of OpenGL and make sure we remain compatible with WebGL in order to keep our options open. We should only switch if we're willing to take on that burden.

This isn't super relevant to this issue, but the software rendering library came up so I thought I'd mention it.

P.S. It may be worth trying out euc 0.4 to see if it addresses any of our issues. They switched to a proper clipping implementation and you can now return 4D vectors from the vertex shader.

Originally posted by @marauder00 in https://github.com/ProtoArt/spritec/issues/73#issuecomment-539809929

I don't want to end up rebuilding the OpenGL pipeline to get all of our necessary rendering features working :D Especially when working with materials, textures, and post-processing effects, OpenGL's features will make the development of these features easier. I think these pros, including platform popularity and developer support, outweigh the initial setup cost.

TLDR, +1 on switching to OpenGL. I will check out euc 0.4 nevertheless.

Originally posted by @sunjay in https://github.com/ProtoArt/spritec/issues/73#issuecomment-540304178

@marauder00 The best way for us to switch to OpenGL in my opinion is to start to use glium. That crate provides a really nice wrapper around OpenGL that avoids a lot of the unsafety and common issues that people encounter while still giving you the full power of OpenGL. You can ignore the notice at the top that makes it seem like it's unmaintained. It's actually still very actively maintained, just not by the main person who started the crate. I don't even know why they keep that notice there frankly.

Do you want to start the work to port us to OpenGL? I can open an issue for it and provide some kind of roadmap if that would be helpful. (I haven't looked into it much, but I can do that work if you think it would help you.)

Since posting all of that about a month ago, I've had some more time to dig into this. I'll try to summarize my findings in this issue.

Porting to OpenGL

Porting to OpenGL is important for us to move forward, but I still want to leave the option of continuing with software rendering open until we finish a successful proof of concept. The reasons for this are expanded on below. Certain architectural concerns may end up outweighing the benefit of using the GPU. This will require some exploration and I really don't know how hard it will be just yet.

As another alternative, we may also want to consider using a ray tracer instead of OpenGL. I've already got one written in Rust and we could use that as a starting point. This is not something I am seriously considering just yet.

Pros of OpenGL over Software Rendering

Cons of OpenGL versus Software Rendering

Discussion

Performance, feature-set, and familiarity are the three main reasons we're considering porting to OpenGL. We want to make sure our software scales to huge numbers of sprites (since otherwise there really isn't a point) and we want to make sure we can comfortably add new features to our renderer.

One of the issues that's been holding us back on the latter point is that the euc crate is still pretty immature. It works as a basic software renderer, but it still needs a lot of work to support more rendering features. All of these features come out of the box with OpenGL. While euc 0.4 did address a lot of the issues we originally had (e.g. it now performs the perspective divide we previously had to do ourselves), there is still a lot missing that we might want to use.

As far as familiarity goes, we have more people familiar with OpenGL than people who are comfortable implementing graphics algorithms in Rust. This is a real issue that has been slowing us down since we started this project.

It is certainly nice that software rendering gives you the ability to swap out parts of the pipeline for better implementations. It's also nice that you can fix bugs in software rendering libraries (whereas doing that with OpenGL is pretty much impossible).

Multi-threading

One of my biggest concerns about the port, and something I hadn't thought about before, is multi-threading. Because our rendering is CPU-based, we are able to fully utilize all of the cores of the user's computer and use rayon to make sure we render everything as fast as possible using the user's entire machine. This has worked very well so far and we have had great performance even with large workloads (~400 sprites).
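For context, the current CPU path is essentially a parallel map over independent sprites. A minimal sketch of that shape, with `Sprite`, `Image`, and `render_sprite` as hypothetical stand-ins rather than spritec's real types:

```rust
use rayon::prelude::*;

// Hypothetical stand-ins for spritec's real types.
struct Sprite;
struct Image;

fn render_sprite(_sprite: &Sprite) -> Image {
    // ...software rasterization of a single sprite would happen here...
    Image
}

// Each sprite renders independently, so rayon spreads the work across
// every core of the user's machine.
fn render_all(sprites: &[Sprite]) -> Vec<Image> {
    sprites.par_iter().map(render_sprite).collect()
}
```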

The problem that OpenGL presents is that we're not writing a typical OpenGL application. We don't have a single window that we want to render to. In OpenGL terminology, we want to render "off-screen" to multiple FBOs (Frame Buffer Objects), copy those buffers to CPU memory, and then save that data to the filesystem as images (spritesheets). Managing all of this while staying efficient may not be easy. One of the biggest factors in deciding whether this is the way to go will be whether we can use OpenGL to outperform our current code without making our architecture overly complex. It may make our code difficult to manage if we have to constantly tell OpenGL which object we're currently trying to draw to.

Note: I don't know if it will make it too complex. That's why we're going to try to port things and see how it goes.
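To make the off-screen flow concrete, here is a minimal sketch of one cell's worth of work, assuming a glium `Facade` (headless context or hidden window) is created elsewhere; the actual draw call is elided:

```rust
use glium::backend::Facade;
use glium::Surface;

// Render one sprite frame off-screen and copy the pixels back to the CPU
// so they can later be written into a spritesheet image on disk.
fn render_cell_offscreen<F: Facade>(
    facade: &F,
    width: u32,
    height: u32,
) -> glium::texture::RawImage2d<'static, u8> {
    // Off-screen colour attachment standing in for one cell of the spritesheet.
    let target = glium::texture::Texture2d::empty(facade, width, height)
        .expect("failed to create off-screen texture");

    {
        // An FBO backed by that texture; draw calls go here instead of a window.
        let mut fbo = glium::framebuffer::SimpleFrameBuffer::new(facade, &target)
            .expect("failed to create framebuffer");
        fbo.clear_color(0.0, 0.0, 0.0, 0.0);
        // fbo.draw(&vertex_buffer, &indices, &program, &uniforms, &params).unwrap();
    }

    // The GPU->CPU copy: read the rendered pixels back into CPU memory.
    target.read()
}
```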

Libraries

Given that OpenGL isn't great when it comes to multi-threading, we may want to consider using something beyond just the glium crate I originally suggested. Glium doesn't directly support multi-threading. (That being said, since glium is written in Rust, it also doesn't allow you to use the API incorrectly--there are still no data races!)

The crayon crate is a full game engine which supports multi-threaded rendering. Using a full game engine might be overkill, but we may still want to look at their rendering code to see how they do things. Also, it looks like they support making draw calls from multiple threads. This isn't necessarily what we're trying to support. We want to make sure you can draw on distinct buffers from different threads, not necessarily the same buffer from different threads.

The radiant-rs crate claims to be a "Rust sprite rendering engine with a friendly API, wait-free send+sync drawing targets and custom shader support." This sounds exactly like what we need but using it would require some more research to see if it is really feasible.

Finally, if it turns out that using glium alone is sufficient, we may want to look into the framebuffer module to see if we can use it to draw to distinct FBOs from multiple threads.

Summary

Overall, I want to express that it's not important to me whether we use software rendering, OpenGL, ray tracing, or anything else. The most important thing for me is that we end up with something performant and productive. This should be the last time we switch rendering backends, so let's make sure we can actually start producing features quickly once we finish this work.

:tada:

sunjay commented 4 years ago

Update: I spoke to Sam about my concerns and he said that we may not run into many issues if we just use multiple OpenGL contexts. It should work to use one context per thread. We may also be able to render a full spritesheet from the same context if we use clipping to separate each image. Note that it's important that we don't try to use the same OpenGL context from multiple threads (not that glium / Rust would let us anyway).
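One way to realize the clipping idea (a hedged sketch, not a final design) is to point glium's `viewport` draw parameter at a different cell of one large off-screen target per frame, so the whole spritesheet is assembled on the GPU and read back once:

```rust
// Restrict rendering to the (col, row) cell of a spritesheet whose cells are
// cell_w x cell_h pixels. The surrounding FBO/context setup is assumed.
fn cell_params(col: u32, row: u32, cell_w: u32, cell_h: u32) -> glium::DrawParameters<'static> {
    glium::DrawParameters {
        viewport: Some(glium::Rect {
            left: col * cell_w,
            bottom: row * cell_h,
            width: cell_w,
            height: cell_h,
        }),
        ..Default::default()
    }
}
```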

That means that this may not be so bad after all! :tada:

Some more notes from my conversation with Sam:

marauder00 commented 4 years ago

Thank you so much for looking into this Sunjay :D !! Lots of great points here, I'll do my best to address as many as I can:

  1. I agree with Sam that the likelihood of us running into issues implementing our system in OpenGL is very small. I have experience rendering to offline buffers in OpenGL and DX12 so I feel like I can do it again if needed :D . However, I do acknowledge and recall the pain of reporting driver bugs. I suggest we do the port and only switch back if things get really ugly, which I doubt will happen.

  2. Another performance concern to keep in mind when writing the code that stitches frames onto a sprite sheet is the amount of data and time spent moving resources between the GPU and CPU (GPU->CPU bandwidth is a lot lower than GPU->GPU bandwidth, and is a known bottleneck in games).

  3. Three options come to mind when considering how we can implement this stitching process with/without OpenGL:

Something crazy I would want to try in the distant future is to see what happens if we render instances of the same mesh in different poses, arrange them in a grid in the same space, and just rely on the wonders of projection to turn that scene into a spritesheet. All that would be needed is figuring out where to place the camera to get everything in perfect view. :D
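For what it's worth, the camera placement for that trick mostly reduces to sizing an orthographic projection around the grid. A small, dependency-free sketch (column-major, as OpenGL expects; all parameters hypothetical):

```rust
// Build a symmetric orthographic projection that exactly frames a grid of
// `cols` x `rows` poses, each occupying a `cell` x `cell` square in the XY plane.
fn spritesheet_ortho(cols: u32, rows: u32, cell: f32, near: f32, far: f32) -> [[f32; 4]; 4] {
    let right = cols as f32 * cell / 2.0;
    let top = rows as f32 * cell / 2.0;
    let (left, bottom) = (-right, -top);
    [
        [2.0 / (right - left), 0.0, 0.0, 0.0],
        [0.0, 2.0 / (top - bottom), 0.0, 0.0],
        [0.0, 0.0, -2.0 / (far - near), 0.0],
        [
            -(right + left) / (right - left),
            -(top + bottom) / (top - bottom),
            -(far + near) / (far - near),
            1.0,
        ],
    ]
}
```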

Also, about geometry shaders, all I can say is that the future looks like we are moving away from them. To put things into context, mesh shaders are coming to DX12 (and Vulkan) and their existence will basically make geometry shaders obsolete: https://devblogs.microsoft.com/directx/coming-to-directx-12-mesh-shaders-and-amplification-shaders-reinventing-the-geometry-pipeline/ I understand why he's considering using geometry shaders, but they are infrequently used in practice, AMD's implementation is slow, we want a cross-platform win, and we can use compute shaders to do basically the same thing.

sunjay commented 4 years ago

These are great points! Thank you for taking the time to write them out! (twice!)

I will keep all of this in mind as I implement a prototype and start porting our code to use OpenGL. It sounds like the next logical step is for me to just get to work! :tada:


Aside: I remembered another consideration I forgot to write down earlier: Platform support. We have to be very careful when choosing our OpenGL version and when writing our code to make sure we continue to have good platform support.

In general, we want to make sure we continue to support all three major platforms: Windows, Mac, and Linux. Between the members of our team we have users on all of these so that hopefully won't be a big issue.

There are some general questions that I would like to have answered by the time we finish this work:

sunjay commented 4 years ago

I am somewhat arbitrarily choosing GLSL 1.40.08 (OpenGL 3.1, Nov 22, 2009). This is the version used in the tutorial and it is recent enough to be useful. If there are any objections to this I am happy to bump it as long as we don't limit platform support too much. We probably don't want to go below OpenGL 3.
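As a small illustration (not spritec's actual shaders) of how that choice gets pinned down in code: every shader source starts with `#version 140`, and glium compiles the pair into a program:

```rust
use glium::backend::Facade;

// Placeholder GLSL 1.40 shaders; `#version 140` corresponds to OpenGL 3.1.
const VERTEX_SRC: &str = r#"
    #version 140
    in vec3 position;
    uniform mat4 mvp;
    void main() {
        gl_Position = mvp * vec4(position, 1.0);
    }
"#;

const FRAGMENT_SRC: &str = r#"
    #version 140
    out vec4 color;
    void main() {
        color = vec4(1.0, 1.0, 1.0, 1.0);
    }
"#;

fn compile_program<F: Facade>(facade: &F) -> glium::Program {
    glium::Program::from_source(facade, VERTEX_SRC, FRAGMENT_SRC, None)
        .expect("shaders failed to compile under GLSL 1.40")
}
```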

sunjay commented 4 years ago

Okay I put some more thought into this and I think I pretty much have a plan for getting this implemented (at least the first few steps). A lot of my thought went into considering what I actually want to parallelize. At first I thought I'd have a thread pool for sending draw calls to the GPU in parallel on multiple threads. After reading about things a bit more, I realized that this doesn't really make sense since the GPU will already parallelize that stuff for me. I'm really just creating more contention / context-switching overhead.

I'm going to start by implementing it so we upload all the VBOs first, and then make draw calls. Once I have that working, I want to experiment with uploading VBOs and making draw calls in parallel on two separate threads with two separate OpenGL contexts. We can share the VBOs between the contexts, so there should be no issues with that (in theory).

To achieve max parallelism, I want to ensure that we start drawing as soon as possible. I don't actually know enough about GPUs to know if we can upload data and draw at the same time, which is why I'm starting with something simple so we can test the difference.

I also have other goals like avoiding uploading duplicate data to the GPU. I realized that we weren't really doing any caching before. This isn't such a big deal with the CPU, but with the GPU we're copying a lot and I'd like to avoid that if possible. I'll cache on the filename of the model for each pose/spritesheet and avoid loading the same data more than once.
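A hedged sketch of that cache, keyed on the model's path; the `Vertex` layout and `load_model_vertices` loader are placeholders, not spritec's real types:

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};
use std::rc::Rc;

use glium::backend::Facade;

#[derive(Copy, Clone)]
struct Vertex {
    position: [f32; 3],
}
glium::implement_vertex!(Vertex, position);

// Placeholder for whatever actually reads vertex data out of a model file.
fn load_model_vertices(_path: &Path) -> Vec<Vertex> {
    Vec::new()
}

#[derive(Default)]
struct GeometryCache {
    buffers: HashMap<PathBuf, Rc<glium::VertexBuffer<Vertex>>>,
}

impl GeometryCache {
    // Returns the cached VBO for `path`, uploading the data on first use only.
    fn get_or_upload<F: Facade>(&mut self, facade: &F, path: &Path) -> Rc<glium::VertexBuffer<Vertex>> {
        if let Some(buf) = self.buffers.get(path) {
            return Rc::clone(buf);
        }
        let vertices = load_model_vertices(path);
        let buf = Rc::new(glium::VertexBuffer::new(facade, &vertices).unwrap());
        self.buffers.insert(path.to_path_buf(), Rc::clone(&buf));
        buf
    }
}
```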

sunjay commented 4 years ago

Regarding the question of platform support, I am going to rule out WebGL support for the time being. I don't want to do the work to make sure our OpenGL shaders are also WebGL compatible.

If we really want to host our software as a web page at some point we can deal with it then. This doesn't affect our ability to use Electron for GUI (if we still ever go in that direction). We can always ship our Rust code as a native Node module (rather than a WASM module as we've been planning so far). This only affects our ability to create a web version of our software. That is not a high priority item at the moment.