controllerface / bvge

Personal game dev experiments

Optimize CL Kernels #79

Closed controllerface closed 1 month ago

controllerface commented 2 months ago

I have known for a long time that my OpenCL kernels are not really optimized at all in terms of register pressure and local memory usage. Originally, I was leaning toward using Vulkan compute at some point, so I didn't want to over-optimize the kernels in case it turned out to be wasted effort. After studying the finer points of memory usage for a while, I now know that the optimizations I could do would actually carry over to Vulkan compute, and doing them right would in fact require adhering to the one key Vulkan requirement I don't support now: statically specifying both work group sizes and local memory sizes.

I am also not in a huge hurry to port to Vulkan, so I think I will still have these kernels around for some time. As I continue to develop the game tech, it would be nice to have things running more efficiently than they do now, especially on less powerful hardware like my laptop. I am never going to get that system to be as fast as my desktop, but I am certain there are performance gains to be made there.

This task will be a relatively long-term one, as I will need to carefully update every kernel so it is called with a calculated local work size, instead of the current design where I let the driver choose the size for me and can provide arbitrary global work sizes. Instead, I will need to calculate the global size as a multiple of the local size, and add an argument to every kernel defining the actual number of work items, with any kernel invocations that meet or exceed that value simply returning early.
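
Roughly, the pattern I have in mind looks like this; just a sketch with made-up names, not one of the actual kernels:

```c
// Illustrative only: a hypothetical kernel using the early-return guard.
// Host side, the global size gets rounded up to a multiple of the local size:
//   size_t global = ((count + local - 1) / local) * local;
// and "count" is passed in as max_index so the padded invocations do nothing.
__kernel void integrate_positions(__global float2 *positions,
                                  __global const float2 *velocities,
                                  float dt,
                                  int max_index)
{
    int gid = get_global_id(0);
    if (gid >= max_index) return; // padded work item, bail out early

    positions[gid] += velocities[gid] * dt;
}
```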

Once I do this, I will be better able to optimize further by moving a lot of calculations from private memory to local memory. This is going to make the kernels a lot less readable, so comments will be very important, but it should alleviate the cases where the current logic results in register spilling, which I am almost certain is at least a contributing factor in the drastic spikes I see when debugging kernel timings. I notice that the times spike most when more objects are processed as they enter or leave the screen, or when the number of items on screen goes above some threshold.
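
The rough idea for the private-to-local move (again just a sketch with made-up names, and assuming a statically known work group size) is to carve one shared __local buffer into per-work-item slices, instead of holding a large private array that the compiler may spill:

```c
#define GROUP_SIZE 64      // local size must be known up front for this layout
#define POINTS_PER_HULL 8  // hypothetical per-item data size

__kernel __attribute__((reqd_work_group_size(GROUP_SIZE, 1, 1)))
void local_scratch_example(__global const float2 *hull_points,
                           __global float2 *results,
                           int max_index)
{
    int gid = get_global_id(0);
    int lid = get_local_id(0);

    // one slice of the shared LDS buffer per work item, instead of a private
    // array that the compiler may spill to scratch memory
    __local float2 scratch[GROUP_SIZE * POINTS_PER_HULL];
    __local float2 *my_points = scratch + (lid * POINTS_PER_HULL);

    if (gid >= max_index) return; // safe here, no barriers below

    for (int i = 0; i < POINTS_PER_HULL; i++)
    {
        my_points[i] = hull_points[gid * POINTS_PER_HULL + i];
    }

    // work against the local slice instead of a private array
    float2 centroid = (float2)(0.0f, 0.0f);
    for (int i = 0; i < POINTS_PER_HULL; i++)
    {
        centroid += my_points[i];
    }
    results[gid] = centroid / (float)POINTS_PER_HULL;
}
```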

The spikes are not linear, but appear to have a clear threshold, likely the point where register data spills into global memory, beyond which the timings jump very high. As I am making extremely heavy use of local variables, spilling is a very likely culprit. The processing time should increase somewhat as more items are tracked, but it should be relatively linear, with the extra processing time scaling roughly with the number of objects.

controllerface commented 2 months ago

I haven't modified any kernels yet, but I have spent a little time examining some of them using the Radeon GPU Analyzer tool, and it's been quite helpful in determining what is going on inside them.

Unsurprisingly, the egress_entities kernel is the only one with glaring issues in the tool. I say that because on my laptop, any time the sector boundary is hit, causing entities to egress, there's a noticeable lag. At first I figured this might be due to memory bandwidth just being worse on the laptop, but the analyzer clearly shows there's a bunch of register spilling happening, and that is very likely the core culprit.

So I think I will focus on that kernel first. If I can actually solve the stutter on the laptop, that would be pretty nice, because aside from a fairly low framerate (which is at least tolerable) the stutters are what make it currently unplayable on that hardware. If I can get it running acceptably, that would be great.

Analyzing a bunch of other kernels, focusing on the larger ones, it is clear there is also some low-hanging fruit to be had. But I will not delve too deeply into those yet, as the optimizations that can be done really do make the code a lot less readable. I want to take my time and see if I can figure out a good commenting scheme to maintain documentation inside the kernels, so modifying them later is not so difficult.

controllerface commented 2 months ago

Well, so far I am not off to the best start. I was able to eke out a modest reduction in vector registers in the egress kernel, but eliminating the scalar register spillage is going nowhere. I am pretty sure the issue is simply the size of the kernel itself. After reading up on what the scalar registers are for, it's not really about scalar values vs vector values; it's scalar with respect to the kernel invocations, i.e. SGPRs are for values that are uniform or "scalar" across the work group, while VGPRs are for data that varies per work item, in other words data that is "vectorized" across the work group.
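
As a rough mental model (illustrative only, not one of my actual kernels), the split looks something like this:

```c
// Rough intuition for which values the compiler can keep in scalar registers
// (uniform across the work group / wavefront) vs vector registers (different
// per work item).
__kernel void sgpr_vgpr_example(__global const float *input,
                                __global float *output,
                                float scale,    // uniform -> candidate for an SGPR
                                int max_index)  // uniform -> candidate for an SGPR
{
    int gid = get_global_id(0);   // varies per work item -> VGPR
    if (gid >= max_index) return;

    float value = input[gid];     // per-work-item data -> VGPR
    output[gid] = value * scale;  // the uniform scale stays scalar
}
```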

Muddying things a bit is the fact that the compiler can sometimes optimize what it stores in scalar registers depending on the kernel code, so it's not impossible that moving accesses around, reducing uses of scalar values, and other such changes will affect SGPR spillage, but SGPR usage is definitely not as directly impacted by these changes the way VGPR usage is.

So this means I will need to take a different approach. Following the guidance I found in some AMD presentation materials, I am going to slice up the kernel into smaller kernels, which honestly, I considered doing before, as it did work for the buffer compaction process. Granted, having everything in one big kernel does make it easier to see everything at a glance. However, it is clearly inefficient, and as a "bonus" the code no longer functions properly on my laptop at all, though the reasons and behavior were non-obvious, so it took me some time to figure out.

Something I noticed right away when trying to run the latest build (which has all the explicit work sizes set) on the laptop: the program crashes after the player falls to the ground, with the same weird error I have sometimes gotten on the laptop, notably when things are running slowly or less efficiently. This happens pretty regularly when heavy kernels are run on the laptop. Just as a test, I disabled gravity so the player just sits there, and sure enough, it doesn't crash until I move enough to unload sectors. After that, the work queue being used just mysteriously becomes invalid, and I am 99.9% sure this is a "watchdog" action from the driver, killing the queue because the invocation took too long.

I also tried reducing the size of the uniform grid significantly, reducing the load on that kernel by simply making the world smaller. Sure enough, it doesn't crash right away, and I can even run left or right for a while before I get the same error. Oddly, running left seems to work for quite a while, whereas running right crashes, not exactly at the same point, but very near a place where it looks like a bunch of water and basalt spikes load, and I suspect that is just enough to tip it over the edge, likely because of all the register spilling. It would also explain why my desktop doesn't have any issue; it's just a beefier machine and doesn't suffer as much from the register spilling. I now get up near 200 fps on the surface, and it dips down to usually around 110 fps once I dig down into the underground, so it just never becomes an issue on the newer hardware. At least that is my theory anyway.

My game plan will be to wire up something like the shift buffers used for the delete calls, and then split things up to look more like the compaction kernels, possibly even generating them the way the smaller compaction ones are. Hopefully this helps alleviate the issue and I can get back to looking for other hotspots.
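
The shape I have in mind is roughly like this; the names and fields are completely illustrative, not the real egress code, but the point is that each stage keeps far fewer values live at once, and intermediate results flow through global buffers instead of sitting in registers for the whole kernel body:

```c
// stage 1: decide which entities are leaving the active sector and flag them
__kernel void flag_egress_candidates(__global const float4 *entities,
                                     __global int *egress_flags,
                                     float2 sector_min,
                                     float2 sector_max,
                                     int max_index)
{
    int gid = get_global_id(0);
    if (gid >= max_index) return;

    float2 pos = entities[gid].xy;
    int outside = pos.x < sector_min.x || pos.x > sector_max.x
               || pos.y < sector_min.y || pos.y > sector_max.y;
    egress_flags[gid] = outside;
}

// stage 2: consume the flags (after a scan produces compacted offsets) and
// copy the flagged entities out; the heavier bookkeeping lives only here
__kernel void copy_egressed_entities(__global const float4 *entities,
                                     __global const int *egress_flags,
                                     __global const int *egress_offsets,
                                     __global float4 *egress_buffer,
                                     int max_index)
{
    int gid = get_global_id(0);
    if (gid >= max_index) return;
    if (!egress_flags[gid]) return;

    egress_buffer[egress_offsets[gid]] = entities[gid];
}
```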

controllerface commented 2 months ago

Well, I was able to get a new split kernel setup working, and the register spilling has indeed been solved; the main entity egress kernel now comes in 20 under the maximum scalar register count (86 out of 106). I probably won't be able to slim that down too much more, just because the kernel really does need to do a lot of work at one time to make things line up correctly, but at least I'm not getting warnings on that kernel in the analyzer anymore.

Unfortunately, the program still crashes on the laptop, which is frustrating. I may have to try backtracking to before I added the explicit work sizes and just go one by one until I find which one causes the problem. An early commit that had only a few converted over seems to work fine, so I don't think the issue is with the approach in general. Hopefully there's just some stupid bug I made that the AMD driver overlooks or something.

controllerface commented 2 months ago

Figured it out: it was indeed a bug that the AMD driver didn't care about but Nvidia did. I had forgotten to update the prepare liquids kernel to bail out when the max hull count was hit. Everything works again on the laptop 👍 Also, while it's not blazing fast and still stutters, it does feel like the stuttering is a tiny bit faster to resolve now, so I will take that small win. Splitting up the kernel was not a wasted effort, even if it had no impact on my desktop.

Will continue on now and see if there are gains to be made in other kernels. I may even see if I can slice up some others to make things faster that way. I still have not experimented with manually spilling to local memory, outside of the parallel scan kernels, which I will admit I did mostly by following an example, so I wasn't particularly aware of the impact. But in my basic debugger I do see those kernels are quite fast, and in the GPU analyzer they naturally use a lot fewer registers in general.

Not sure if it is some kind of error or misunderstanding on my part, but I notice in the analyzer the scan kernels show 0 for LDS, which is the local storage metric. They all clearly use __local buffers, so I am not sure if maybe that value is only reported when there are spills from local to global. In any case, if simply using local memory will in fact just speed things up, I won't be upset. But it definitely needs some tests.
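
One possible explanation I want to check (a guess on my part, not something I've confirmed in the RGA docs): a static report may only be able to count local memory whose size is known at compile time, so __local buffers passed in as kernel arguments and sized from the host with clSetKernelArg could plausibly show up as 0. Something like the difference between these two:

```c
// Statically sized local memory: the size is visible to the compiler, so a
// compile-time report can count it as LDS usage.
__kernel void scan_static_local(__global int *data, int n)
{
    __local int shared[256];              // assumes a local size of 256 or less
    int gid = get_global_id(0);
    int lid = get_local_id(0);
    if (gid < n) shared[lid] = data[gid];
    barrier(CLK_LOCAL_MEM_FENCE);
    if (gid < n) data[gid] = shared[lid]; // trivial round trip through LDS
}

// Dynamically sized local memory: the buffer is a kernel argument and only
// gets its size from the host via clSetKernelArg(kernel, 2, bytes, NULL),
// so the compiler cannot know how much LDS it will occupy.
__kernel void scan_dynamic_local(__global int *data, int n, __local int *shared)
{
    int gid = get_global_id(0);
    int lid = get_local_id(0);
    if (gid < n) shared[lid] = data[gid];
    barrier(CLK_LOCAL_MEM_FENCE);
    if (gid < n) data[gid] = shared[lid];
}
```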

controllerface commented 2 months ago

Aside from a few minor changes, so far I haven't seen any real worthwhile optimizations to focus on. I may put this back into the backlog and focus on something else, while keeping my eyes open for places where I can do some slicing up of larger kernels. If I make more progress on this, I will pull the story back out.