ROCm / ROCm-OpenCL-Runtime


Too many CmdBarriers are inserted, resulting in GPU underutilisation #143

Open 20k opened 2 years ago

20k commented 2 years ago

I've been pinning this issue down for a while. I'm on Windows 10, driver 22.3.1, on a 6700 XT.

When enqueuing a lot of kernels, if the kernels share any common arguments, the driver inserts a barrier, which results in a stall. In Radeon GPU Profiler it looks like this:

https://i.imgur.com/P5PBmAV.png

This is for the following kernel

__kernel void test(__global int* out) { }

Where the contents of the kernel do not matter. When invoking multiple copies of this kernel with a single shared read-only buffer, each kernel invocation is serialised, with gaps in the GPU execution. This happens regardless of whether the queue is out of order or not: if any arguments of two kernels overlap, a barrier will be generated. This can be a significant overhead when executing a lot of kernels that all read from the same set of data but output to many different buffers.

As far as I know, a barrier shouldn't be needed just for shared read-only buffers, right? Is there any way around this at all, or is this a bug/wontfix?
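For reference, a minimal host-side sketch of the submission pattern described above (ctx, dev and a built test kernel are assumed to already exist; N and SIZE are just illustrative):

    cl_int err;
    cl_queue_properties props[] = {
        CL_QUEUE_PROPERTIES, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, 0};
    cl_command_queue queue = clCreateCommandQueueWithProperties(ctx, dev, props, &err);

    // One buffer shared by every dispatch; the kernels only ever read it
    cl_mem shared = clCreateBuffer(ctx, CL_MEM_READ_ONLY, SIZE, NULL, &err);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &shared);

    size_t global = 1024;
    for (int i = 0; i < N; ++i) {
        // The overlapping argument is enough for the runtime to insert a
        // CmdBarrier between dispatches, even on an out-of-order queue
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
    }
    clFinish(queue);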

vsytch commented 2 years ago

As far as I'm aware, AMD's open source LLVM compiler doesn't produce the metadata needed for the runtime to determine whether a buffer passed to a kernel is read only. This was supported with the closed source compiler, and the appropriate optimization should already be implemented in the runtime.

Could you set the env var GPU_ENABLE_LC=0 to force the closed source compiler and recapture the RGP trace? Hopefully there won't be so many barriers anymore.

20k commented 2 years ago

After realising that my IDE scrubs environment variables and being puzzled for a while, setting that environment variable does indeed result in a correct-looking RGP trace:

https://i.imgur.com/MoKHvBS.png

Which at least answers that mystery! This doesn't fix the problem (the closed source compiler is a solid 40%-1000% slower for the actual use case I'm debugging, which is unfortunately not a typo), but it makes sense.

On the plus side, I think it may be possible to work around this by abusing multiple command queues, which - while not ideal - at least seems to recover some of the lost performance

vsytch commented 2 years ago

@gandryey might have a better solution to your issue, but ultimately AMD's LLVM team would need to support OpenCL buffer attributes, which I do not think will happen.

The closed source compiler doesn't have official support for OpenCL on Navi; it sort of "just" works. That would explain your performance issues with it.

20k commented 2 years ago

Thanks very much for the replies. Do you happen to know if this is something that affects HIP as well? I've been considering switching APIs to get better performance, but I'm slightly reluctant to go full Vulkan.

20k commented 2 years ago

Some brief testing has shown that using a ring of command queues (in my case 16) to submit parallel work, and using markers to synchronise, does seem to successfully recover most of the expected performance - in my case about a 10% reduction in runtime (144ms -> 130ms). So that workaround seems to work well enough at least.
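Roughly, the workaround looks like the following (just a sketch; ctx, dev and kernel are assumed to exist, and NQUEUES/NKERNELS are illustrative):

    enum { NQUEUES = 16, NKERNELS = 100 };
    cl_command_queue ring[NQUEUES];
    for (int i = 0; i < NQUEUES; ++i)
        ring[i] = clCreateCommandQueueWithProperties(ctx, dev, NULL, NULL);

    cl_event done[NKERNELS];
    size_t global = 1024;
    for (int i = 0; i < NKERNELS; ++i) {
        // Independent dispatches are spread round-robin over the ring, so the
        // driver-inserted barriers only serialise work within a single queue
        cl_command_queue q = ring[i % NQUEUES];
        clEnqueueNDRangeKernel(q, kernel, 1, NULL, &global, NULL, 0, NULL, &done[i]);
        clFlush(q);
    }

    // Synchronisation point: a marker that waits on all outstanding dispatches,
    // so later work only has to depend on a single event
    cl_event sync;
    clEnqueueMarkerWithWaitList(ring[0], NKERNELS, done, &sync);

    for (int i = 0; i < NKERNELS; ++i)
        clReleaseEvent(done[i]);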

vsytch commented 2 years ago

Thanks very much for the replies. Do you happen to know if this is something that affects HIP as well? I've been considering switching APIs to get better performance, but I'm slightly reluctant to go full vulkan

With HIP your situation is even worse. Because CUDA supports pointers to pointers (and has no concept of read only memory), the runtime cannot do any memory tracking, so a memory barrier is inserted after every dispatch.

gandryey commented 2 years ago

Is there any way around this at all, or is this a bug/wontfix?

LC doesn't track read/write operations in the kernel, hence the runtime can't remove the barrier. Potentially you could work around this issue with the CL flag CL_MEM_READ_ONLY. Apps usually ignore those flags, and unfortunately the current logic in the runtime relies only on the compiler's tracking, to prevent possibly incorrect input from the apps.

          // Validate memory for a dependency in the queue
          memoryDependency().validate(*this, gpuMem, info.readOnly_);

There are certain tasks where this optimization could help, but usually the app can just use multiple OCL queues in those scenarios. If you think asynchronous execution can help in your case, then please try 2 OCL queues. Please note the Windows driver doesn't support user mode submission, so to start asynchronous execution the app may need to call clFlush() on both queues.
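A minimal sketch of that two-queue suggestion (ctx, dev, kernel and SIZE are assumed to exist):

    cl_command_queue q0 = clCreateCommandQueueWithProperties(ctx, dev, NULL, NULL);
    cl_command_queue q1 = clCreateCommandQueueWithProperties(ctx, dev, NULL, NULL);

    // Mark the shared input as read only at creation time
    cl_mem input = clCreateBuffer(ctx, CL_MEM_READ_ONLY, SIZE, NULL, NULL);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &input);

    size_t global = 1024;
    clEnqueueNDRangeKernel(q0, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
    clEnqueueNDRangeKernel(q1, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);

    // Without user mode submission on Windows, work may not start executing
    // until both queues are flushed explicitly
    clFlush(q0);
    clFlush(q1);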

20k commented 2 years ago

In the end, I was able to create a longer-term solution for myself by inspecting the arguments to kernels for their read/write flags, and then automatically distributing work across a number of command queues, with events to synchronise between kernels that have read/write and write/write dependencies. This does involve rewriting most of the APIs in question to support this model of execution, as you need to be able to inspect the cl_mem arguments to all functions, e.g. clEnqueueWriteBuffer.

This was actually much easier to do than I thought it would be, which is nice. It's a shame that OpenCL has no way to query kernel arguments directly though, so it can't be done as directly as you might like.
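A very rough sketch of the dependency handling (the queue, kernel and buffer names here are illustrative; the real code derives the dependencies from per-argument read/write flags rather than hard-coding them):

    size_t global = 1024;
    cl_event a_done;

    // kernel_a writes buf_x on one queue
    clSetKernelArg(kernel_a, 0, sizeof(cl_mem), &buf_x);
    clEnqueueNDRangeKernel(q0, kernel_a, 1, NULL, &global, NULL, 0, NULL, &a_done);

    // kernel_b only reads buf_x, so it can go on another queue, but the
    // read-after-write dependency means it must wait on kernel_a's event
    clSetKernelArg(kernel_b, 0, sizeof(cl_mem), &buf_x);
    clEnqueueNDRangeKernel(q1, kernel_b, 1, NULL, &global, NULL, 1, &a_done, NULL);

    clFlush(q0);
    clFlush(q1);
    clReleaseEvent(a_done);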

20k commented 1 year ago

I've been playing around with this a lot since I wrote this bug report, and it's becoming an ever-bigger thorn in trying to get acceptable GPU performance out of AMD's OpenCL runtime.

There are a few problems with the multiple-queue workaround that are unfortunately unfixable:

  1. Each command queue appears to be backed by a separate driver thread. This means the solution doesn't scale up, as the threads start fighting each other past a certain number of command queues

  2. The complexity and programmer overhead of building and maintaining a full object tracker + pure event-driven workflow + multiple command queue system, while marking up memory manually to work around AMD-specific issues, is extremely high

  3. There seem to be some instabilities in the driver when you're using events + multiple queues like this, and random segfaults are very common after a few hours of runtime for a simulation. This unfortunately leads to spectacularly hard-to-diagnose issues, especially as there don't appear to be any non-deprecated AMD OpenCL debugging tools for the API

In my case, I'm submitting a hundred or so kernels per 'tick', with each tick lasting ~30ms

Ideally, memory flags like CL_MEM_READ_ONLY would replace the compiler-based system: while you'd still need to correctly mark memory, it would massively reduce the amount of work you have to put in to get good performance out of AMD GPUs with OpenCL.

b-sumner commented 1 year ago

Would you be able to point to some code and instructions for building and running it that will allow others to reproduce the segfault issue you mentioned? While this may be an AMD driver issue, applications leaking events is very common.