ROCm / clr


OpenCL Performance: way to extract parallel kernel execution with out-of-order command queue #67

Open preda opened 6 months ago

preda commented 6 months ago

I'm developing an OpenCL application, PRPLL/GpuOwl (https://github.com/preda/gpuowl/tree/prpll), for a prime-search project.

The app runs a long series of kernels serially, in a long loop; e.g. the sequence of kernels submitted might be A, B, C, D, A, B, C, D, and so on. As these kernels must run serially, it's natural to use an in-order queue.

So initially we had a single process, with a single in-order queue.

An observation was made that when running two such processes in parallel (independent processes, running on the same GPU), each process achieves a bit better than half the single-process performance. I.e. the aggregate throughput is higher when running two processes in parallel on one GPU than when running a single process on the GPU.

Taking this observation into account, I wanted to reproduce the behaviour observed with two processes inside a single process, by running two "logical" streams of kernels. The reasoning is that while each stream is serial, there is parallelism between the two streams that the GPU can exploit. E.g. if we run A1,B1,C1,D1 on stream1 and A2,B2,C2,D2 on stream2, then A1 can be executed on the GPU in parallel with any kernel from stream2. (By "stream" I mean a logical sequence of kernels that must be executed serially/in-order.)

My first approach was to use two in-order command queues, allocating one queue to each logical "stream". But I hit this bug https://github.com/ROCm/ROCR-Runtime/issues/186, which causes one hot thread (100% CPU) and performance degradation when using two queues.
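Schematically, that first approach looked like this (a simplified sketch with illustrative names, not the actual PRPLL code; it assumes a context ctx, device dev, and kernels already created):

```cpp
#include <CL/cl.h>

// Sketch: one in-order queue per logical stream. Within a queue the
// kernels serialize; across the two queues the runtime is in principle
// free to overlap execution.
void submitTwoStreams(cl_context ctx, cl_device_id dev,
                      cl_kernel A, cl_kernel B, size_t gsize) {
  cl_int err;
  // NULL properties -> default, i.e. in-order, queues.
  cl_command_queue s1 = clCreateCommandQueueWithProperties(ctx, dev, NULL, &err);
  cl_command_queue s2 = clCreateCommandQueueWithProperties(ctx, dev, NULL, &err);

  // Stream 1: A1 -> B1, order enforced by the in-order queue.
  clEnqueueNDRangeKernel(s1, A, 1, NULL, &gsize, NULL, 0, NULL, NULL);
  clEnqueueNDRangeKernel(s1, B, 1, NULL, &gsize, NULL, 0, NULL, NULL);
  // Stream 2: A2 -> B2, independent of stream 1.
  clEnqueueNDRangeKernel(s2, A, 1, NULL, &gsize, NULL, 0, NULL, NULL);
  clEnqueueNDRangeKernel(s2, B, 1, NULL, &gsize, NULL, 0, NULL, NULL);

  clFinish(s1);
  clFinish(s2);
  clReleaseCommandQueue(s1);
  clReleaseCommandQueue(s2);
}
```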

As a consequence, I decided to use a single out-of-order command queue and to model the serial dependencies inside the logical streams with OpenCL event wait lists. Unfortunately, after implementing this, I realized that no parallelism is exploited between the two "streams". It appears that no kernels are executed in parallel at all, even though some could and should be.

Example: assume these kernels are submitted to the out-of-order queue: A1,C2,B1,D2,C1, with the dependencies modelled through events: A1<B1<C1 and C2<D2. Then A1 and C2 could run in parallel on the GPU, as sketched below.
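Schematically, the event-based variant looks like this (again a simplified sketch with illustrative names, assuming ctx, dev, and the kernels exist):

```cpp
#include <CL/cl.h>

// Sketch: single out-of-order queue; per-stream order A1 < B1 < C1 and
// C2 < D2 is expressed via event wait lists. Submission order matches
// the example above: A1, C2, B1, D2, C1.
void submitInterleaved(cl_context ctx, cl_device_id dev,
                       cl_kernel A, cl_kernel B, cl_kernel C, cl_kernel D,
                       size_t gsize) {
  cl_int err;
  cl_queue_properties props[] = {
      CL_QUEUE_PROPERTIES, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, 0};
  cl_command_queue q = clCreateCommandQueueWithProperties(ctx, dev, props, &err);

  cl_event a1, b1, c1, c2, d2;
  clEnqueueNDRangeKernel(q, A, 1, NULL, &gsize, NULL, 0, NULL, &a1); // A1
  clEnqueueNDRangeKernel(q, C, 1, NULL, &gsize, NULL, 0, NULL, &c2); // C2: no deps
  clEnqueueNDRangeKernel(q, B, 1, NULL, &gsize, NULL, 1, &a1, &b1);  // B1 after A1
  clEnqueueNDRangeKernel(q, D, 1, NULL, &gsize, NULL, 1, &c2, &d2);  // D2 after C2
  clEnqueueNDRangeKernel(q, C, 1, NULL, &gsize, NULL, 1, &b1, &c1);  // C1 after B1

  clFinish(q);
  cl_event evs[] = {a1, b1, c1, c2, d2};
  for (int i = 0; i < 5; ++i) clReleaseEvent(evs[i]);
  clReleaseCommandQueue(q);
}
```

With a genuinely out-of-order queue, A1 and C2 have no dependency between them, so nothing should prevent them from overlapping.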

(Another scenario is A1,B1,A2 with the dependency A1<B1; here A1 and A2 are eligible for parallel execution, though this fact is less obvious. I would hope this parallelism opportunity can be exploited as well.)

But this is not what is observed: by timing the kernels, I obtain a profile that is consistent with all the kernels being run serially.
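(The timing comes from OpenCL event profiling, roughly like the simplified sketch below; it assumes the queue was created with CL_QUEUE_PROFILING_ENABLE:)

```cpp
#include <CL/cl.h>
#include <stdio.h>

// Sketch: read per-kernel timestamps from an event. Overlapping
// [start, end) intervals across kernels would indicate parallel
// execution; strictly back-to-back intervals indicate serial execution.
void printKernelTime(cl_event e, const char *name) {
  cl_ulong start = 0, end = 0;
  clWaitForEvents(1, &e);
  clGetEventProfilingInfo(e, CL_PROFILING_COMMAND_START, sizeof(start), &start, NULL);
  clGetEventProfilingInfo(e, CL_PROFILING_COMMAND_END, sizeof(end), &end, NULL);
  printf("%s: start=%llu end=%llu (%.3f us)\n", name,
         (unsigned long long)start, (unsigned long long)end,
         (end - start) / 1000.0);
}
```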

When the kernels are run via two processes, I see that the "running" time of each kernel grows (almost doubles) as a consequence of the two processes using the GPU in parallel. The kernels from the two processes are effectively executed in parallel, and this shows up both in the per-kernel running time and in the overall improved throughput.

But when the kernels are run through an "interleaved out-of-order queue", the running time of each kernel does not increase. That means each kernel is executed "standalone", and no parallelism is exploited. The aggregate throughput is consistent with serial execution (lower than when running through two processes).

Basically, I want to obtain the same level of parallelism and performance from a single process (either with multiple queues, or with a single out-of-order queue) as is obtained by running two processes, each with a single in-order queue.

The issue can be reproduced using this project (at the given commit, or generally the "prpll" branch): https://github.com/preda/gpuowl/tree/7520fade45359f07f19151085d1dff5480ab29a9. Compile with make in the source folder, execute echo PRP=118845473 > work-1.txt, and run ./build-debug/prpll -d 0 -prp 118063003 -verbose

(Basically, the above runs two PRP tests for the two numbers mentioned: one from the work-1.txt file, and one from the command line.)

preda commented 6 months ago

I'm using ROCm 6.1 RC, on Ubuntu 22.04 with Linux kernel 6.8.1, with Radeon Pro VII and Radeon VII GPUs.

preda commented 6 months ago

I think I found a clue: in clinfo for all my GPUs (Radeon VII and Radeon Pro VII) I see:

  Queue on Host properties:                              
    Out-of-Order:                                No
    Profiling :                                  Yes
  Queue on Device properties:                            
    Out-of-Order:                                Yes
    Profiling :                                  Yes

So it seems that the host command queue does not implement out-of-order execution. OK.
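For reference, the same information clinfo prints can be queried directly from the device (a simplified sketch):

```cpp
#include <CL/cl.h>
#include <stdio.h>

// Sketch: query which modes host-side command queues support on a device.
void checkHostQueueProps(cl_device_id dev) {
  cl_command_queue_properties props = 0;
  clGetDeviceInfo(dev, CL_DEVICE_QUEUE_ON_HOST_PROPERTIES,
                  sizeof(props), &props, NULL);
  printf("Host queue out-of-order: %s\n",
         (props & CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE) ? "Yes" : "No");
}
```

Worth noting: per the OpenCL spec, clCreateCommandQueueWithProperties should return CL_INVALID_QUEUE_PROPERTIES when asked for a mode the device does not support, so I would have expected queue creation to fail rather than silently produce an in-order queue.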

Why is that? Is it a limitation of the hardware (these particular GPU models), of the ROCm version, something not yet implemented in software, or something else?

Thanks anyway. My observation was correct (i.e. the out-of-order queue is not actually running out-of-order), but the cause is not a bug; rather, it's by design.

preda commented 6 months ago

One more observation: the hardware is obviously capable of running multiple compute kernels in parallel, as it does so when the kernels are queued from multiple processes. So the missing out-of-order execution can't be a limitation of the hardware; it's probably more a case of "not implemented". Are there plans in that direction? Is it implemented in HIP?
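For comparison, HIP expresses this pattern with streams, the analogue of two in-order queues (a rough sketch with illustrative kernels, not PRPLL code; whether the kernels actually overlap on ROCm is exactly the question):

```cpp
#include <hip/hip_runtime.h>

__global__ void kernelA(float *x) { x[threadIdx.x] += 1.0f; }
__global__ void kernelB(float *x) { x[threadIdx.x] *= 2.0f; }

// Sketch: two HIP streams. Kernels within a stream run in order;
// kernels on different streams may run concurrently if the
// hardware/runtime allows it.
void runTwoStreams(float *buf1, float *buf2, int blocks, int threads) {
  hipStream_t s1, s2;
  hipStreamCreate(&s1);
  hipStreamCreate(&s2);

  hipLaunchKernelGGL(kernelA, dim3(blocks), dim3(threads), 0, s1, buf1);
  hipLaunchKernelGGL(kernelB, dim3(blocks), dim3(threads), 0, s1, buf1);
  hipLaunchKernelGGL(kernelA, dim3(blocks), dim3(threads), 0, s2, buf2);
  hipLaunchKernelGGL(kernelB, dim3(blocks), dim3(threads), 0, s2, buf2);

  hipStreamSynchronize(s1);
  hipStreamSynchronize(s2);
  hipStreamDestroy(s1);
  hipStreamDestroy(s2);
}
```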