ROCm / HIP

HIP: C++ Heterogeneous-Compute Interface for Portability
https://rocmdocs.amd.com/projects/HIP/
MIT License

[Issue]: Asynchronous execution with hipExtModuleLaunchKernel #3501

Open konradkusiak97 opened 1 month ago

konradkusiak97 commented 1 month ago

Problem Description

It is my understanding that passing hipExtAnyOrderLaunch as the last argument (the flags) to hipExtModuleLaunchKernel lets the kernels I dispatch execute asynchronously with respect to each other.

So I could dispatch all my kernels to a single hipStream_t by calling hipExtModuleLaunchKernel with this flag for each kernel, and they would execute asynchronously. Is that correct?
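A minimal sketch of the launch pattern described above, assuming a code object has already been loaded with hipModuleLoad and the kernel handle obtained with hipModuleGetFunction. The kernel signature (float*, uint32_t), the helper names globalWorkSize and launchAnyOrder, and the one-buffer-per-launch layout are hypothetical. The flag only permits relaxed ordering within the stream; whether the kernels actually overlap still depends on the runtime, the hardware, and how much of the GPU each kernel occupies. The HIP part is guarded so that the size helper also compiles on a host without ROCm installed.

```cpp
#include <cstddef>
#include <cstdint>

// hipExtModuleLaunchKernel takes the global size in work-items (not in blocks),
// so round the problem size up to a whole number of work-groups.
inline uint32_t globalWorkSize(uint32_t n, uint32_t blockSize) {
    return ((n + blockSize - 1) / blockSize) * blockSize;
}

#if __has_include(<hip/hip_runtime.h>)
#include <hip/hip_runtime.h>
#include <hip/hip_ext.h>

// Launch `count` independent kernels on ONE stream, each with hipExtAnyOrderLaunch.
// Assumption: `fn` refers to a kernel with the hypothetical signature
//   __global__ void k(float* buf, uint32_t n)
// and every launch works on its own buffer, so no ordering is required.
inline hipError_t launchAnyOrder(hipFunction_t fn, hipStream_t stream,
                                 float* const* buffers, int count,
                                 uint32_t n, uint32_t blockSize) {
    for (int i = 0; i < count; ++i) {
        float* buf = buffers[i];
        void* params[] = {&buf, &n};
        hipError_t err = hipExtModuleLaunchKernel(
            fn,
            globalWorkSize(n, blockSize), 1, 1,  // global size in work-items
            blockSize, 1, 1,                     // work-group size
            0,                                   // dynamic shared memory bytes
            stream, params, nullptr,             // kernelParams, extra
            nullptr, nullptr,                    // startEvent, stopEvent
            hipExtAnyOrderLaunch);               // flags
        if (err != hipSuccess) return err;
    }
    // Wait for all launches before the caller inspects results.
    return hipStreamSynchronize(stream);
}
#endif
```

For overlap to be observable at all, each kernel should leave part of the GPU idle (for example, a small grid that runs for a comparatively long time); kernels that each fill the device will appear serialized regardless of the flag.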

I've been experimenting with it but couldn't achieve this behaviour. I used a single non-blocking stream, but all the kernels I launched through this entry point were executed one after another, despite setting the flags argument to 1. I checked this by collecting a trace with rocprof and viewing it in https://ui.perfetto.dev/ to see whether the kernels overlap.

Would you be able to provide me with an example of how to use this particular feature to achieve concurrency in a single stream? And how to profile it to see the correct behaviour? Thank you!

Operating System

Ubuntu

CPU

AMD EPYC 7763 64-Core Processor

GPU

AMD Instinct MI210

ROCm Version

ROCm 6.0.0

jaydeeppatel1111 commented 1 month ago

Hello @konradkusiak97, is it possible to share the sample?

jaydeeppatel1111 commented 1 month ago

Hello @konradkusiak97, can you also share the device info? Thanks!

konradkusiak97 commented 1 month ago

Hi @jaydeeppatel1111, thanks for the reply. I was experimenting with this feature in our unified-runtime project, so I don't have an easy reproducer, but I can try to make one.

What I'm really interested in is just an example, for instance an existing test that calls hipExtModuleLaunchKernel several times with the hipExtAnyOrderLaunch flag, submitting each kernel to the same hipStream_t, and then checks (for instance in the profiler) that those kernels indeed run asynchronously.

In any case, I'll try to make a reproducer. The device info:
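The profiler check mentioned above can also be done programmatically. Below is a minimal sketch, assuming the begin and end timestamps of each kernel have already been extracted from the profiler output into memory; the KernelSpan type and the countOverlaps helper are hypothetical names, not part of HIP or rocprof.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Execution window of one kernel, in nanoseconds (taken from profiler output).
struct KernelSpan {
    uint64_t beginNs;
    uint64_t endNs;
};

// Count the kernels that started before every earlier-starting kernel had
// finished. A result of 0 means the kernels ran strictly one after another.
inline std::size_t countOverlaps(std::vector<KernelSpan> spans) {
    std::sort(spans.begin(), spans.end(),
              [](const KernelSpan& a, const KernelSpan& b) {
                  return a.beginNs < b.beginNs;
              });
    std::size_t overlaps = 0;
    uint64_t latestEnd = 0;  // latest end time among kernels seen so far
    for (std::size_t i = 0; i < spans.size(); ++i) {
        if (i > 0 && spans[i].beginNs < latestEnd) {
            ++overlaps;
        }
        latestEnd = std::max(latestEnd, spans[i].endNs);
    }
    return overlaps;
}
```

A kernel that begins exactly when the previous one ends is treated as serialized, which matches what a timeline view would show as back-to-back execution.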

  Marketing Name:          AMD EPYC 7763 64-Core Processor
  Name:                    gfx90a
  Marketing Name:          AMD Instinct MI210
      Name:                    amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-

Let me know if more verbose output from rocminfo would be better.