GPUOpen-Tools / gpu_performance_api

GPU Performance API for AMD GPUs
MIT License
250 stars 46 forks

Multiple Sessions at the same time #79

Open JinsuaFeito-dev opened 1 year ago

JinsuaFeito-dev commented 1 year ago

Hi, I want to profile two parallel pipelines. I have created two sessions, one linked to each pipeline, but the results I am getting are quite surprising. Is it possible that the sessions are interfering with each other?

thank you!

PLohrmannAMD commented 1 year ago

It really depends on how you are starting / stopping the samples. It is not possible to collect two samples at the same time, even if they are from two different sessions. The API may not be properly preventing this right now though. You would need to synchronize between samples (or have the driver do that automatically for you), which unfortunately means your workloads are no longer running in parallel.
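To make the synchronization point concrete, here is a minimal sketch of two threads sharing one lock so that only one sample is ever open. It assumes nothing about the real GPA calls: `SampleTracker`, `profile_pipeline`, and `run_serialized` are hypothetical stand-ins, and the `sleep` stands in for submitting the work and waiting on a fence.

```python
import threading
import time


class SampleTracker:
    """Records how many stand-in samples are open at once (hypothetical helper)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._active = 0
        self.max_active = 0

    def begin(self):
        with self._lock:
            self._active += 1
            self.max_active = max(self.max_active, self._active)

    def end(self):
        with self._lock:
            self._active -= 1


def profile_pipeline(name, sample_lock, tracker, finished):
    # Holding sample_lock for the whole sample means the other thread's
    # workload cannot start until this one has been measured.
    with sample_lock:
        tracker.begin()      # stand-in for beginning a sample
        time.sleep(0.01)     # stand-in for submit + fence wait
        tracker.end()        # stand-in for ending the sample
        finished.append(name)


def run_serialized(names=("pipeline_a", "pipeline_b")):
    sample_lock = threading.Lock()  # one lock shared by every session
    tracker = SampleTracker()
    finished = []
    threads = [
        threading.Thread(target=profile_pipeline,
                         args=(n, sample_lock, tracker, finished))
        for n in names
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return finished, tracker.max_active
```

With the lock in place, `run_serialized()` reports a maximum of one open sample, which is exactly the cost described above: the two workloads are measured one after the other rather than side by side.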

Sessions were originally intended to be used for multiple subsequent profiles and/or to collect data from different sets of counters, but they were not intended to be used simultaneously.

What data are you trying to gather about each individual pipeline while running in parallel? Can this same data be gathered by running them serially?

If you have concerns about the two pipelines competing for HW resources, you'll likely be better served by using Radeon GPU Profiler (https://gpuopen.com/rgp/) to understand the behavior of your application when they are running in parallel.

JinsuaFeito-dev commented 1 year ago

Hi, I want to check how both pipelines compete for HW resources; that's why I need them to run in parallel.

I have tried to use RGP, but it does not seem to work with the graphics card I am using (E9171 MCM). Is there any way to make RGP work with this graphics card? Is there any way to profile both pipelines with the GPU Performance API?

thank you!!!

PLohrmannAMD commented 1 year ago

Ah okay, so a slightly older device. Based on my memory, the hardware or driver does not properly support RGP, so there is no way to get it working.

The performance counters are collected using actual physical hardware, so they are another "resource" that gets used. This is exactly why we cannot have two samples happening simultaneously - the workload of the second pipeline would then be influencing the counter results from the first pipeline. GPUPerfAPI should be preventing this at the API level since it would cause incorrect results. I think you discovered a missing piece in our testing by using two different sessions!
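As an illustration of what "preventing this at the API level" could look like, here is a small sketch of a guard that refuses to open a second sample while one is active, whichever session asks. This is not GPUPerfAPI's implementation; `SampleGuard`, `try_begin`, and `end` are hypothetical names used only for this example.

```python
import threading


class SampleGuard:
    """Allows at most one open sample across all sessions (hypothetical)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._owner = None  # session id that currently owns the counters

    def try_begin(self, session_id):
        # Reject the request instead of silently sharing the counters.
        with self._lock:
            if self._owner is not None:
                return False
            self._owner = session_id
            return True

    def end(self, session_id):
        with self._lock:
            if self._owner != session_id:
                raise RuntimeError("session does not own the active sample")
            self._owner = None
```

A caller that gets `False` back knows its counter values would have been polluted by the other session's workload, rather than receiving surprising numbers.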

Which rendering API are you using? Can you provide pseudocode to demonstrate how you're executing the various command lists of each pipeline?

I think you may be best served by trying to create a single sample from a single session over the entire pipeline, because this will allow the workloads to be scheduled as usual, and you'll be able to get some insight into how the two pipelines are actually competing for resources. This is particularly useful if you first collect the counters over each pipeline individually, then compare those results against the two running in parallel.
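One possible way to do that comparison, sketched under the assumption that you have a duration-style value for each of the three runs (pipeline A alone, pipeline B alone, both together). The function name `overlap_fraction` and the metric itself are illustrative choices, not something GPA reports.

```python
def overlap_fraction(time_a, time_b, time_combined):
    """Estimate how much of the shorter pipeline was hidden by running in parallel.

    1.0: the combined run took no longer than the slower pipeline alone.
    0.0: the combined run took at least as long as running both back to back.
    """
    if min(time_a, time_b, time_combined) <= 0:
        raise ValueError("durations must be positive")
    saved = time_a + time_b - time_combined
    fraction = saved / min(time_a, time_b)
    # Clamp so measurement noise cannot push the result outside [0, 1].
    return max(0.0, min(1.0, fraction))
```

For example, standalone times of 4 and 6 units with a combined time of 8 give `overlap_fraction(4.0, 6.0, 8.0) == 0.5`. A value near 0 suggests the pipelines are contending heavily (or are not actually being scheduled in parallel), while a value near 1 suggests they overlap almost for free.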

There is a special GPA integration into Microsoft PIX that can help you get more insight if you're using DirectX12, but unfortunately some of the functionality is not available due to the hardware you want to run on (some of the same issues affecting RGP also impact GPA). On other hardware we're able to graph the number of waves that are actively running on the hardware, and can also graph when a particular resource becomes a limiting factor on being able to schedule more waves.

JinsuaFeito-dev commented 1 year ago

I'm using the Vulkan API and I'm actually creating two Vulkan instances, each one with one pipeline and its own GPA session connected to it. I'm submitting my command buffers to different compute queues, from different threads on Windows, and then waiting on a fence until the commands have finished.

When you talk about creating a single session, do you mean creating just one session and dispatching both workloads in the same command buffers? In that case, wouldn't it be wrong to assume that the resources needed when the shaders are executed standalone are the same as the ones needed when executed in parallel?

I have seen there is a very similar GPU (I think it's nearly the same, just not embedded), the RX 550, which is compatible with RGP. Do you think an analysis of that GPU with RGP could be extrapolated to the E9171 in order to know how the pipelines compete for HW resources? RX 550 (https://www.techpowerup.com/gpu-specs/radeon-rx-550.c2947) vs E9171 (https://www.techpowerup.com/gpu-specs/radeon-e9171-mcm.c3028). Thank you!!!!