OpenCL capture NOT WORKING

eladmaimoni commented 3 years ago

Environment:

Windows 10, AMD Radeon Pro WX7100, RDP v2.3.0.31

I wrote a simple test application that repeatedly enqueues the same kernel each second.

After starting the service and configuring the Developer Panel to perform OpenCL capturing, I ran the application.

For a split second the Developer Panel recognizes that the application is running; then the application is no longer highlighted and nothing is captured. Here is the debug log (CTRL + L):

Some general remarks:

This tool is very unintuitive. What happened to the simplicity of select application & arguments -> run via tool -> capture a profile?
I have to rely on the tool to detect that my application is running. This is a major burden and failure-prone.
I haven't yet heard of a single developer who managed to get the tool working properly with OpenCL, on this version or any other. Was this feature actually tested under various scenarios?
I would gladly help any AMD developer explore this issue, or have someone demonstrate to me that this tool works.

Kind regards, Elad

martinschwinzerl commented 3 years ago

Same here, quite literally. Windows 10, RX Vega 64, RDP v2.3.0.21. The application is written with OpenCL and compiled with MSVC 2019 Community Edition (64-bit).

The small test program launches two kernels sequentially (A - B -> repeat) in a loop using OpenCL. Starting to profile manually seems to do something, but neither output nor profiles are created, and profiling stops after 1-2 iterations of my program even though I set the dispatch limit to 100.

My normal development setup is based on ROCm (*buntu 20.04 Linux), but since the ROCm profiler is also not working as expected, I tried porting everything to Windows instead -> no luck either.

(I would have tried to use the Radeon GPU Profiler under Linux, but if I understand your feature support matrix correctly, RGP only works for Vulkan applications on Linux, even with the AMDGPU-pro driver, correct?)

I would be very much obliged if somebody could take a look at this!

Cheers, Martin

martinschwinzerl commented 3 years ago

@elad8a: Have you tried with earlier versions of the profiler / driver? I am very reluctant to sink too much time into this, but if you have not tried yet, this may be worth a shot. Thank you!

martinschwinzerl commented 3 years ago

Poking around in the debug log a little longer yields this message on a different "tab" (SelectBox with "Thread [5]: 3044" in my case; I strongly second the OP's sentiment about the intuitiveness of the interface, btw.):

[screenshot: counters_not_supported]

Please take note of the "Counters are not supported on the current device" message -> Is this a driver issue? My driver version is 27.20.21002.112

Again, thanks for looking into this issue!

Cheers, Martin

eladmaimoni commented 3 years ago

@martinschwinzerl I tried this tool on many driver, HW and software configurations.

It never worked. I remember that it once output a few timelines, but it never managed to collect more than a few calls.

I still use the old CodeXL which partially works on old hardware.

The main problem here is the lack of care & support on AMD's side for everything related to OpenCL. Issues & bugs are rarely given attention. See this post for example: 22 days and not a single response from the development team.

mguerret-amd commented 3 years ago

@martinschwinzerl @elad8a It may be easier to debug if you can provide the full log text which can be located here: %AppData%\RadeonDeveloperPanel\log.txt
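For anyone collecting the log from a script, a minimal Python sketch follows. It assumes only the path quoted above; the helper names `panel_log_path` and `find_lines` are hypothetical, and the search string is taken from the message text quoted earlier in this thread.

```python
import os
from pathlib import Path


def panel_log_path(appdata=None):
    """Return the log path quoted above: %AppData%/RadeonDeveloperPanel/log.txt."""
    # Fall back to the APPDATA environment variable (normally set on Windows).
    base = appdata if appdata is not None else os.environ.get("APPDATA", "")
    return Path(base) / "RadeonDeveloperPanel" / "log.txt"


def find_lines(path, needle):
    """Return (line_number, text) pairs for lines containing `needle`, ignoring case."""
    hits = []
    with open(path, "r", encoding="utf-8", errors="replace") as fh:
        for number, line in enumerate(fh, start=1):
            if needle.lower() in line.lower():
                hits.append((number, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    log = panel_log_path()
    if log.is_file():
        # Hypothetical search term, based on the message quoted in this thread.
        for number, text in find_lines(log, "not supported"):
            print(f"{number}: {text}")
    else:
        print(f"No log found at {log}")
```

The file is read as plain text with undecodable bytes replaced, since the log's encoding and format are not documented here.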

martinschwinzerl commented 3 years ago

@mguerret-amd Thank you for your reply. Please find my log file attached to this message. Please let me know if you need any further information or if there are questions.

log.txt

chesik-amd commented 3 years ago

@martinschwinzerl The "Counters not supported on current device" is informational. Your device is a Vega device, and the cache counters feature is only supported on Navi10 devices and newer.

It's also worth noting that in the current release the cache counters are not supported for OpenCL either -- we are planning to add this support in a future release. But it will also require a Navi10 or newer GPU.

martinschwinzerl commented 3 years ago

@chesik-amd: Thanks for the information. I do not need cache-level metrics right now (they would of course be nice down the road), but it would be great to figure out the sources of spilling and branching inside kernels, and for that, RGP should hopefully work on a Vega 64?

chesik-amd commented 3 years ago

@martinschwinzerl: RGP will report register usage and indicate if scratch memory is used for each kernel (this is shown on the Pipeline state pane). This should work fine on Vega 64. Also, if you haven't used it yet, Radeon GPU Analyzer may be of interest to you for statically analyzing kernels.

As for your original issue, failing to capture profile data: Our QA has tested several OpenCL apps on Vega with current drivers and is not seeing issues. It is possible that your app has some characteristic that is causing problems for the Profiling workflow. Would you be able to provide a test case that fails to capture? If so, we can take a look and hopefully figure out what is going wrong.

martinschwinzerl commented 3 years ago

@chesik-amd Thank you for your reply & sorry for the delayed response on my part. I've prepared a minimal test case for which I am not able to capture any profiles. It would be great if you and your colleagues could have a look.

(Note: cmake >= 3.7 is required for this test case) Repository: https://github.com/martinschwinzerl/faddeevas_opencl

Build instructions:

git clone https://github.com/martinschwinzerl/faddeevas_opencl.git
cd faddeevas_opencl
mkdir build 
cd build
cmake .. -G "NMake Makefiles"
nmake

Running the run_cerrf.exe program without any parameters prints out the command line parameter list and all available OpenCL devices. On my system, the RX Vega 64 has platform_id = 0 and device_id = 0. Running the application for 51200000 work-items and 200 iterations, i.e.,

run_cerrf.exe 0 0 51200000 200 base

gives reasonably long run-times for the kernel:

fastest run: 564392 microseconds
slowest run: 623619 microseconds
median     : 565740 microseconds

Line 11 of the CMakeLists.txt file defines a variable that can take additional run-time compiler flags to be passed to clBuildProgram. Currently, it defaults to -save-temps, which seems to work.
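To put the three timings above in perspective, a small Python sketch can compute the spread. The numbers are the ones reported above; the `summarize` helper is hypothetical and would normally receive the full list of per-iteration timings, which is not included in this comment.

```python
from statistics import median


def summarize(timings_us):
    """Return fastest, slowest, median and relative spread for timings in microseconds."""
    fastest = min(timings_us)
    slowest = max(timings_us)
    return {
        "fastest": fastest,
        "slowest": slowest,
        "median": median(timings_us),
        # How much slower the slowest run is than the fastest, in percent.
        "spread_percent": 100.0 * (slowest - fastest) / fastest,
    }


if __name__ == "__main__":
    # Only the three reported values are available, so they stand in for the full list.
    reported = [564392, 565740, 623619]
    stats = summarize(reported)
    print(f"slowest run is {stats['spread_percent']:.1f}% slower than the fastest")
    print(f"median is {stats['median'] / 1e6:.3f} s per kernel run")
```

With the reported values, the slowest run is roughly 10% slower than the fastest, and a single kernel run takes a little over half a second.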

martinschwinzerl commented 3 years ago

@chesik-amd I am aware of the Radeon GPU Analyzer and have used it to get a rough picture of vector / scalar register usage. I have some questions regarding its handling of included functions in the kernel (I will open a separate issue on the correct repository).

The most pressing issue for us is identifying the computationally most expensive lines in the kernel and estimating the portion of the run-time that threads in a wavefront spend idling due to thread divergence / data-dependent branching. My understanding is that RGA can give us an estimate for the former (based on analysing the disassembled compiled kernel) but not the actual run-time cost for specific arguments to the kernel, correct?

We've done some analysis on Intel GPUs using their VTune profiler, and since we try to keep a single code base across all pertinent run-times, it would be extremely helpful to evaluate the run-time impact on AMD of the optimisations done for the Intel platform (especially since wavefrontsize == 64 could get really bad if branch divergence is still an issue).
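The concern about a wavefront size of 64 can be illustrated with a toy cost model in Python. It is a deliberate simplification, not a description of how any specific GPU schedules work: it assumes a wavefront executes both sides of an if/else whenever its lanes disagree, and that idle lanes still occupy the wavefront. The function name, the workload and the cost values are all hypothetical.

```python
import random


def lane_utilization(takes_branch, wavefront_size, cost_then=1.0, cost_else=1.0):
    """Fraction of lane-time doing useful work for an if/else under a simple SIMD model.

    takes_branch: list of booleans, one per work-item.
    A wavefront pays for each side of the branch that at least one of its lanes takes.
    """
    useful = 0.0
    spent = 0.0
    for start in range(0, len(takes_branch), wavefront_size):
        lanes = takes_branch[start:start + wavefront_size]
        n_then = sum(lanes)
        n_else = len(lanes) - n_then
        # Time the whole wavefront is busy: every side taken by any lane is executed.
        busy = (cost_then if n_then else 0.0) + (cost_else if n_else else 0.0)
        spent += busy * wavefront_size
        useful += n_then * cost_then + n_else * cost_else
    # An empty workload wastes nothing, so report full utilization.
    return useful / spent if spent else 1.0


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical workload: about 1 in 16 work-items takes an expensive path.
    flags = [rng.random() < 1 / 16 for _ in range(64 * 1024)]
    for width in (8, 32, 64):
        print(width, round(lane_utilization(flags, width, cost_then=10.0), 3))
```

In this model, a wider wavefront is more likely to contain at least one lane that takes the expensive path, so utilization tends to drop as the width grows. A profiler trace would be needed to see how closely a real kernel follows this.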

martinschwinzerl commented 3 years ago

@elad8a Sorry for hijacking your issue. Please let me know if you would prefer that we move the discussion to a different issue.

I hope that any progress we might get is useful for you but would understand if this goes into a different direction than you had intended.

martinschwinzerl commented 3 years ago

Sorry again, I renamed the repository to https://github.com/martinschwinzerl/faddeevas_gpu since we are also testing with HIP and rocprofiler. Apparently the old link still works, but in case anybody has issues, please use the new link.

Apologies for the inconvenience.

chesik-amd commented 3 years ago

Thanks @martinschwinzerl. I've built your app, and a first look suggests the capture issues are related to the amount of VRAM being reserved to collect profiling data. It looks like the amount is not large enough to capture even one dispatch's worth of data. There may be additional issues as well.

I did notice that using "Auto capture" with only a few dispatches (I used a dispatch range of 3-5) has a higher chance of working than the default settings. You can change to auto capture mode and set the dispatch range on the "Profiling" tab of the workflow settings in RadeonDeveloperPanel. I'd be interested in knowing if you have better luck using auto capture with a small dispatch range.

eladmaimoni commented 3 years ago

@chesik-amd This issue also appears on RDNA (Radeon Pro W5700). Will this be fixed in the next release?

Also, I would strongly suggest allowing users to capture profiles on demand rather than forcing them to capture all OpenCL calls since application startup. Most of the time we are interested in a particular kernel or piece of code that is not necessarily invoked at program start.

Thank you Elad

martinschwinzerl commented 3 years ago

Thank you for testing my application and for the hint regarding capturing! I can report some progress; capturing now works if:

only manual capturing is enabled
dispatch counter is set firmly to one

I can open the captured profiles & do some analysis, which is definitely an improvement, so thanks for your help in getting to this state. Still, I have some remaining questions that hinder the kind of analysis I would like to perform. Any feedback on these would be very much appreciated:

1) While waiting for a capture to start, using either the default Profiling workflow or a custom workflow with profiling enabled, the box for "Enable instruction tracing" is displayed but grayed out:

[screenshot: waiting_for_capture]

Once the running application is registered, the box disappears and can't be checked:

[screenshot: running_capture]

Apparently there is an option for instruction tracing in the settings file %AppData%\RadeonDeveloperPanel\settings.ini. Setting the value to true makes the checkbox appear checked while waiting for the capture to start, but it still does not appear to have any effect.

I presume that this is not the correct way to enable instruction tracing. Is there anything that has to be passed in the RTC compiler flags (e.g. "-g")? Or are there additional settings in the RGP configuration that need to be changed?

Also, there is no ISA information displayed when viewing the profile in the Pipeline state pane, and there seems to be no instruction timing information present in the profile. Again, I presume this is all related to the "Enable instruction tracing" option being unavailable during capture.

Any help in getting this to work as well would be a huge step forward, thanks for your help!

Cheers, Martin

GPUOpen-Tools / radeon_gpu_profiler

OpenCL capture NOT WORKING #56