HSAFoundation / HSA-Drivers-Linux-AMD

These drivers have been superseded by ROCm Platform now hosted at Radeon Open Compute GitHub Repo
https://github.com/RadeonOpenCompute

High latencies #18

Open harald-lang opened 8 years ago

harald-lang commented 8 years ago

I noticed very high latencies for kernel dispatches using AQL. Synchronous dispatches take up to 21 µs. Asynchronous (batch) dispatches help to hide latencies. However, kernel dispatching still takes 6 µs on average, which is still far too slow for fine-grained offloading.

In my experiments I set HSA_ENABLE_INTERRUPT to 0, which greatly improves robustness of the kernel offload times. With interrupts enabled, latencies vary from 6 to 15 microseconds.

System setup:

harald-lang commented 8 years ago

just FYI... the same experiment with a discrete NVidia card connected via PCIe takes 2 microseconds.

harald-lang commented 8 years ago

Update: Setting the iGPU's frequency from "auto" to 720 MHz reduced the dispatch time to ~3 µs. Installing a dGPU to connect the display doesn't have an impact at all.

aditya4d1 commented 8 years ago

Hi, Interesting observation. Are you still facing this?

harald-lang commented 8 years ago

Hi, yes, nothing has changed so far. I wonder if something is wrong with my system configuration. It would be very interesting to know whether other HSA developers are also facing this.

aditya4d1 commented 8 years ago

Hi, can you try spawning a new thread (running an empty function) and timing it? The CPU may be the bottleneck. You can also try the ROC drivers with an AMD discrete card after disabling integrated graphics in the BIOS. Thank you for profiling!! (Y)
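A minimal sketch of that check, assuming standard C++11 threads: it times how long it takes to create and join a thread that runs an empty function. The helper name `mean_thread_spawn_us` is hypothetical and is not taken from hsa-lab or the HSA runtime.

```cpp
#include <chrono>
#include <thread>

// Hypothetical helper: average cost, in microseconds, of creating and
// joining a thread whose body is an empty function.
double mean_thread_spawn_us(int iterations) {
    using clock = std::chrono::steady_clock;  // monotonic, suited to intervals
    const auto start = clock::now();
    for (int i = 0; i < iterations; ++i) {
        std::thread t([] {});  // empty function body
        t.join();              // wait until the thread has finished
    }
    const auto stop = clock::now();
    const std::chrono::duration<double, std::micro> total = stop - start;
    return total.count() / iterations;
}
```

Calling `mean_thread_spawn_us(1000)` and comparing the result with the measured dispatch times gives a rough idea of whether thread creation on the CPU side could account for a noticeable share of the latency.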


Regards,

Aditya Atluri,

USA.

gstoner commented 8 years ago

The ROC driver only runs with a Haswell CPU and a Fiji-based GPU. I am having the team look into this to see whether it is a regression, but it would help to know which NVIDIA GPU (we need the model number) you are comparing the APU to. Which APU is it, and what is its model number?


aditya4d1 commented 8 years ago

Hi Greg,

How about disabling integrated graphics on APU?


gstoner commented 8 years ago

Are you trying to run a Fiji card on an APU? We are only testing Fiji cards with Xeon E5 v3, Xeon E3, i7, and i5 CPUs (Haswell or newer), since we need PCIe Gen 3 platform atomics with the ROC driver and runtime.


aditya4d1 commented 8 years ago

Hi Harald, can you try rebuilding the driver? My guess is that the performance hit comes from spawning a new thread or from the ISA loader. Can you try profiling executable.cpp in the loader directory? Thanks!


aditya4d1 commented 8 years ago

Hi Greg, How about the new A10-7890K?

harald-lang commented 8 years ago

Hi Gregory, for the NVidia experiment we used a GTX 650 connected via PCIe 2.0. It was an entirely different system (which is out of my control). If you need more information about the system, I'll contact my colleague.

Alternatively, I can plug a GTX 970 into the APU system and re-run the measurements...

@adityaatluri I'm going to profile the system as requested. I'll post the results ASAP.

aditya4d1 commented 8 years ago

Hi Harald, Quick question. What is your target setup? For which configuration do you want to solve this issue? APU? Or AMD GPU? Or NVIDIA GPU?

gstoner commented 8 years ago

Can I get the test you're running? I can do an A/B test on the same hardware with Fiji vs. Titan X.


harald-lang commented 8 years ago

Hi Aditya, my target setup is an APU system.

aditya4d1 commented 8 years ago

Hi Harald, can you please provide the code you are trying to profile, or code that reproduces the same behavior? A usual AQL dispatch consists of writing to queue memory and ringing the doorbell. Also, just to make sure: there are no other kernel dispatches going on in the system, right? Thanks!
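The "write to queue memory, then ring the doorbell" pattern can be illustrated with a small CPU-only mock. This is only a sketch under stated assumptions: `MockPacket`, `MockQueue`, `dispatch`, and `drain` are hypothetical names, the packet layout is not the real AQL packet format, and the queue-full check a real producer needs is omitted.

```cpp
#include <atomic>
#include <cstdint>
#include <vector>

// Mock packet: NOT the real AQL kernel dispatch packet layout.
struct MockPacket {
    std::atomic<uint16_t> header{0};  // 0 = invalid, 1 = ready for the consumer
    uint64_t kernel_object = 0;       // stand-in for the kernel handle
};

// Mock user-mode queue: a power-of-two ring buffer plus a doorbell value.
struct MockQueue {
    explicit MockQueue(uint32_t size) : packets(size), mask(size - 1) {}
    std::vector<MockPacket> packets;
    uint64_t mask;
    std::atomic<uint64_t> write_index{0};
    std::atomic<uint64_t> read_index{0};
    std::atomic<int64_t> doorbell{-1};  // index of the last published packet
};

// Producer side: reserve a slot, fill it, publish the header, ring the doorbell.
// Assumption: the queue never fills up (no check against read_index).
uint64_t dispatch(MockQueue& q, uint64_t kernel_object) {
    const uint64_t index = q.write_index.fetch_add(1, std::memory_order_relaxed);
    MockPacket& slot = q.packets[index & q.mask];
    slot.kernel_object = kernel_object;               // packet body first
    slot.header.store(1, std::memory_order_release);  // header last
    q.doorbell.store(static_cast<int64_t>(index), std::memory_order_release);
    return index;
}

// Stand-in for the packet processor: consume packets up to the doorbell value.
uint64_t drain(MockQueue& q) {
    const int64_t last = q.doorbell.load(std::memory_order_acquire);
    uint64_t read = q.read_index.load(std::memory_order_relaxed);
    uint64_t processed = 0;
    while (static_cast<int64_t>(read) <= last) {
        MockPacket& slot = q.packets[read & q.mask];
        if (slot.header.load(std::memory_order_acquire) != 1) break;
        slot.header.store(0, std::memory_order_relaxed);  // slot is free again
        ++read;
        ++processed;
    }
    q.read_index.store(read, std::memory_order_release);
    return processed;
}
```

On the producer side the work is a handful of memory writes, which is why the dispatch path itself is expected to be cheap and the measured microseconds are more likely spent elsewhere.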

harald-lang commented 8 years ago

Hi Aditya and Gregory,

I pushed the code to https://github.com/harald-lang/hsa-lab.

Quickstart instructions:

The output is a little verbose. The important lines start with milliseconds/dispatch = ???.

The dispatch functions can be found in src/rts/hsa/HsaContext.hpp.

harald-lang commented 8 years ago

Update: I ran the tests on a different machine (Godavari APU 7870K on an MSI A88XM mainboard), and it seems that the trick of setting the GPU frequency manually does not work here. Dispatch times vary from 6 to 12 µs (sync) and from 3 to 6 µs (batch).

aditya4d1 commented 8 years ago

Hi Harald, Thank you for the update. The code looks good. Can you try profiling the vector copy sample? Just to make sure that the system is working fine. Thank you!

harald-lang commented 8 years ago

Hi Aditya,

the vector_copy sample runs without errors.

kaveri: ~/git/HSA-Runtime-AMD/sample$ ./vector_copy 
Initializing the hsa runtime succeeded.
Checking finalizer 1.0 extension support succeeded.
Generating function table for finalizer succeeded.
Getting a gpu agent succeeded.
Querying the agent name succeeded.
The agent name is Spectre.
Querying the agent maximum queue size succeeded.
The maximum queue size is 131072.
Creating the queue succeeded.
Create the program succeeded.
Adding the brig module to the program succeeded.
Query the agents isa succeeded.
Finalizing the program succeeded.
Destroying the program succeeded.
Create the executable succeeded.
Loading the code object succeeded.
Freeze the executable succeeded.
Extract the symbol from the executable succeeded.
Extracting the symbol from the executable succeeded.
Extracting the kernarg segment size from the executable succeeded.
Extracting the group segment size from the executable succeeded.
Extracting the private segment from the executable succeeded.
Creating a HSA signal succeeded.
Registering argument memory for input parameter succeeded.
Registering argument memory for output parameter succeeded.
Finding a kernarg memory region succeeded.
Allocating kernel argument memory buffer succeeded.
Dispatching the kernel succeeded.
Passed validation.
Freeing kernel argument memory buffer succeeded.
Destroying the signal succeeded.
Destroying the executable succeeded.
Destroying the code object succeeded.
Destroying the queue succeeded.
Shutting down the runtime succeeded.

On the Godavari APU, the output looks exactly the same.

The output of the profiler can be found here: https://gist.github.com/harald-lang/b132a4df7863ad4523f2

... by the way... thank you very much for your help! :)

harald-lang commented 8 years ago

Hi Aditya,

I profiled the vector_copy as you suggested. Please refer to https://gist.github.com/harald-lang/b132a4df7863ad4523f2

aditya4d1 commented 8 years ago

Hi, can you remove the check call between start_kernel and end_kernel = clock() and re-run it? We don't want to profile stdio: https://gist.github.com/harald-lang/b132a4df7863ad4523f2
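One way to keep status checks and printing out of the timed region is to store the results while timing and inspect them afterwards. The sketch below assumes the measured call returns an integer status where 0 means success; `Measurement` and `time_calls` are hypothetical names, and it uses `std::chrono::steady_clock` instead of the `clock()` call from the sample.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical result type: mean time per call and number of failed calls.
struct Measurement {
    double mean_us;
    int failures;
};

// Times `iterations` calls of fn(). Status codes are only stored inside the
// timed loop; checking them (and any printing) happens after the clock stops.
template <typename Fn>
Measurement time_calls(Fn&& fn, int iterations) {
    const std::size_t n = static_cast<std::size_t>(iterations);
    std::vector<int> statuses(n);  // allocated before timing starts

    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < n; ++i) {
        statuses[i] = fn();  // no stdio and no checks in here
    }
    const auto stop = std::chrono::steady_clock::now();

    int failures = 0;
    for (int s : statuses) {
        if (s != 0) ++failures;  // assumption: 0 means success
    }
    const std::chrono::duration<double, std::micro> total = stop - start;
    return {total.count() / iterations, failures};
}
```

A caller would pass a lambda that wraps the dispatch, for example `time_calls([&] { return run_once(); }, 1000)`, where `run_once` is a placeholder for the code under test.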

harald-lang commented 8 years ago

Hi Aditya, please, don't tell anyone ;)

I updated the profile at https://gist.github.com/harald-lang/b132a4df7863ad4523f2#gistcomment-1694953

Unfortunately, the results are approx. the same.

aditya4d1 commented 8 years ago

Hi Harald, sure. I am sorry to put you through this. I wanted to make sure that different applications show the same behavior. Now it's confirmed that the issue is with either the drivers or the GPU command processor speed. I'll get back to you. Thank you for the time and effort you put into this.

aditya4d1 commented 8 years ago

Hi Harald, Check this comment: https://gist.github.com/harald-lang/b132a4df7863ad4523f2#gistcomment-1694953

Also, here are the numbers from our runs on Titan and APU. For Titan, a single dispatch takes 40 µs and a batch dispatch takes 11.7 µs. For the APU, a single dispatch takes 8 µs and a batch dispatch takes 3 µs.