Open — harald-lang opened this issue 8 years ago
just FYI... the same experiment with a discrete NVidia card connected via PCIe takes 2 microseconds.
Update: Setting the iGPU's frequency from "auto" to 720 MHz reduced the dispatch time to ~ 3µs. Installing a dGPU to connect the display doesn't have an impact at all.
Hi, Interesting observation. Are you still facing this?
Hi, yes, nothing changed so far. I wonder if something is wrong with my system configuration. It would be very interesting to know, if other HSA developers are also facing this.
Hi, Can you try spawning a new thread (running an empty function) and timing it? It may be a CPU bottleneck. You can also try the ROC drivers with an AMD discrete card, with integrated graphics disabled in the BIOS. Thank you for profiling!! (Y)
Regards,
Aditya Atluri,
USA.
The ROC driver only runs with a Haswell CPU and a FIJI-based GPU. I am having the team look into this to see if it is a regression, but it would help to know which NVIDIA GPU (model number needed) you are comparing the APU to. Also, which APU is it, and what is its model number?
greg
Hi Greg,
How about disabling integrated graphics on the APU?
Are you trying to run a FIJI card on an APU? We are only testing the FIJI card with Xeon E5 v3, Xeon E3, i7, or i5 Haswell or newer, since we need PCIe Gen 3 platform atomics with the ROC driver and runtime.
greg
Hi Harald, Can you try rebuilding the driver? My guess is that the performance hit comes from new thread spawning or the ISA loader. Can you try profiling executable.cpp in the loader directory? Thanks!
Hi Greg, How about the new A10-7890K?
Hi Gregory, for the NVidia experiment we used a GTX 650 connected via PCIe 2.0. It was an entirely different system (which is out of my control). If you need more information about the system, I'll contact my colleague.
Alternatively, I can plug in a GTX 970 into the APU system and re-run the measurements...
@adityaatluri I'm going to profile the system as requested. I'll post the results ASAP.
Hi Harald, Quick question. What is your target setup? For which configuration do you want to solve this issue? APU? Or AMD GPU? Or NVIDIA GPU?
Can I get the test you're running? I can do an A/B test on the same hardware with Fiji vs. Titan X.
Hi Aditya, my target setup is an APU system.
Hi Harald, Can you please provide the code you are trying to profile, or code that can simulate the same behavior? The usual AQL dispatch is just writing to queue memory and ringing the doorbell. Also, just making sure: there are no other kernel dispatches going on in the system, right? Thanks!
Hi Aditya and Gregory,
I pushed the code to https://github.com/harald-lang/hsa-lab
.
Quickstart instructions:
- Source env.sh: . ./env.sh
- Make sure cloc is installed, then run: make bin/tester
- Run the benchmark: bin/tester --gtest_filter=*Perf*
The output is a little verbose; the important lines start with milliseconds/dispatch = ???.
The dispatch functions can be found in src/rts/hsa/HsaContext.hpp.
Update: I ran the tests on a different machine (Godavari APU 7870K on an MSI A88XM mainboard), and it seems the trick of setting the GPU frequency manually does not work here. Dispatch times vary from 6 to 12 µs (sync) and 3 to 6 µs (batch).
Hi Harald, Thank you for the update. The code looks good. Can you try profiling the vector copy sample? Just to make sure that the system is working fine. Thank you!
Hi Aditya,
the vector_copy sample runs without errors.
kaveri: ~/git/HSA-Runtime-AMD/sample$ ./vector_copy
Initializing the hsa runtime succeeded.
Checking finalizer 1.0 extension support succeeded.
Generating function table for finalizer succeeded.
Getting a gpu agent succeeded.
Querying the agent name succeeded.
The agent name is Spectre.
Querying the agent maximum queue size succeeded.
The maximum queue size is 131072.
Creating the queue succeeded.
Create the program succeeded.
Adding the brig module to the program succeeded.
Query the agents isa succeeded.
Finalizing the program succeeded.
Destroying the program succeeded.
Create the executable succeeded.
Loading the code object succeeded.
Freeze the executable succeeded.
Extract the symbol from the executable succeeded.
Extracting the symbol from the executable succeeded.
Extracting the kernarg segment size from the executable succeeded.
Extracting the group segment size from the executable succeeded.
Extracting the private segment from the executable succeeded.
Creating a HSA signal succeeded.
Registering argument memory for input parameter succeeded.
Registering argument memory for output parameter succeeded.
Finding a kernarg memory region succeeded.
Allocating kernel argument memory buffer succeeded.
Dispatching the kernel succeeded.
Passed validation.
Freeing kernel argument memory buffer succeeded.
Destroying the signal succeeded.
Destroying the executable succeeded.
Destroying the code object succeeded.
Destroying the queue succeeded.
Shutting down the runtime succeeded.
On the Godavari APU, the output looks exactly the same.
The output of the profiler can be found here: https://gist.github.com/harald-lang/b132a4df7863ad4523f2
... by the way... thank you very much for your help! :)
Hi Aditya,
I profiled the vector_copy as you suggested. Please refer to https://gist.github.com/harald-lang/b132a4df7863ad4523f2
Hi, Can you remove the check call between start_kernel and end_kernel = clock() and re-run it? We don't want to profile stdio. https://gist.github.com/harald-lang/b132a4df7863ad4523f2
Hi Aditya, please, don't tell anyone ;)
I updated the profile at https://gist.github.com/harald-lang/b132a4df7863ad4523f2#gistcomment-1694953
Unfortunately, the results are approx. the same.
Hi Harald, Sure. I am sorry to put you through this; I wanted to make sure that different applications show the same behavior. It's now confirmed that the issue lies with either the drivers or the GPU command processor speed. I'll get back to you. Thank you for the time and effort you put into this.
Hi Harald, Check this comment: https://gist.github.com/harald-lang/b132a4df7863ad4523f2#gistcomment-1694953
Also, here are the numbers we measured on the Titan and the APU. For the Titan, a single dispatch is 40 µs and a batch dispatch is 11.7 µs. For the APU, a single dispatch is 8 µs and a batch dispatch is 3 µs.
I noticed very high latencies for kernel dispatches using AQL. Synchronous dispatches take up to 21 µs. Asynchronous (batch) dispatches help to hide latencies; however, kernel dispatching still takes 6 µs on average, which is far too slow for fine-grained offloading.
In my experiments I set HSA_ENABLE_INTERRUPT to 0, which greatly improves the robustness of the kernel offload times. With interrupts enabled, latencies vary from 6 to 15 microseconds.
System setup: