Closed. arfio closed this 2 months ago
I can't reproduce this with a relatively large kernel. Rocprofiler submits additional packets to the hsa_queue, which forces sched_yield() calls. There are no additional switches to kernel space.
@arfio Apologies for the lack of response. Can you please check if your issue still exists with the latest ROCm 6.2? If resolved, please close the ticket. Thanks!
@arfio Closing ticket. Please feel free to re-open a ticket if you still see the issue with the latest ROCm. Thanks!
When running an MPI program with rocprof, the user time is 39% lower than without it. The Linux kernel trace recorded with the LTTng tracer shows that the main process of each rank spends half of its time waiting when rocprof is enabled, whereas it stays in the running state without it. Synchronizing the Linux kernel trace with the rocprof trace shows that this happens during the memory transfer calls.
In the images, blue indicates that the thread is in kernel mode, green indicates user mode, and a yellow line means that the thread is waiting.
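For anyone who wants to check the user-time difference on their own setup, here is a minimal sketch of one way to compare the two runs. It is not the method used for the numbers in this report. It assumes a Linux host and uses Python's standard `resource` module. The helper names `measure` and `percent_change` are made up for illustration, and the command to run is passed on the command line rather than hard-coded.

```python
import resource
import subprocess
import sys
import time


def measure(cmd):
    """Run cmd and return its wall, user, and system CPU time in seconds.

    RUSAGE_CHILDREN only covers child processes that have been waited for,
    so ranks started on other nodes by an MPI launcher are not counted.
    """
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    wall = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    return {
        "wall": wall,
        "user": after.ru_utime - before.ru_utime,
        "sys": after.ru_stime - before.ru_stime,
    }


def percent_change(baseline, profiled):
    """Relative change of the profiled value versus the baseline, in percent."""
    return 100.0 * (profiled - baseline) / baseline


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python measure.py <command> [args...]
    print(measure(sys.argv[1:]))
```

Running the script once on the plain MPI launch command and once on the same command under rocprof gives two `user` values to pass to `percent_change`. With illustrative numbers (not measurements), `percent_change(100.0, 61.0)` gives `-39.0`, which is a 39% reduction. A lower user time alongside a similar or higher wall time would match the waiting behaviour seen in the kernel trace.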