Open yupinov opened 6 years ago
When collecting performance counters, the profiler will introduce serialization to try to ensure that only one kernel is executing at a time. There is no option for this, as it is the default behavior.
What about measuring performance in a real-life environment, under concurrent execution?
Additionally, this seems to imply that traces in CodeXL can't be used to analyze kernel overlap?
Serialization is only done when collecting performance counters, which is the mode you would use to analyze the performance of individual kernels. No additional serialization is introduced when collecting a trace, which is the mode you would use to analyze an entire application, including kernel overlap.
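To illustrate what analyzing kernel overlap from a trace could look like, here is a minimal sketch. It assumes you have already extracted per-kernel start and end timestamps from a trace into plain `(start_ns, end_ns)` tuples; the function names and sample numbers are hypothetical and are not part of CodeXL or its trace format.

```python
from typing import List, Tuple

# One kernel execution as (start_ns, end_ns); assumed to come from a trace.
Interval = Tuple[int, int]


def busy_time(intervals: List[Interval]) -> int:
    """Time during which at least one kernel is running (union of intervals)."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Gap before this kernel: close the previous merged span.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Kernel overlaps or touches the current span: extend it.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def overlap_time(intervals: List[Interval]) -> int:
    """Time during which two or more kernels run concurrently."""
    events = []
    for start, end in intervals:
        events.append((start, 1))   # kernel begins
        events.append((end, -1))    # kernel ends
    # At equal timestamps an end (-1) sorts before a start (+1), so
    # back-to-back kernels are not counted as overlapping.
    events.sort()
    active = 0
    prev = 0
    total = 0
    for timestamp, delta in events:
        if active >= 2:
            total += timestamp - prev
        active += delta
        prev = timestamp
    return total


# Hypothetical timestamps: two overlapping kernels, then a third one alone.
concurrent = [(0, 100), (50, 150), (200, 300)]
# Hypothetical serialized run: kernels execute strictly one after another.
serialized = [(0, 100), (100, 200)]

print(busy_time(concurrent), overlap_time(concurrent))
print(busy_time(serialized), overlap_time(serialized))
```

If the overlap computed from a trace is zero for work submitted to multiple queues, the kernels did not actually run concurrently in that run.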
I see. I'd suggest allowing serialization to be turned on/off.
Is there a way to measure wall-time only without serialization?
Is there an option to make all kernels execute sequentially, especially when work is launched in multiple queues? Coming from CUDA and nvprof, I was surprised not to find such a feature, which would help in understanding kernel performance.