NVIDIA / spark-rapids-tools

User tools for Spark RAPIDS
Apache License 2.0

[FEA] show pure Scan op time that excludes GPU semaphore wait time #714

Open wjxiz1992 opened 9 months ago

wjxiz1992 commented 9 months ago

Is your feature request related to a problem? Please describe.

When analyzing performance from the GPU kernel point of view, we want to know the actual compute time of a kernel. Currently, however, the Scan op time includes the GPU semaphore wait time, which distorts the performance analysis.

(three screenshots attached)

Describe the solution you'd like

Provide 2 views: one for op time with GPU semaphore wait time, the other without.

With such a clear view, kernel devs can quickly identify kernel perf issues from the op time.
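A minimal sketch of the two views, assuming the semaphore wait is available as a separate per-operator value. The class and field names (`OpMetrics`, `op_time_ms`, `semaphore_wait_ms`) are hypothetical, not actual spark-rapids metric names, and the numbers are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class OpMetrics:
    # Hypothetical per-operator metrics; names are illustrative only.
    name: str
    op_time_ms: float         # op time as currently reported (includes the wait)
    semaphore_wait_ms: float  # time spent waiting on the GPU semaphore


def two_views(m: OpMetrics) -> dict:
    """Return op time with and without the GPU semaphore wait."""
    # Clamp at zero in case the reported wait exceeds the reported op time.
    pure = max(m.op_time_ms - m.semaphore_wait_ms, 0.0)
    return {"with_wait_ms": m.op_time_ms, "without_wait_ms": pure}


# Made-up numbers for a Scan operator.
scan = OpMetrics(name="Scan", op_time_ms=1200.0, semaphore_wait_ms=450.0)
views = two_views(scan)
print(views)
```

This only works if the wait is measurable per operator, which is the open question in the comments below.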

wjxiz1992 commented 9 months ago

cc @winningsix

wjxiz1992 commented 9 months ago

I'm not sure whether the semaphore wait time is tracked per operation or at the stage level. If it is not tracked per operation, this may require a code refactor on the spark-rapids side.

revans2 commented 9 months ago

Op time should not include the GPU semaphore wait time. If it does, we need to fix it for each operator where it happens.

We had a semaphore wait time metric per operator, but it was really hard to maintain, and impossible to do in all cases. If someone wants to try to put it back in, I would suggest setting a thread-local metric for it when processing and removing it when done, instead of trying to pass the metric around, since that got to be really hard to maintain.