Closed scxiao closed 6 months ago
Does cudagraph work on AMD GPUs?
Yes. On AMD GPUs, PyTorch's cudagraph path calls hipGraph.
Have you tried this cuda/hip_graph version of do_bench()? Is there any difference?
The overhead is much smaller than when using cuda.event alone. For example, for the same kernel, the times measured with cuda.event are:
[0.1031, 0.1054, 0.0689, 0.1085, 0.0767, 0.1069, 0.1039, 0.1060, 0.1031, 0.1013, 0.1041, 0.1115]
and the times measured with cuda_graph are:
[0.0239, 0.0239, 0.0239, 0.0238, 0.0242, 0.0240, 0.0234, 0.0240, 0.0239, 0.0237]
The variance of the measurements is also smaller.
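To make the comparison concrete, the mean and spread of the two sets of measurements above can be computed with a few lines of standard-library Python. This is only a sketch for illustration: the variable names are made up here, and the numbers are copied verbatim from this comment, with no unit assumed.

```python
import statistics

# Timings copied from the comment above (unit as reported; not assumed here).
event_times = [0.1031, 0.1054, 0.0689, 0.1085, 0.0767, 0.1069,
               0.1039, 0.1060, 0.1031, 0.1013, 0.1041, 0.1115]
graph_times = [0.0239, 0.0239, 0.0239, 0.0238, 0.0242,
               0.0240, 0.0234, 0.0240, 0.0239, 0.0237]


def summarize(samples):
    """Return mean, sample standard deviation, and relative spread (stdev / mean)."""
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)
    return {"mean": mean, "stdev": stdev, "rel_spread": stdev / mean}


event_summary = summarize(event_times)
graph_summary = summarize(graph_times)

for name, s in (("cuda.event", event_summary), ("cuda_graph", graph_summary)):
    print(f"{name:>10}: mean={s['mean']:.4f} "
          f"stdev={s['stdev']:.4f} rel_spread={s['rel_spread']:.2%}")
```

Both the absolute and the relative spread come out smaller for the cuda_graph series, which matches the observation about variance.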
Were cuda.event and cuda_graph measuring the same application? What is the time from rocprof?
cuda_graph does not measure time itself. Its purpose is to reduce kernel launch overhead when there are many back-to-back kernel launches. Here we still use cuda.event to measure the time, but the events wrap the execution of multiple kernels captured in the cuda_graph. The code for the time measurement is at: https://github.com/ROCm/triton/blob/bcde44f119b37fe438040c78913fc6455db5df26/python/triton/testing.py#L69-L77.
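The idea can be sketched with PyTorch's public graph and event APIs. This is a minimal illustration under stated assumptions, not the code at the link above: the helper name `bench_with_graph`, the `n_repeat` parameter, and the toy kernel are all hypothetical, and running it requires a CUDA or ROCm build of PyTorch with a GPU available.

```python
import torch


def bench_with_graph(fn, n_repeat=10):
    """Hypothetical helper: mean time (ms) per call of fn, measured over one graph replay."""
    # Work on a side stream, as graph capture cannot use the default stream.
    with torch.cuda.stream(torch.cuda.Stream()):
        fn()  # warm-up so lazy initialization is not captured
        torch.cuda.synchronize()

        # Capture n_repeat back-to-back launches into a single graph.
        graph = torch.cuda.CUDAGraph()
        with torch.cuda.graph(graph):
            for _ in range(n_repeat):
                fn()
        torch.cuda.synchronize()

        # The events wrap one replay, so launch overhead is paid once,
        # not once per kernel.
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        graph.replay()
        end.record()
        torch.cuda.synchronize()
        return start.elapsed_time(end) / n_repeat


if __name__ == "__main__":
    if torch.cuda.is_available():
        x = torch.randn(1024, 1024, device="cuda")
        y = torch.empty_like(x)
        # Toy kernel writing into a preallocated output buffer.
        print(bench_with_graph(lambda: torch.add(x, 1.0, out=y)))
    else:
        print("No GPU available; skipping the benchmark.")
```

Dividing the elapsed time of one replay by `n_repeat` amortizes the single launch cost over all captured kernels, which is why the per-kernel numbers come out lower and more stable than timing each launch separately.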
This change comes from the upstream PR https://github.com/openai/triton/pull/3306. It integrates those changes into this fork to get more accurate tuning results.