ROCm / triton

Development repository for the Triton language and compiler

Integrate cudagraph to autotuning #534

Closed by scxiao 6 months ago

scxiao commented 6 months ago

This change is ported from the upstream PR https://github.com/openai/triton/pull/3306. It integrates those changes into this fork for more accurate tuning results.

zhanglx13 commented 6 months ago

Does cudagraph work on AMD GPUs?

scxiao commented 6 months ago

> Does cudagraph work on AMD GPUs?

It calls hipGraph in PyTorch.
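
For context, a minimal sketch of the PyTorch graph API this refers to. On ROCm builds of PyTorch, the torch.cuda graph calls are backed by hipGraph, so the same capture/replay code runs on AMD GPUs; the matmul below is just a stand-in workload for illustration.

```python
import torch

x = torch.randn(4096, 4096, device="cuda")

# Warm up once so lazy initialization does not happen during capture.
y = x @ x
torch.cuda.synchronize()

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):   # graph capture (hipGraph capture on ROCm)
    y = x @ x               # stand-in workload, illustrative only
g.replay()                  # replay the captured launch
torch.cuda.synchronize()
```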

zhanglx13 commented 6 months ago

So have you tried this cuda/hip_graph version of do_bench()? Is there any difference?

scxiao commented 6 months ago

> So have you tried this cuda/hip_graph version of do_bench()? Is there any difference?

The overhead is much smaller than using cuda.event. For example, for the same kernel, the times measured with cuda.event are [0.1031, 0.1054, 0.0689, 0.1085, 0.0767, 0.1069, 0.1039, 0.1060, 0.1031, 0.1013, 0.1041, 0.1115], while the times from cuda_graph are [0.0239, 0.0239, 0.0239, 0.0238, 0.0242, 0.0240, 0.0234, 0.0240, 0.0239, 0.0237]. The time variance is also smaller.

zhanglx13 commented 6 months ago

Were cuda.event and cuda_graph measuring the same application? What is the time from rocprof?

scxiao commented 6 months ago

> Were cuda.event and cuda_graph measuring the same application? What is the time from rocprof?

cuda_graph does not measure time itself; it is used to reduce kernel launch overhead when there are many back-to-back kernel launches. Here we still use cuda.event to measure the time of the multiple kernel executions captured in the cuda_graph. The code for the time measurement is at https://github.com/ROCm/triton/blob/bcde44f119b37fe438040c78913fc6455db5df26/python/triton/testing.py#L69-L77.
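
For reference, a simplified sketch of the approach described above: the back-to-back kernel launches are captured into one graph, and cuda.event timestamps bracket a single replay of that graph. The function name, repeat count, and structure below are illustrative, not the fork's exact code; the linked testing.py is the authoritative version.

```python
import torch

def bench_with_graph(fn, n_repeat=100):
    # Illustrative sketch of graph-based benchmarking, not the fork's exact code.
    stream = torch.cuda.Stream()
    with torch.cuda.stream(stream):
        # Warm up once outside of capture so lazy initialization does not
        # happen while the graph is being recorded.
        fn()
        torch.cuda.synchronize()

        # Capture n_repeat back-to-back launches into one graph
        # (hipGraph on ROCm); replaying it has almost no per-launch CPU overhead.
        graph = torch.cuda.CUDAGraph()
        with torch.cuda.graph(graph):
            for _ in range(n_repeat):
                fn()

        # cuda.event still does the actual timing, bracketing one replay
        # of the whole graph.
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        graph.replay()
        end.record()
        torch.cuda.synchronize()
    return start.elapsed_time(end) / n_repeat  # average time per launch, in ms
```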