Open · joerowell opened 4 months ago
@joerowell We can add it later, after we merge this fork with upstream. For GEMM tuning, we have a dedicated script to tune GEMM kernels; you can refer to this README for more info, and let me know if you have more questions.
@jataylo @micmelesse This seems to be related to the nvsmi test failure. What is the status of that test?
Problem Description
The `estimate_matmul` functionality in Triton relies rather heavily on the underlying stats of the GPU. On CUDA platforms, this functionality is realised by calling `nvidia-smi` and then parsing the results. I see that this code is still present in this fork of Triton: https://github.com/ROCm/triton/blob/35edd6a650e3f6a56e3c2db9a54fdc2f8a6505e1/python/triton/testing.py#L12
Would it be possible to get support added for `rocm-smi` here instead? This would make autotuning Triton kernels for GEMM etc. much easier.

Operating System
-
CPU
-
GPU
AMD Instinct MI300X
ROCm Version
ROCm 6.0.0
ROCm Component
No response
Steps to Reproduce
No response
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response