Hi, I am getting an error when running the tune_gemm.py script.
I am inside a Docker container with access to 8 AMD MI300X GPUs (they show up when calling rocm-smi), and I have no problem running Triton.
The command I am running is: ./tune_gemm.py --gemm_size_file input.yaml --ngpus 8 --jobs 32 --verbose
The content of input.yaml is: - {'M': 16, 'N': 13312, 'K': 16384, 'rowMajorA': 'T', 'rowMajorB': 'T'}
Here is the produced stack trace (run with --jobs 1 to reduce its size, but it is similar with --jobs 32):
SIZE: 16 13312 16384 TT nConfigs: 880 compile time: 0:00:17.674805
profiling /root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py on GPU 0
RPL: on '241021_230215' from '/opt/rocm-6.2.0' in '/root/triton/python/perf-kernels/tools/tune_gemm'
RPL: profiling '"python" "/root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py"'
RPL: input file ''
RPL: output dir '/tmp/rpl_data_241021_230215_108023'
RPL: result dir '/tmp/rpl_data_241021_230215_108023/input_results_241021_230215'
ROCProfiler: input from "/tmp/rpl_data_241021_230215_108023/input.xml"
0 metrics
Traceback (most recent call last):
File "/opt/conda/lib/python3.11/site-packages/torch/cuda/__init__.py", line 332, in _lazy_init
queued_call()
File "/opt/conda/lib/python3.11/site-packages/torch/cuda/random.py", line 126, in cb
default_generator = torch.cuda.default_generators[i]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^
IndexError: tuple index out of range
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py", line 29961, in <module>
sys.exit(main())
^^^^^^
File "/root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py", line 29958, in main
test_gemm(16, 13312, 16384, rotating_buffer_size, 0)
File "/root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py", line 21131, in test_gemm
tensors = gen_rotating_tensors(M, N, K, 'fp16', False, 'fp16', False, 'fp16',
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/triton/python/perf-kernels/tools/tune_gemm/tune_gemm.py", line 333, in gen_rotating_tensors
in_a, in_a_fp16 = gen_input(M, K, dtype_a, need_Trans_a, 1, init_type, device='cuda')
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/triton/python/perf-kernels/tools/tune_gemm/tune_gemm.py", line 295, in gen_input
raw_data = init_by_size_and_type((N, M) if needTrans else (M, N), torch.float32, init_type)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/triton/python/perf-kernels/tools/tune_gemm/tune_gemm.py", line 290, in init_by_size_and_type
temp = torch.randn(size, dtype=dtype, device='cuda')
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/cuda/__init__.py", line 338, in _lazy_init
raise DeferredCudaCallError(msg) from e
torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: tuple index out of range
CUDA call was originally invoked at:
File "/root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py", line 29961, in <module>
sys.exit(main())
File "/root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py", line 29958, in main
test_gemm(16, 13312, 16384, rotating_buffer_size, 0)
File "/root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py", line 21131, in test_gemm
tensors = gen_rotating_tensors(M, N, K, 'fp16', False, 'fp16', False, 'fp16',
File "/root/triton/python/perf-kernels/tools/tune_gemm/tune_gemm.py", line 333, in gen_rotating_tensors
in_a, in_a_fp16 = gen_input(M, K, dtype_a, need_Trans_a, 1, init_type, device='cuda')
File "/root/triton/python/perf-kernels/tools/tune_gemm/tune_gemm.py", line 269, in gen_input
torch.manual_seed(seed)
File "/opt/conda/lib/python3.11/site-packages/torch/random.py", line 46, in manual_seed
torch.cuda.manual_seed_all(seed)
File "/opt/conda/lib/python3.11/site-packages/torch/cuda/random.py", line 129, in manual_seed_all
_lazy_call(cb, seed_all=True)
File "/opt/conda/lib/python3.11/site-packages/torch/cuda/__init__.py", line 256, in _lazy_call
_lazy_seed_tracker.queue_seed_all(callable, traceback.format_stack())
ROCPRofiler: 0 contexts collected, output directory /tmp/rpl_data_241021_230215_108023/input_results_241021_230215
running rocprof --stats -o results_0.csv python /root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py one more time
RPL: on '241021_230220' from '/opt/rocm-6.2.0' in '/root/triton/python/perf-kernels/tools/tune_gemm'
RPL: profiling '"python" "/root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py"'
RPL: input file ''
RPL: output dir '/tmp/rpl_data_241021_230220_108139'
RPL: result dir '/tmp/rpl_data_241021_230220_108139/input_results_241021_230220'
ROCProfiler: input from "/tmp/rpl_data_241021_230220_108139/input.xml"
0 metrics
Traceback (most recent call last):
File "/opt/conda/lib/python3.11/site-packages/torch/cuda/__init__.py", line 332, in _lazy_init
queued_call()
File "/opt/conda/lib/python3.11/site-packages/torch/cuda/random.py", line 126, in cb
default_generator = torch.cuda.default_generators[i]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^
IndexError: tuple index out of range
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py", line 29961, in <module>
sys.exit(main())
^^^^^^
File "/root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py", line 29958, in main
test_gemm(16, 13312, 16384, rotating_buffer_size, 0)
File "/root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py", line 21131, in test_gemm
tensors = gen_rotating_tensors(M, N, K, 'fp16', False, 'fp16', False, 'fp16',
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/triton/python/perf-kernels/tools/tune_gemm/tune_gemm.py", line 333, in gen_rotating_tensors
in_a, in_a_fp16 = gen_input(M, K, dtype_a, need_Trans_a, 1, init_type, device='cuda')
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/triton/python/perf-kernels/tools/tune_gemm/tune_gemm.py", line 295, in gen_input
raw_data = init_by_size_and_type((N, M) if needTrans else (M, N), torch.float32, init_type)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/triton/python/perf-kernels/tools/tune_gemm/tune_gemm.py", line 290, in init_by_size_and_type
temp = torch.randn(size, dtype=dtype, device='cuda')
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/cuda/__init__.py", line 338, in _lazy_init
raise DeferredCudaCallError(msg) from e
torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: tuple index out of range
CUDA call was originally invoked at:
File "/root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py", line 29961, in <module>
sys.exit(main())
File "/root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py", line 29958, in main
test_gemm(16, 13312, 16384, rotating_buffer_size, 0)
File "/root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py", line 21131, in test_gemm
tensors = gen_rotating_tensors(M, N, K, 'fp16', False, 'fp16', False, 'fp16',
File "/root/triton/python/perf-kernels/tools/tune_gemm/tune_gemm.py", line 333, in gen_rotating_tensors
in_a, in_a_fp16 = gen_input(M, K, dtype_a, need_Trans_a, 1, init_type, device='cuda')
File "/root/triton/python/perf-kernels/tools/tune_gemm/tune_gemm.py", line 269, in gen_input
torch.manual_seed(seed)
File "/opt/conda/lib/python3.11/site-packages/torch/random.py", line 46, in manual_seed
torch.cuda.manual_seed_all(seed)
File "/opt/conda/lib/python3.11/site-packages/torch/cuda/random.py", line 129, in manual_seed_all
_lazy_call(cb, seed_all=True)
File "/opt/conda/lib/python3.11/site-packages/torch/cuda/__init__.py", line 256, in _lazy_call
_lazy_seed_tracker.queue_seed_all(callable, traceback.format_stack())
ROCPRofiler: 0 contexts collected, output directory /tmp/rpl_data_241021_230220_108139/input_results_241021_230220
Process Process-1:
Traceback (most recent call last):
File "/root/triton/python/perf-kernels/tools/tune_gemm/utils/utils.py", line 36, in run_bash_command_wrapper
run_bash_command(commandstring, capture)
File "/root/triton/python/perf-kernels/tools/tune_gemm/utils/utils.py", line 47, in run_bash_command
proc = subprocess.run(commandstring, shell=True, check=True, executable='/bin/bash')
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/subprocess.py", line 571, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'rocprof --stats -o results_0.csv python /root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py' returned non-zero exit status 1.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/conda/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/opt/conda/lib/python3.11/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/root/triton/python/perf-kernels/tools/tune_gemm/./tune_gemm.py", line 194, in profile_batch_kernels
run_bash_command_wrapper(
File "/root/triton/python/perf-kernels/tools/tune_gemm/utils/utils.py", line 40, in run_bash_command_wrapper
run_bash_command(commandstring, capture)
File "/root/triton/python/perf-kernels/tools/tune_gemm/utils/utils.py", line 47, in run_bash_command
proc = subprocess.run(commandstring, shell=True, check=True, executable='/bin/bash')
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/subprocess.py", line 571, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'rocprof --stats -o results_0.csv python /root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py' returned non-zero exit status 1.
profile time: 0:00:11.341448
Traceback (most recent call last):
File "/root/triton/python/perf-kernels/tools/tune_gemm/./tune_gemm.py", line 714, in <module>
sys.exit(main())
^^^^^^
File "/root/triton/python/perf-kernels/tools/tune_gemm/./tune_gemm.py", line 644, in main
minTime, bestConfig, compile_time, profile_time, post_time = tune_gemm_config(
^^^^^^^^^^^^^^^^^
File "/root/triton/python/perf-kernels/tools/tune_gemm/./tune_gemm.py", line 242, in tune_gemm_config
df_prof = [pd.read_csv(f"results_{i}.csv") for i in range(jobs)]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/triton/python/perf-kernels/tools/tune_gemm/./tune_gemm.py", line 242, in <listcomp>
df_prof = [pd.read_csv(f"results_{i}.csv") for i in range(jobs)]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 1026, in read_csv
return _read(filepath_or_buffer, kwds)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 620, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 1620, in __init__
self._engine = self._make_engine(f, self.engine)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 1880, in _make_engine
self.handles = get_handle(
^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/pandas/io/common.py", line 873, in get_handle
handle = open(
^^^^^
FileNotFoundError: [Errno 2] No such file or directory: 'results_0.csv'
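For what it's worth, the root IndexError comes from torch.cuda.default_generators[i], which looks like PyTorch sees zero GPUs inside the process that rocprof launches. Below is a minimal sanity check I would run inside the container (just a sketch of what I would verify, not part of tune_gemm.py):

```python
# Quick check of what PyTorch sees inside the container / under rocprof.
# If device_count() prints 0, default_generators is empty and indexing it
# raises the same IndexError as in the trace above.
import torch

print("HIP version:", torch.version.hip)                    # None would indicate a CUDA-only PyTorch build
print("cuda.is_available():", torch.cuda.is_available())
print("cuda.device_count():", torch.cuda.device_count())
```

Running the generated profile_driver script directly with python (without rocprof) might also help tell whether the profiler environment is hiding the devices.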
Could you help me with this issue, please? I would like to tune the matmul so that I can continue working on AMD with Triton. Thanks.
Operating System
Ubuntu 22.04.4 LTS
CPU
AMD EPYC 9654 96-Core Processor
GPU
AMD Instinct MI300X
ROCm Version
ROCm 6.2.2, ROCm 6.2.0
ROCm Component
No response
Steps to Reproduce
Run the script inside a Docker container. My Python version is 3.11.10 and Triton is 3.0.0.
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response