ROCm / triton

Development repository for the Triton language and compiler
MIT License

[Issue]: Error when running tune_gemm.py #650

Open remi-or opened 1 month ago

remi-or commented 1 month ago

Problem Description

Hi, I am getting an error when running the tune_gemm.py script. I am inside a Docker container with access to 8 AMD MI300X GPUs (all of them show up in rocm-smi), and I have no problem running Triton otherwise. The command I am running is:

./tune_gemm.py --gemm_size_file input.yaml --ngpus 8 --jobs 32 --verbose

The content of input.yaml is:

- {'M': 16, 'N': 13312, 'K': 16384, 'rowMajorA': 'T', 'rowMajorB': 'T'}

Here is the produced stack trace (captured with --jobs 1 to reduce its size, but it is similar with --jobs 32):

SIZE: 16 13312 16384 TT nConfigs: 880 compile time: 0:00:17.674805                                                                                                                                                                     
profiling /root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py on GPU 0                                                                                                                        
RPL: on '241021_230215' from '/opt/rocm-6.2.0' in '/root/triton/python/perf-kernels/tools/tune_gemm'                                                                                                                                   
RPL: profiling '"python" "/root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py"'                                                                                                               
RPL: input file ''                                                                                                                                                                                                                     
RPL: output dir '/tmp/rpl_data_241021_230215_108023'                                                                                                                                                                                   
RPL: result dir '/tmp/rpl_data_241021_230215_108023/input_results_241021_230215'                                                                                                                                                       
ROCProfiler: input from "/tmp/rpl_data_241021_230215_108023/input.xml"                                                                                                                                                                 
  0 metrics                                                                                                                                                                                                                            
Traceback (most recent call last):                                                                                                                                                                                                     
  File "/opt/conda/lib/python3.11/site-packages/torch/cuda/__init__.py", line 332, in _lazy_init                                                                                                                                       
    queued_call()                                                                                                                                                                                                                      
  File "/opt/conda/lib/python3.11/site-packages/torch/cuda/random.py", line 126, in cb                                                                                                                                                 
    default_generator = torch.cuda.default_generators[i]                                                                                                                                                                               
                        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^                                                                                                                                                                               
IndexError: tuple index out of range                                                                                                                                                                                                   

The above exception was the direct cause of the following exception:                                                                                                                                                                   

Traceback (most recent call last):                                                                                                                                                                                                     
  File "/root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py", line 29961, in <module>                                                                                                         
    sys.exit(main())                                                                                                                                                                                                                   
             ^^^^^^                                                                                                                                                                                                                    
  File "/root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py", line 29958, in main                                                                                                             
    test_gemm(16, 13312, 16384, rotating_buffer_size, 0)                                                                                                                                                                               
  File "/root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py", line 21131, in test_gemm                                                                                                        
    tensors = gen_rotating_tensors(M, N, K, 'fp16', False, 'fp16', False, 'fp16',                                                                                                                                                      
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                      
  File "/root/triton/python/perf-kernels/tools/tune_gemm/tune_gemm.py", line 333, in gen_rotating_tensors                                                                                                                              
    in_a, in_a_fp16 = gen_input(M, K, dtype_a, need_Trans_a, 1, init_type, device='cuda')                                                                                                                                              
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                              
  File "/root/triton/python/perf-kernels/tools/tune_gemm/tune_gemm.py", line 295, in gen_input                                                                                                                                         
    raw_data = init_by_size_and_type((N, M) if needTrans else (M, N), torch.float32, init_type)                                                                                                                                        
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                        
  File "/root/triton/python/perf-kernels/tools/tune_gemm/tune_gemm.py", line 290, in init_by_size_and_type                                                                                                                             
    temp = torch.randn(size, dtype=dtype, device='cuda')                                                                                                                                                                               
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                               
  File "/opt/conda/lib/python3.11/site-packages/torch/cuda/__init__.py", line 338, in _lazy_init                                                                                                                                       
    raise DeferredCudaCallError(msg) from e                                                                                                                                                                                            
torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: tuple index out of range                                                                                                                       

CUDA call was originally invoked at:                                                                                                                                                                                                   

  File "/root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py", line 29961, in <module>                                                                                                         
    sys.exit(main())                                                                                                                                                                                                                   
  File "/root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py", line 29958, in main                                                                                                             
    test_gemm(16, 13312, 16384, rotating_buffer_size, 0)                                                                                                                                                                               
  File "/root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py", line 21131, in test_gemm                                                                                                        
    tensors = gen_rotating_tensors(M, N, K, 'fp16', False, 'fp16', False, 'fp16',                                                                                                                                                      
  File "/root/triton/python/perf-kernels/tools/tune_gemm/tune_gemm.py", line 333, in gen_rotating_tensors                                                                                                                              
    in_a, in_a_fp16 = gen_input(M, K, dtype_a, need_Trans_a, 1, init_type, device='cuda')                                                                                                                                              
  File "/root/triton/python/perf-kernels/tools/tune_gemm/tune_gemm.py", line 269, in gen_input                                                                                                                                         
    torch.manual_seed(seed)                                                                                                                                                                                                            
  File "/opt/conda/lib/python3.11/site-packages/torch/random.py", line 46, in manual_seed                                                                                                                                              
    torch.cuda.manual_seed_all(seed)                                                                                                                                                                                                   
  File "/opt/conda/lib/python3.11/site-packages/torch/cuda/random.py", line 129, in manual_seed_all                                                                                                                                    
    _lazy_call(cb, seed_all=True)                                                                                                                                                                                                      
  File "/opt/conda/lib/python3.11/site-packages/torch/cuda/__init__.py", line 256, in _lazy_call                                                                                                                                       
    _lazy_seed_tracker.queue_seed_all(callable, traceback.format_stack())   

ROCPRofiler: 0 contexts collected, output directory /tmp/rpl_data_241021_230215_108023/input_results_241021_230215                                                                                                                     
running rocprof --stats -o results_0.csv python /root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py one more time                                                                             
RPL: on '241021_230220' from '/opt/rocm-6.2.0' in '/root/triton/python/perf-kernels/tools/tune_gemm'                                                                                                                                   
RPL: profiling '"python" "/root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py"'                                                                                                               
RPL: input file ''                                                                                                                                                                                                                     
RPL: output dir '/tmp/rpl_data_241021_230220_108139'                                                                                                                                                                                   
RPL: result dir '/tmp/rpl_data_241021_230220_108139/input_results_241021_230220'                                                                                                                                                       
ROCProfiler: input from "/tmp/rpl_data_241021_230220_108139/input.xml"                                                                                                                                                                 
  0 metrics                                                                                                                                                                                                                            
Traceback (most recent call last):                                                                                                                                                                                                     
  File "/opt/conda/lib/python3.11/site-packages/torch/cuda/__init__.py", line 332, in _lazy_init                                                                                                                                       
    queued_call()                                                                                                                                                                                                                      
  File "/opt/conda/lib/python3.11/site-packages/torch/cuda/random.py", line 126, in cb                                                                                                                                                 
    default_generator = torch.cuda.default_generators[i]                                                                                                                                                                               
                        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^                                                                                                                                                                               
IndexError: tuple index out of range                                                                                                                                                                                                   

The above exception was the direct cause of the following exception:                                                                                                                                                                   

Traceback (most recent call last):                                                                                                                                                                                                     
  File "/root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py", line 29961, in <module>                                                                                                         
    sys.exit(main())                                                                                                                                                                                                                   
             ^^^^^^                                                                                                                                                                                                                    
  File "/root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py", line 29958, in main                                                                                                             
    test_gemm(16, 13312, 16384, rotating_buffer_size, 0)                                                                                                                                                                               
  File "/root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py", line 21131, in test_gemm                                                                                                        
    tensors = gen_rotating_tensors(M, N, K, 'fp16', False, 'fp16', False, 'fp16',                                                                                                                                                      
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                      
  File "/root/triton/python/perf-kernels/tools/tune_gemm/tune_gemm.py", line 333, in gen_rotating_tensors                                                                                                                              
    in_a, in_a_fp16 = gen_input(M, K, dtype_a, need_Trans_a, 1, init_type, device='cuda')                                                                                                                                              
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                              
  File "/root/triton/python/perf-kernels/tools/tune_gemm/tune_gemm.py", line 295, in gen_input                                                                                                                                         
    raw_data = init_by_size_and_type((N, M) if needTrans else (M, N), torch.float32, init_type)                                                                                                                                        
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                        
  File "/root/triton/python/perf-kernels/tools/tune_gemm/tune_gemm.py", line 290, in init_by_size_and_type                                                                                                                             
    temp = torch.randn(size, dtype=dtype, device='cuda')                                                                                                                                                                               
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                               
  File "/opt/conda/lib/python3.11/site-packages/torch/cuda/__init__.py", line 338, in _lazy_init                                                                                                                                       
    raise DeferredCudaCallError(msg) from e                                                                                                                                                                                            
torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: tuple index out of range                                                                                                                       

CUDA call was originally invoked at:                                                                                                                                                                                                   

  File "/root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py", line 29961, in <module>                                                                                                         
    sys.exit(main())                                                                                                                                                                                                                   
  File "/root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py", line 29958, in main                                                                                                             
    test_gemm(16, 13312, 16384, rotating_buffer_size, 0)                                                                                                                                                                               
  File "/root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py", line 21131, in test_gemm                                                                                                        
    tensors = gen_rotating_tensors(M, N, K, 'fp16', False, 'fp16', False, 'fp16',                                                                                                                                                      
  File "/root/triton/python/perf-kernels/tools/tune_gemm/tune_gemm.py", line 333, in gen_rotating_tensors                                                                                                                              
    in_a, in_a_fp16 = gen_input(M, K, dtype_a, need_Trans_a, 1, init_type, device='cuda')                                                                                                                                              
  File "/root/triton/python/perf-kernels/tools/tune_gemm/tune_gemm.py", line 269, in gen_input
    torch.manual_seed(seed)
  File "/opt/conda/lib/python3.11/site-packages/torch/random.py", line 46, in manual_seed
    torch.cuda.manual_seed_all(seed)
  File "/opt/conda/lib/python3.11/site-packages/torch/cuda/random.py", line 129, in manual_seed_all
    _lazy_call(cb, seed_all=True)
  File "/opt/conda/lib/python3.11/site-packages/torch/cuda/__init__.py", line 256, in _lazy_call
    _lazy_seed_tracker.queue_seed_all(callable, traceback.format_stack())

ROCPRofiler: 0 contexts collected, output directory /tmp/rpl_data_241021_230220_108139/input_results_241021_230220
Process Process-1:
Traceback (most recent call last):
  File "/root/triton/python/perf-kernels/tools/tune_gemm/utils/utils.py", line 36, in run_bash_command_wrapper
    run_bash_command(commandstring, capture)
  File "/root/triton/python/perf-kernels/tools/tune_gemm/utils/utils.py", line 47, in run_bash_command
    proc = subprocess.run(commandstring, shell=True, check=True, executable='/bin/bash')
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'rocprof --stats -o results_0.csv python /root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py' returned non-zero exit status 1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/root/triton/python/perf-kernels/tools/tune_gemm/./tune_gemm.py", line 194, in profile_batch_kernels
    run_bash_command_wrapper(
  File "/root/triton/python/perf-kernels/tools/tune_gemm/utils/utils.py", line 40, in run_bash_command_wrapper
    run_bash_command(commandstring, capture)
  File "/root/triton/python/perf-kernels/tools/tune_gemm/utils/utils.py", line 47, in run_bash_command
    proc = subprocess.run(commandstring, shell=True, check=True, executable='/bin/bash')
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'rocprof --stats -o results_0.csv python /root/triton/python/perf-kernels/tools/tune_gemm/utils/../profile_driver_16x13312x16384_0.py' returned non-zero exit status 1.
profile time: 0:00:11.341448
Traceback (most recent call last):
  File "/root/triton/python/perf-kernels/tools/tune_gemm/./tune_gemm.py", line 714, in <module>
    sys.exit(main())
             ^^^^^^
  File "/root/triton/python/perf-kernels/tools/tune_gemm/./tune_gemm.py", line 644, in main
    minTime, bestConfig, compile_time, profile_time, post_time = tune_gemm_config(
                                                                 ^^^^^^^^^^^^^^^^^
  File "/root/triton/python/perf-kernels/tools/tune_gemm/./tune_gemm.py", line 242, in tune_gemm_config
    df_prof = [pd.read_csv(f"results_{i}.csv") for i in range(jobs)]
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/triton/python/perf-kernels/tools/tune_gemm/./tune_gemm.py", line 242, in <listcomp>
    df_prof = [pd.read_csv(f"results_{i}.csv") for i in range(jobs)]
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 1026, in read_csv
    return _read(filepath_or_buffer, kwds)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 620, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 1620, in __init__
    self._engine = self._make_engine(f, self.engine)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 1880, in _make_engine
    self.handles = get_handle(
                   ^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/pandas/io/common.py", line 873, in get_handle
    handle = open(
             ^^^^^
FileNotFoundError: [Errno 2] No such file or directory: 'results_0.csv'

Could you help me with this issue, please? I would like to tune the matmul so I can keep working on AMD GPUs with Triton. Thanks.
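For reference, the underlying IndexError comes from torch.cuda.default_generators[i], which suggests that when the driver script is launched under rocprof, PyTorch sees fewer devices than torch.cuda.manual_seed_all expected at lazy-initialization time. Below is a minimal diagnostic sketch (my own check, not part of tune_gemm.py) that prints what PyTorch sees inside the container; whether the listed ROCm visibility variables are set in this environment is only an assumption:

# check_gpu_visibility.py -- minimal diagnostic sketch, not part of tune_gemm.py
import os
import torch

# Device-visibility variables commonly used on ROCm; they may or may not be set here.
for var in ("HIP_VISIBLE_DEVICES", "ROCR_VISIBLE_DEVICES", "CUDA_VISIBLE_DEVICES"):
    print(var, "=", os.environ.get(var))

print("torch.version.hip      :", torch.version.hip)
print("cuda.is_available()    :", torch.cuda.is_available())
print("cuda.device_count()    :", torch.cuda.device_count())
# The traceback above fails while indexing this tuple, so its length is the key number.
print("len(default_generators):", len(torch.cuda.default_generators))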

Operating System

Ubuntu 22.04.4 LTS

CPU

AMD EPYC 9654 96-Core Processor

GPU

AMD Instinct MI300X

ROCm Version

ROCm 6.2.2, ROCm 6.2.0

ROCm Component

No response

Steps to Reproduce

Run the tune_gemm.py command above inside a Docker container. My Python version is 3.11.10 and Triton is 3.0.0.

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

zhanglx13 commented 1 month ago

I just tried tune_gemm with your input YAML file and everything works. Can you provide more information about your Docker environment? Like