ROCm / triton

Development repository for the Triton language and compiler
MIT License

[tuning] gemm tuning script v3.3 #606

Closed zhanglx13 closed 2 months ago

zhanglx13 commented 3 months ago

Please check the README for changes introduced in v3.3.

This PR enables the v3.3 changes described in the README.

Example tuning session of 2 gemm sizes

~/AMD-triton/scripts/amd/gemm $ python tune_gemm.py --gemm_size_file gemm_config.yaml --ngpus 8 --jobs 32
Tuning 2 gemm sizes starts at: 2024-07-20 22:03:12.604555
SIZE: 512 512 512 TN nConfigs: 3824 TFLOPS: 60.47 time(us): 4.44 best_config: BM32_BN32_BK256_GM4_SK1_nW4_nS0_EU0_kP2_mfma16
>>> Elapsed time: 0:14:22.585976 = 0:01:24.212892 (compile) + 0:12:30.447794 (profile) + 0:00:27.836915 (post processing)
SIZE: 512 512 512 TN nConfigs: 3824 TFLOPS: 75.28 time(us): 3.57 best_config: BM64_BN16_BK128_GM1_SK1_nW4_nS0_EU0_kP2_mfma16
>>> Elapsed time: 0:12:35.324931 = 0:00:19.680196 (compile) + 0:11:52.336533 (profile) + 0:00:23.055614 (post processing)
Tuning ends at: 2024-07-20 22:30:11.100077
Total tuning time (h:m:s): 0:26:58.495522

The elapsed time of the kernel is very small, so hardware noise plays a larger role here. This example is meant to demonstrate the compilation time of the tuning process. One thing to note is that the second gemm's compilation time is much smaller than the first one's, indicating cache reuse between the two gemms.

cc @xiaohuguo2023 You can try this on your large-sample stream-K tuning to see if it helps.

xiaohuguo2023 commented 2 months ago

There seems to be an issue when dealing with M=1?

root@smc300x-ccs-aus-GPUF292:/home/work/triton/scripts/amd/gemm# python tune_gemm.py  --gemm_size_file memory_bound_sizes.yaml --ngpus 6 --jobs 24
Tuning 5 gemm sizes starts at: 2024-07-22 12:49:32.056419
SIZE: 1 8192 28672 TN nConfigs: 880 Traceback (most recent call last):
  File "/home/work/triton/scripts/amd/gemm/utils/../compile_driver.py", line 28215, in <module>
    sys.exit(main())
  File "/home/work/triton/scripts/amd/gemm/utils/../compile_driver.py", line 28212, in main
    compile_kernels(1, 8192, 28672, rotating_buffer_size, 1, numThreads)
  File "/home/work/triton/scripts/amd/gemm/utils/../compile_driver.py", line 26420, in compile_kernels
    stride_bias = tensors['bias'][0].stride(0) if bias_size > 0 else 0
IndexError: Dimension specified as 0 but tensor has no dimensions
Traceback (most recent call last):
  File "/home/work/triton/scripts/amd/gemm/tune_gemm.py", line 921, in <module>
    sys.exit(main())
  File "/home/work/triton/scripts/amd/gemm/tune_gemm.py", line 825, in main
    minTime, bestConfig, compile_time, profile_time, post_time = tune_gemm_config(
  File "/home/work/triton/scripts/amd/gemm/tune_gemm.py", line 233, in tune_gemm_config
    run_bash_command(f"python {fname} -n {num_threads}",
  File "/home/work/triton/scripts/amd/gemm/utils/utils.py", line 45, in run_bash_command
    proc = subprocess.run(commandstring,
  File "/opt/conda/envs/py_3.9/lib/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'python /home/work/triton/scripts/amd/gemm/utils/../compile_driver.py -n 32' returned non-zero exit status 1.
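The IndexError in the log above comes from calling `.stride(0)` on a 0-dimensional tensor. A minimal reproduction and guard, sketched under the assumption that the bias ends up as a scalar tensor when M=1 (the `bias` construction here is an illustrative stand-in, not the actual code in compile_driver.py):

```python
import torch

# When M=1 the bias can end up as a 0-dim (scalar) tensor. It has no
# dimensions to index, so .stride(0) raises IndexError.
bias = torch.tensor(0.0)  # illustrative stand-in for tensors['bias'][0]
try:
    bias.stride(0)
except IndexError as e:
    print(type(e).__name__)  # IndexError

# Guarding on the dimensionality avoids the crash:
stride_bias = bias.stride(0) if bias.dim() > 0 else 0
print(stride_bias)  # 0
```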
zhanglx13 commented 2 months ago

> There seems to be an issue when dealing with M=1?

This should be fixed with https://github.com/ROCm/triton/commit/1daec1ff13fb291571985f77ba345c02f3d4f83c

scxiao commented 2 months ago

Is there anything that needs to change in the script one_config.py?

zhanglx13 commented 2 months ago

> Is there anything that needs to change in the script one_config.py?

No, since we don't change any of the script's API.

zhanglx13 commented 2 months ago

@xiaohuguo2023 @vgokhale @scxiao Regarding --gpu_ids not working:

Compilation stage:

During compilation, each thread queries GPU info, such as torch.version.hip and utils.get_device_properties(), to start the compilation flow. These queries invoke HIP runtime functions, and since so many threads run in parallel, all GPUs appear busy at the beginning of the compilation stage.

I tried setting ROCR_VISIBLE_DEVICES=0 to force everyone to use GPU0, but it does not work; all GPUs are still busy. This is also not ideal, since compilation should not need any runtime functions.

Therefore, I introduced a rather hacky option, --hack_triton_compiler, which modifies the Triton front-end source code to provide a static backend, so that the compilation flow can start without calling any runtime function.
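The idea can be sketched as follows (all names here are hypothetical, not Triton's actual internals): instead of letting every compile thread call into the HIP runtime for device properties, query them once up front and serve the cached result statically.

```python
# Sketch of the static-backend idea behind --hack_triton_compiler.
# STATIC_TARGET and get_device_properties_static are illustrative names,
# not part of Triton's real API; the values below are assumptions.

STATIC_TARGET = {
    "arch": "gfx942",   # queried once, up front, on a single GPU
    "warp_size": 64,
    "num_cus": 304,
}

def get_device_properties_static(device_id=0):
    # Every compile thread reads the cached properties instead of
    # issuing its own HIP runtime call, so no GPU is touched here.
    return STATIC_TARGET

# Each compile job can now run fully offline:
props = get_device_properties_static()
print(props["arch"])  # gfx942
```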

Profiling stage:

This is very tricky. --gpu_ids actually works, but in a surprising way: the mapping from ROCR_VISIBLE_DEVICES to the GPU id reported by rocm-smi is not the identity, but the following:

| ROCR_VISIBLE_DEVICES | GPU id (rocm-smi) |
|----------------------|-------------------|
| 0                    | 3                 |
| 1                    | 2                 |
| 2                    | 0                 |
| 3                    | 1                 |
| 4                    | 7                 |
| 5                    | 6                 |
| 6                    | 4                 |
| 7                    | 5                 |
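One way a tuning driver could cope with this non-identity mapping is to translate user-facing rocm-smi GPU ids into ROCR_VISIBLE_DEVICES indices before launching profiling jobs. A sketch using the mapping observed above (the mapping itself is environment-specific and would need to be re-measured per system; `visible_devices_for` is a hypothetical helper, not part of tune_gemm.py):

```python
import os

# ROCR_VISIBLE_DEVICES index -> rocm-smi GPU id, as observed on this machine.
# This mapping is environment-specific; re-measure it per system.
VISIBLE_TO_SMI = {0: 3, 1: 2, 2: 0, 3: 1, 4: 7, 5: 6, 6: 4, 7: 5}

# Invert it so a user-facing --gpu_ids list (rocm-smi numbering) can be
# translated into the runtime's numbering.
SMI_TO_VISIBLE = {smi: vis for vis, smi in VISIBLE_TO_SMI.items()}

def visible_devices_for(gpu_ids):
    """Translate rocm-smi GPU ids into a ROCR_VISIBLE_DEVICES string."""
    return ",".join(str(SMI_TO_VISIBLE[g]) for g in gpu_ids)

# e.g. the user asks for rocm-smi GPUs 0 and 1:
env = dict(os.environ, ROCR_VISIBLE_DEVICES=visible_devices_for([0, 1]))
print(env["ROCR_VISIBLE_DEVICES"])  # 2,3
```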

This could be some settings in my own docker, so could you confirm if this is also the case in your environment?

Another thing regarding the profiling stage: I found that invoking rocprof/rocprofv2 makes all GPUs busy for a very short period before the kernels start executing. I suspect this is because rocprof/rocprofv2 queries all GPU information in the system. I'm not sure if we can avoid this, but the GPU busy time is definitely insignificant.

xiaohuguo2023 commented 2 months ago

Yeah, I have a similar observation. This is my setting

export ROCR_VISIBLE_DEVICES=0,1,2,3,4,5,6

and this is my rocm-smi output

xiaohugu@smc300x-ccs-aus-GPUF292:~/openai/triton_bench$ rocm-smi

=================================================== ROCm System Management Interface ===================================================
============================================================= Concise Info =============================================================
Device  Node  IDs              Temp        Power     Partitions          SCLK     MCLK     Fan  Perf              PwrCap  VRAM%  GPU%
              (DID,     GUID)  (Junction)  (Socket)  (Mem, Compute, ID)
========================================================================================================================================
0       4     0x74a1,   8554   41.0°C      124.0W    NPS1, SPX, 0        249Mhz   900Mhz   0%   perf_determinism  750.0W  1%     1%
1       5     0x74a1,   19011  40.0°C      117.0W    NPS1, SPX, 0        151Mhz   900Mhz   0%   perf_determinism  750.0W  1%     1%
2       3     0x74a1,   30036  41.0°C      130.0W    NPS1, SPX, 0        233Mhz   900Mhz   0%   perf_determinism  750.0W  1%     3%
3       2     0x74a1,   23964  40.0°C      294.0W    NPS1, SPX, 0        1402Mhz  1300Mhz  0%   perf_determinism  750.0W  1%     26%
4       8     0x74a1,   1197   40.0°C      114.0W    NPS1, SPX, 0        158Mhz   900Mhz   0%   perf_determinism  750.0W  0%     0%
5       9     0x74a1,   41351  39.0°C      114.0W    NPS1, SPX, 0        146Mhz   900Mhz   0%   perf_determinism  750.0W  0%     0%
6       7     0x74a1,   26775  41.0°C      200.0W    NPS1, SPX, 0        430Mhz   1300Mhz  0%   perf_determinism  750.0W  1%     17%
7       6     0x74a1,   45536  38.0°C      117.0W    NPS1, SPX, 0        172Mhz   900Mhz   0%   perf_determinism  750.0W  1%     1%
========================================================================================================================================
========================================================= End of ROCm SMI Log ==========================================================
zhanglx13 commented 2 months ago

@xiaohuguo2023 Thanks for the confirmation. This is weird; I'll file a ticket for this issue. In the meantime, can we do a final round of review and merge this PR?