There seems to be an issue with handling M=1?
```
root@smc300x-ccs-aus-GPUF292:/home/work/triton/scripts/amd/gemm# python tune_gemm.py --gemm_size_file memory_bound_sizes.yaml --ngpus 6 --jobs 24
Tuning 5 gemm sizes starts at: 2024-07-22 12:49:32.056419
SIZE: 1 8192 28672 TN nConfigs: 880
Traceback (most recent call last):
  File "/home/work/triton/scripts/amd/gemm/utils/../compile_driver.py", line 28215, in <module>
    sys.exit(main())
  File "/home/work/triton/scripts/amd/gemm/utils/../compile_driver.py", line 28212, in main
    compile_kernels(1, 8192, 28672, rotating_buffer_size, 1, numThreads)
  File "/home/work/triton/scripts/amd/gemm/utils/../compile_driver.py", line 26420, in compile_kernels
    stride_bias = tensors['bias'][0].stride(0) if bias_size > 0 else 0
IndexError: Dimension specified as 0 but tensor has no dimensions
Traceback (most recent call last):
  File "/home/work/triton/scripts/amd/gemm/tune_gemm.py", line 921, in <module>
    sys.exit(main())
  File "/home/work/triton/scripts/amd/gemm/tune_gemm.py", line 825, in main
    minTime, bestConfig, compile_time, profile_time, post_time = tune_gemm_config(
  File "/home/work/triton/scripts/amd/gemm/tune_gemm.py", line 233, in tune_gemm_config
    run_bash_command(f"python {fname} -n {num_threads}",
  File "/home/work/triton/scripts/amd/gemm/utils/utils.py", line 45, in run_bash_command
    proc = subprocess.run(commandstring,
  File "/opt/conda/envs/py_3.9/lib/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'python /home/work/triton/scripts/amd/gemm/utils/../compile_driver.py -n 32' returned non-zero exit status 1.
```
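For context, the failing line indexes the bias tensor and then asks for its stride; with M=1 the indexed element is a 0-d tensor, which has no dimensions to take a stride over. A minimal repro of that pattern, with an illustrative guard (the guard is hypothetical; the actual fix is the commit linked below):

```python
import torch

M = 1
bias = torch.randn(M)

# Indexing a length-1 tensor yields a 0-d tensor, and .stride(0) on a
# 0-d tensor raises:
#   IndexError: Dimension specified as 0 but tensor has no dimensions
elem = bias[0]
assert elem.dim() == 0

# Illustrative guard, not the actual fix:
stride_bias = elem.stride(0) if elem.dim() > 0 else 0
```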
This should be fixed by https://github.com/ROCm/triton/commit/1daec1ff13fb291571985f77ba345c02f3d4f83c
Is there anything that needs to change in the script one_config.py?
> Is there anything that needs to change in the script one_config.py?

No, since we don't change any API used by the script.
@xiaohuguo2023 @vgokhale @scxiao Re: `--gpu_ids` not working.
Compilation stage:
During compilation, each thread queries GPU info, such as `torch.version.hip` and `utils.get_device_properties()`, to start the compilation flow. These queries call HIP runtime functions, and since so many threads run in parallel, all GPUs show up as busy at the beginning of the compilation stage.
I tried setting `ROCR_VISIBLE_DEVICES=0` to force every thread onto GPU0, but it does not work; all GPUs still become busy.
This is not ideal, since compilation should not need any runtime functions. Therefore, I introduced a very hacky option, `--hack_triton_compiler`, which modifies the Triton front-end source code to provide a static backend so that the compilation flow can start without calling any runtime function; a sketch of the idea follows.
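As a rough illustration (this is a hypothetical sketch, not the actual patch; the names `hip_version`, `get_device_properties`, and the recorded values are assumptions), a static backend boils down to replacing runtime queries with values captured once offline:

```python
# Hypothetical sketch of a "static backend": compile-only workers read
# pre-recorded device info instead of calling into the HIP runtime,
# so they never touch (and never wake up) the GPUs.

STATIC_HIP_VERSION = "6.1.0"      # assumed value, captured offline
STATIC_DEVICE_PROPERTIES = {      # assumed fields, captured offline
    "arch": "gfx942",
    "warp_size": 64,
    "max_shared_mem": 65536,
}

def hip_version() -> str:
    # Stand-in for torch.version.hip that avoids initializing HIP.
    return STATIC_HIP_VERSION

def get_device_properties(device: int = 0) -> dict:
    # Stand-in for utils.get_device_properties(); returns recorded
    # properties instead of querying the driver.
    return STATIC_DEVICE_PROPERTIES
```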
Profiling stage:
This is very tricky. `--gpu_ids` actually works, but in a surprising way: the mapping from `ROCR_VISIBLE_DEVICES` to the GPU id reported by rocm-smi is not the identity function, but the following:
| ROCR_VISIBLE_DEVICES | GPU id |
|---|---|
| 0 | 3 |
| 1 | 2 |
| 2 | 0 |
| 3 | 1 |
| 4 | 7 |
| 5 | 6 |
| 6 | 4 |
| 7 | 5 |
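If you need to pin profiling to a particular physical GPU, a small helper like the one below could translate between the two numberings. The dictionary encodes the mapping observed above and is machine-specific, so re-derive it from rocm-smi on your own system; the helper name is made up for illustration:

```python
import os

# Mapping observed on this machine (see the table above); verify it
# against rocm-smi before relying on it elsewhere.
ROCR_TO_SMI = {0: 3, 1: 2, 2: 0, 3: 1, 4: 7, 5: 6, 6: 4, 7: 5}
SMI_TO_ROCR = {smi: rocr for rocr, smi in ROCR_TO_SMI.items()}

def pin_to_physical_gpu(smi_id: int) -> None:
    # Must run before the HIP runtime initializes in this process.
    os.environ["ROCR_VISIBLE_DEVICES"] = str(SMI_TO_ROCR[smi_id])
```

On this machine, `pin_to_physical_gpu(3)` would set `ROCR_VISIBLE_DEVICES=0`, matching the first row of the table.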
This could be some setting specific to my own docker container, so could you confirm whether this is also the case in your environment?
Another thing regarding the profiling stage: I found that invoking rocprof/rocprofv2 makes all GPUs busy for a very short period before the kernel starts executing. I suspect this is because rocprof/rocprofv2 queries information for all GPUs in the system. I'm not sure we can avoid this, but the GPU-busy time is insignificant in any case.
Yeah, I have a similar observation. This is my setting: `export ROCR_VISIBLE_DEVICES=0,1,2,3,4,5,6`, and this is my rocm-smi output:
```
xiaohugu@smc300x-ccs-aus-GPUF292:~/openai/triton_bench$ rocm-smi
=================================================== ROCm System Management Interface ===================================================
============================================================= Concise Info =============================================================
Device  Node  IDs             Temp        Power     Partitions          SCLK     MCLK     Fan  Perf              PwrCap  VRAM%  GPU%
              (DID, GUID)     (Junction)  (Socket)  (Mem, Compute, ID)
========================================================================================================================================
0       4     0x74a1, 8554    41.0°C      124.0W    NPS1, SPX, 0        249Mhz   900Mhz   0%   perf_determinism  750.0W  1%     1%
1       5     0x74a1, 19011   40.0°C      117.0W    NPS1, SPX, 0        151Mhz   900Mhz   0%   perf_determinism  750.0W  1%     1%
2       3     0x74a1, 30036   41.0°C      130.0W    NPS1, SPX, 0        233Mhz   900Mhz   0%   perf_determinism  750.0W  1%     3%
3       2     0x74a1, 23964   40.0°C      294.0W    NPS1, SPX, 0        1402Mhz  1300Mhz  0%   perf_determinism  750.0W  1%     26%
4       8     0x74a1, 1197    40.0°C      114.0W    NPS1, SPX, 0        158Mhz   900Mhz   0%   perf_determinism  750.0W  0%     0%
5       9     0x74a1, 41351   39.0°C      114.0W    NPS1, SPX, 0        146Mhz   900Mhz   0%   perf_determinism  750.0W  0%     0%
6       7     0x74a1, 26775   41.0°C      200.0W    NPS1, SPX, 0        430Mhz   1300Mhz  0%   perf_determinism  750.0W  1%     17%
7       6     0x74a1, 45536   38.0°C      117.0W    NPS1, SPX, 0        172Mhz   900Mhz   0%   perf_determinism  750.0W  1%     1%
========================================================================================================================================
========================================================= End of ROCm SMI Log ==========================================================
```
@xiaohuguo2023 Thanks for confirming. This is weird; I'll file a ticket for this issue. In the meantime, can we do a final round of review and merge this PR?
Please check the README for changes introduced in v3.3.
This PR enables
Example tuning session of 2 gemm sizes
The elapsed time of the kernel is very small, so hardware noise plays a larger role here. This example is meant to demonstrate the compilation time of the tuning process. One thing to note is that the second gemm's compilation time is much smaller than the first one's, indicating cache reuse between the two gemms.
cc @xiaohuguo2023 You can try this one on your large-sample stream-K tuning to see if it helps.