ROCm / triton

Development repository for the Triton language and compiler
MIT License

Bug for the scripts/tune_gemm.py #447

Closed ybai62868 closed 7 months ago

ybai62868 commented 8 months ago

Thanks for the good work!

When I use /scripts/amd/gemm/tune_gemm.py to tune some specific GEMM sizes for benchmarking and set m=17, n=5120, k=5120, it takes a very long time (more than a day) and still does not produce the final yaml file for this GEMM shape. Using 'htop' to check the pid, I found it is running "generated_kernel17-5120-5120-0.py". Is there a bug in that part?

Thanks, Yang

ybai62868 commented 8 months ago

It also produces this error report: subprocess.CalledProcessError: Command 'python3.10 generated_kernel17-5120-5120-0.py -n 64' died with <Signals.SIGKILL: 9>

ybai62868 commented 8 months ago

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/miniconda3/envs/triton/lib/python3.10/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/root/miniconda3/envs/triton/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/root/miniconda3/envs/triton/lib/python3.10/multiprocessing/pool.py", line 136, in worker
    put((job, i, (False, wrapped)))
  File "/root/miniconda3/envs/triton/lib/python3.10/multiprocessing/queues.py", line 378, in put
    self._writer.send_bytes(obj)
  File "/root/miniconda3/envs/triton/lib/python3.10/multiprocessing/connection.py", line 205, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/root/miniconda3/envs/triton/lib/python3.10/multiprocessing/connection.py", line 416, in _send_bytes
    self._send(header + buf)
  File "/root/miniconda3/envs/triton/lib/python3.10/multiprocessing/connection.py", line 373, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

zhanglx13 commented 8 months ago

@ybai62868 Thank you for trying it out. I have seen this error before on some nodes. What GPU are you using?

ybai62868 commented 8 months ago

Thanks for your reply. I am using an MI210.


zhanglx13 commented 8 months ago

@ybai62868 It should be fixed. Can you give it another shot?

ybai62868 commented 8 months ago

Hi @zhanglx13,

Thanks for the quick reply. After changing the code and adding the skip condition in tune_gemm.py, the program still seems to hang for a very long time, and I cannot get the tuning results for m=17, n=5120, k=5120 on the MI210. Can you get the results in a reasonable time?

Thanks!

zhanglx13 commented 7 months ago

This is my command and output

python ./tune_gemm.py --gemm_size_file one_gemm.yaml --ngpus 8
Tuning starts at: 2024-01-10 08:18:10.218710
SIZE: 17 5120 5120 nConfigs: 11520 TFLOPS: 16.53 time(us): 53.919 best_config: M17_N5120_K5120_BM16_BN128_BK32_GM1_SK4_nW4_nS0_EU2 
>>> Elapsed time: 0:15:17.320218 = 0:14:15.022247 (compile) + 0:00:29.963957 (profile) + 0:00:29.229575 (post processing)
Tuning ends at: 2024-01-10 08:33:27.539272
Total tuning time (h:m:s): 0:15:17.320562

It takes about 15 minutes with 8 GPUs to tune all 11520 configs. I am using an MI250X node, so you can expect slightly worse numbers on MI210. If you want to do a quick tuning, you can change waves_per_eu_range to [0] (https://github.com/ROCmSoftwarePlatform/triton/blob/ce9dacec725b45ec2213e2cc3c79dacdea6dcc1e/scripts/amd/gemm/tune_gemm.py#L30), which reduces the total number of configs to 2305.
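
For reference, a minimal sketch of that edit (the values shown for the full sweep are hypothetical; check the linked line for the actual defaults in your checkout):

# scripts/amd/gemm/tune_gemm.py: shrink the waves_per_eu sweep for a quick run
# waves_per_eu_range = [0, 1, 2, 4]   # hypothetical full sweep; see the linked line
waves_per_eu_range = [0]              # quick tuning: only try the default value (0)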

You can also add --verbose to see if it hangs at a particular step.

ybai62868 commented 7 months ago

Cool, thanks for your help! I only have one MI210 GPU. I will double-check it!


zhanglx13 commented 7 months ago

@ybai62868 Just reopen this ticket if it doesn't work for you.

ybai62868 commented 7 months ago

OK! I got the correct tuning configs for each GEMM operator, and while exploring the performance of each kernel generated by Triton I found some important tuning knobs, such as num_warps and waves_per_eu. I only have experience with CUDA programming, where num_warps is the number of warps in each threadblock and each warp has 32 threads. In AMD programming, it seems that each wavefront has 64 threads. Therefore, the number of EUs used equals (64 * num_warps / waves_per_eu). Am I right? Thanks!

zhanglx13 commented 7 months ago

@ybai62868 I made several changes to tune_gemm.py recently; I hope you are trying the latest version. Let me explain the waves_per_eu knob. AMD GPUs consist of CUs (compute units), which are the counterpart of SMs on NV GPUs. In each CU there are 4 SIMD units (also called EUs, execution engines). You can think of a SIMD unit as a vector execution unit, which has a number of registers and ALUs to do the computation. When you launch a grid, workgroups (threadblocks on NV GPUs) are scheduled on CUs. Within a CU, wavefronts (warps on NV GPUs) are scheduled on SIMD units. Here comes the concept of occupancy, which is the number of wavefronts that can run concurrently on each SIMD unit. This depends on how much resource each wavefront requires and how much resource each SIMD unit has.

The waves_per_eu parameter focuses on register usage. For example, suppose each SIMD (EU) has 512 registers and each wavefront requires 256 registers; then the occupancy is 2. But if we set waves_per_eu=3, the compiler will try to reduce the register usage per wavefront to 170, so that the occupancy can be 3. There is, however, a risk and penalty of register spilling. So increasing waves_per_eu can increase occupancy, but does not necessarily increase performance.
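
As a rough illustration of that arithmetic (plain Python, using the 512-register figure from the example above rather than a queried hardware value):

# Occupancy limited by register usage, as described above.
REGS_PER_SIMD = 512                     # example figure from the explanation

def occupancy(regs_per_wavefront):
    # Number of wavefronts that fit on one SIMD unit (EU), register-limited.
    return REGS_PER_SIMD // regs_per_wavefront

print(occupancy(256))                   # -> 2
# waves_per_eu=3 asks the compiler to target at most
# REGS_PER_SIMD // 3 = 170 registers per wavefront, at the risk of spilling.
print(REGS_PER_SIMD // 3)               # -> 170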

ybai62868 commented 7 months ago

Thank you for the information!


ybai62868 commented 7 months ago

Hi, another question! If I want to use a kernel generated by triton-amd in other projects, how do I do that? Can I export it to C++ code with the specific optimizations applied, such as m,n,k blocking, L2 cache hit rate improvement, or the optimizations I mentioned in the previous post? Essentially, I want to see the final code. It may not be in C++ form, because Triton lowers the Python DSL to LLVM IR. So I am curious how to use the code generated by Triton in my personal project.

Thanks, Yang


zhanglx13 commented 7 months ago

@ybai62868 Sorry, I'm not able to give you a clear direction on how to do it. But this is not an AMD-specific question. You can ask the OpenAI folks; they must have experience with this kind of integration. https://github.com/openai/triton/issues

ybai62868 commented 7 months ago

Thanks for the tip!


ybai62868 commented 7 months ago

Hi! @zhanglx13 Recently I have been using four MI210 GPUs to tune GEMM performance. I set "--ngpus 4" and used "rocm-smi" to watch the status of each GPU. I found that only the first GPU (id = 0) is running and the other 3 GPUs seem idle. Am I right? I need multiple cards to accelerate the tuning process.

Looking forward to your reply!

Thanks,

zhanglx13 commented 7 months ago

@ybai62868 Recently I've added --jobs to indicate how to partition the tuning space. The default value is 1. You can try --jobs 8 --ngpus 4; I usually set --jobs to a multiple of ngpus.
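
For example, a hypothetical invocation combining the flags mentioned in this thread (substitute your own gemm size file):

python tune_gemm.py --gemm_size_file my_sizes.yaml --jobs 8 --ngpus 4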

ybai62868 commented 7 months ago

Thanks, I'll check them out.


ybai62868 commented 7 months ago

Hi @zhanglx13,

Another question: I use --o to specify the output yaml file that stores all the performance tuning results. Tuning takes me a very long time, and I have finished many GEMM sizes, but when I open the file, I find nothing in the yaml file I specified. I re-ran the experiments with a limited set of sizes and found that the final output yaml file only appears after all of the experiments are done. How can I recover the results I produced before? I think they may be stored somewhere I can find manually.

Thanks for your information!

zhanglx13 commented 7 months ago

@ybai62868 The tuning results are written to the output file only when the tuning finishes, or when you terminate the tuning with Ctrl+C. The file is kept open until the whole tuning process is done, so you cannot see anything in it before then. Another option is to redirect stdout to a log file, like

python tune_gemm.py --gemm_size_file xxx.yaml --jobs 24 --ngpus 8 --compare | tee -a log.txt
ybai62868 commented 7 months ago

Thanks for your help! I also find that "config: matmul_kernel_M983_N5120_K5120_BM256_BN256_BK64_GM1_SK1_nW8_nS0_EU4" does not work. It tunes for a very long time without finishing. Is this a bug in the code, i.e. some conditions not considered in config pruning?

ybai62868 commented 7 months ago

Another question about your new scripts.

ybai62868 commented 7 months ago
raw_data = init_by_size_and_type((N,M) if needTrans else (M,N), torch.float32, init_type)
if needTrans:
    raw_data = raw_data.T
ybai62868 commented 7 months ago

It seems that the matrix is not transposed?

ybai62868 commented 7 months ago

Sorry to bother you again! I also want to know the meaning of T and N for rowMajorA, because when I run some cases with NN instead of TN, the program hangs for a very long time! Thanks a lot!

zhanglx13 commented 7 months ago

And I also find that "config: matmul_kernel_M983_N5120_K5120_BM256_BN256_BK64_GM1_SK1_nW8_nS0_EU4" can not work

What do you mean by "it does not work"? First of all, if you want to tune a gemm size M983xN5120xK5120, you only need to provide the gemm size, like

python tune_gemm.py -m 983 -n 5120 -k 5120

And the script will tell you how many configs there are in the pruned tuning space. Here is what I got:

SIZE: 983 5120 5120 TN nConfigs: 1912

Given such a large tuning space and gemm size, it is expected that the tuning will run for a while. I used --jobs 64 --ngpus 16 on an MI250X, and here is what I got:

Tuning 1 gemm sizes starts at: 2024-01-31 08:18:13.552302
SIZE: 983 5120 5120 TN nConfigs: 1912 TFLOPS: 105.95 time(us): 486.43 best_config: M983_N5120_K5120_BM128_BN128_BK64_GM4_SK1_nW8_nS0_EU0_mfma32 correctness: Correct✅
>>> Elapsed time: 0:17:30.312429 = 0:13:42.122296 (compile) + 0:03:40.977749 (profile) + 0:00:04.906987 (post processing)
Tuning ends at: 2024-01-31 08:35:43.865217
Total tuning time (h:m:s): 0:17:30.312915

It takes 17 minutes with 16 gpus. If you only have 4 gpus, it is not unreasonable for the tuning to take 1 hour.

If you want to check the correctness of this config, you need to put the following in a yaml file

- {'M': 983, 'N': 5120, 'K': 5120, 'rowMajorA': 'T', 'rowMajorB': 'N', 'BLOCK_SIZE_M': 256, 'BLOCK_SIZE_N': 256, 'BLOCK_SIZE_K': 64, 'GROUP_SIZE_M': 1, 'SPLIT_K': 1, 'num_warps': 8, 'num_stages': 0, 'waves_per_eu': 0, 'matrix_instr_nonkdim': 32}

And check correctness like

python tune_gemm.py --gemm_size_file xxx.yaml --compare_wo_tuning

It seems that the matrix is not transposed?

It's very tricky when it comes to transpose. raw_data.T does not transpose the data in memory. If the tensor was MxN and row major, i.e. the N dim is the contiguous dim, it becomes NxM and col major, since the N dim is still the contiguous dim. In the script, we use raw_data.T to change the logical order of the tensor without changing data in memory.

I also want to know the meaning of T and N for rowMajowA

It means matrix A is row major if T, and column major if N. We always assume matrix A has shape M x K and matrix B has shape K x N, since this is how we do matrix multiplication on paper. Then rowMajorA and rowMajorB tell you which dimension is the contiguous dimension in memory.

The program will hang for a very long time

Other cases are expected to be slower than TN. It seems that you keep running into the hanging issue. I usually do the following if it takes longer than I expect:

  • turn on --verbose to check the progress of the tuning process
  • increase --jobs to make each generated file "smaller". This is necessary since rocprof can hang if the file is too "large". Here small and large refer to the time it takes to execute the kernels in the file
  • reduce the tuning space to make sure everything is not stuck

ybai62868 commented 7 months ago

Thanks for your information. I will try it one by one.


ybai62868 commented 7 months ago

Hi,

"It's very tricky when it comes to transpose. raw_data.T does not transpose the data in memory. If it was MxN and row major, i.e. N dim is the contiguous dim, it will be NxM and col major since N dim is still the contiguous dim. In the script, we use raw_data.T to change the order of the tensor without changing data in memory."

I still cannot understand this sentence. What does it mean that raw_data.T changes the order of the tensor without changing data in memory?


zhanglx13 commented 7 months ago

Let's say you have a tensor A with shape 2 x 4:

[1, 2, 3, 4
 5, 6, 7, 8]

This is the view of the tensor in math. But in memory, the elements of the tensor can be stored in different orders. For row major, memory looks like

1,2,3,4,5,6,7,8  <-- 7 has offset 6 in memory

For column major, it's like

1,5,2,6,3,7,4,8  <-- 7 has offset 5 in memory

When we access an element of the tensor, we usually write A[1,2], which means we want the element at (row 1, col 2) of the tensor, which is 7 in the example. But memory is always 1D, so there is no concept of row and col there. We need to translate [row 1, col 2] into an offset and then access the element at address_of_A + offset.

The offset computation is where row major and col major differ. For row major, offset(row 1, col 2) = 1 x 4 + 2 = 6. For column major, offset(row 1, col 2) = 1 x 1 + 2 x 2 = 5.

In general, we can introduce the concept of stride to combine the two formulas: offset(row, col) = row x row_stride + col x col_stride. Therefore, when we compute the 1D offset, we don't need to ask whether the tensor is row major or col major; we only need to know the stride of each dimension. Stride means the distance in memory between two consecutive elements along that dimension. Take row major as an example. Elements A[0,0] and A[0,1] are consecutive along the col dim, and their distance in memory is 1, so col_stride=1 in the row major case. Elements A[0,0] and A[1,0] are consecutive along the row dim, and their distance in memory is 4, so row_stride=4 in the row major case. Now you should be able to figure out the strides for the col major case.
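
For illustration, here is the same offset arithmetic as a small, self-contained Python sketch (not part of tune_gemm.py), using the 2 x 4 example above:

# offset(row, col) = row * row_stride + col * col_stride
def offset(row, col, strides):
    row_stride, col_stride = strides
    return row * row_stride + col * col_stride

row_major_mem = [1, 2, 3, 4, 5, 6, 7, 8]    # memory of the row-major layout
col_major_mem = [1, 5, 2, 6, 3, 7, 4, 8]    # memory of the col-major layout

print(row_major_mem[offset(1, 2, (4, 1))])  # 7, offset 6 with strides (4, 1)
print(col_major_mem[offset(1, 2, (1, 2))])  # 7, offset 5 with strides (1, 2)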

Now we can talk about transpose. When you transpose a tensor, what you care about is that, in math, tensor A becomes

[1, 5
 2, 6
 3, 7
 4, 8]

Let's assume the tensor is row major in memory before the transpose, and its strides are [4, 1], i.e. row_stride=4 and col_stride=1. After the transpose, we do not want to move elements around in memory, so the elements in memory are still

1,2,3,4,5,6,7,8

This is the col major memory layout for the transposed tensor, so its strides should be [1, 4], i.e. row_stride=1 and col_stride=4. With the new strides, if you want to access A[2, 1], the offset is 2 x row_stride + 1 x col_stride = 6, which is 7. So to transpose a tensor, we can just swap the strides without moving elements in memory.

The last piece is about order. In Triton, order tells you which dimension is the contiguous dim. order = [0, 1] means dim 0 is contiguous, which means the tensor is col major.
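
A minimal PyTorch sketch of the points above (illustrative only; the tensors in tune_gemm.py are set up differently):

import torch

a = torch.arange(1, 9, dtype=torch.float32).reshape(2, 4)  # row major, strides (4, 1)
at = a.T                                                    # logical 4x2 view, strides (1, 4)

print(a.stride(), at.stride())        # (4, 1) (1, 4): only the strides are swapped
print(a.data_ptr() == at.data_ptr())  # True: no element was moved in memory
print(at[2, 1])                       # tensor(7.): offset = 2*1 + 1*4 = 6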