bbiiggppiigg opened this issue 3 months ago
The following code snippet is extracted from a kernel in the Rodinia benchmark suite. Three loads happen at lines 2, 3, and 10, and an s_waitcnt instruction sits at line 14.
A PC sample at line 14 tells us that we are waiting for instructions that increment lgkmcnt, but not which load we are waiting on.
If we expand the single s_waitcnt instruction into 3 s_waitcnt instructions waiting for lgkmcnt values 2, 1, and 0 respectively, the PC sample should tell us which of these loads might be the bottleneck.
Note that if the load on line 2 has high latency, its exposed latency will shadow some of the latency of the subsequent loads. If we observe waiting on the load from line 2 and no waiting on the loads from lines 3 and 10, significantly reducing the latency of the first load might just expose latency waiting on the other two.
1 <_Z22bpnn_layerforward_CUDAPfS_S_S_ii>:
2 s_load_dwordx4 s[8:11], s[4:5], 0x10
3 s_load_dword s6, s[4:5], 0x24
4 v_mov_b32_e32 v3, 0x400
5 v_lshl_add_u32 v2, s7, 4, v1
6 v_cmp_eq_u32_e32 vcc, 0, v0
7 v_lshl_add_u32 v4, v1, 2, v3
8 s_and_saveexec_b64 s[2:3], vcc
9 s_cbranch_execz 17
10 s_load_dwordx2 s[0:1], s[4:5], 0x0
11 v_add_u32_e32 v5, 1, v2
12 v_ashrrev_i32_e32 v6, 31, v5
13 v_lshlrev_b64 v[5:6], 2, v[5:6]
14 s_waitcnt lgkmcnt(0)
…
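The expansion proposed above can be sketched as a small post-processing step over the disassembly. This is a hypothetical helper for illustration only, not an actual LLVM pass or flag; it just produces the descending-counter chain that replaces a single wait covering several outstanding scalar loads:

```python
# Hypothetical sketch: expand one s_waitcnt covering `pending` outstanding
# lgkmcnt events into a chain of single-step waits, so that a PC sample
# lands on the wait for the specific load that is still in flight.
def expand_lgkm_wait(pending: int, target: int = 0) -> list[str]:
    return [f"s_waitcnt lgkmcnt({n})" for n in range(pending - 1, target - 1, -1)]

# Example 1 has 3 outstanding scalar loads at the wait site:
for w in expand_lgkm_wait(3):
    print(w)
# s_waitcnt lgkmcnt(2)
# s_waitcnt lgkmcnt(1)
# s_waitcnt lgkmcnt(0)
```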
The next example is a little more complicated. Lines 7 and 14 contain two loads of different types: the global load on line 7 increments vmcnt, while the scalar load on line 14 increments lgkmcnt. In this case we can't really know which load will finish first, but it doesn't hurt to expand the wait into s_waitcnt vmcnt(0) followed by s_waitcnt lgkmcnt(0). Either 1) we only hit 15-1, meaning line 7 is causing some delay and the delay for line 14 is no longer observable after we waited for line 7, or 2) we hit both 15-1 and 15-2, meaning both loads are potential bottlenecks.
1 s_add_u32 s8, s2, s6
2 v_lshlrev_b64 v[0:1], 2, v[0:1]
3 s_addc_u32 s9, s3, s7
4 v_mov_b32_e32 v3, s9
5 v_add_co_u32_e32 v2, vcc, s8, v0
6 v_addc_co_u32_e32 v3, vcc, v3, v1, vcc
7 global_load_dword v2, v[2:3], off
8 s_ashr_i32 s1, s0, 31
9 s_lshl_b64 s[0:1], s[0:1], 2
10 s_add_u32 s0, s6, s0
11 s_addc_u32 s1, s7, s1
12 s_add_u32 s0, s0, s2
13 s_addc_u32 s1, s1, s3
14 s_load_dword s6, s[0:1], 0x0
15 s_waitcnt vmcnt(0) lgkmcnt(0)
...
15-1 s_waitcnt vmcnt(0) ; pass
15-2 s_waitcnt lgkmcnt(0) ; if hit, waiting on line 14
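The two-way split above can be sketched the same way (a hypothetical helper; the counter names and the vmcnt-first order are assumptions, not a prescribed implementation):

```python
# Hypothetical sketch: split a combined s_waitcnt into one wait per counter,
# so a sample on the second wait implicates the lgkm (scalar) load alone.
def split_mixed_wait(counters: dict[str, int]) -> list[str]:
    # dict insertion order chooses which counter drains first (vmcnt here)
    return [f"s_waitcnt {name}({val})" for name, val in counters.items()]

print(split_mixed_wait({"vmcnt": 0, "lgkmcnt": 0}))
# ['s_waitcnt vmcnt(0)', 's_waitcnt lgkmcnt(0)']
```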
The last example is extracted from the QuickSilver application. Here the stores on lines 3 through 14 increment vmcnt (as does the global load on line 1), and the flat load instruction on line 36 increments both lgkmcnt and vmcnt. We are not entirely sure what the right way to expand the s_waitcnt instruction on line 38 is, but it would be nice to do it in some way that lets us tell apart whether we are waiting for an instruction from lines 3–14 or for line 36.
According to the document of
1 global_load_dwordx2 v[0:1], v2, s[8:9]
2 s_movk_i32 s8, 0x88
3 buffer_store_dword v2, off, s[0:3], 0 offset:20
4 buffer_store_dword v2, off, s[0:3], 0 offset:16
5 buffer_store_dword v2, off, s[0:3], 0 offset:28
6 buffer_store_dword v2, off, s[0:3], 0 offset:24
7 buffer_store_dword v2, off, s[0:3], 0 offset:36
8 buffer_store_dword v2, off, s[0:3], 0 offset:32
9 buffer_store_dword v2, off, s[0:3], 0 offset:44
10 buffer_store_dword v2, off, s[0:3], 0 offset:40
11 buffer_store_dword v2, off, s[0:3], 0 offset:52
12 buffer_store_dword v2, off, s[0:3], 0 offset:48
13 buffer_store_dword v2, off, s[0:3], 0 offset:60
14 buffer_store_dword v2, off, s[0:3], 0 offset:56
15 v_mov_b32_e32 v82, 0
16 v_cndmask_b32_e32 v5, 0, v5, vcc
17 v_cndmask_b32_e32 v4, 0, v4, vcc
18 v_mov_b32_e32 v83, 0
19 v_mov_b32_e32 v54, v82
20 v_mov_b32_e32 v52, v82
21 v_mov_b32_e32 v63, v82
22 v_mov_b32_e32 v34, v82
23 v_mov_b32_e32 v32, v82
24 v_mov_b32_e32 v55, v83
25 v_mov_b32_e32 v53, v83
26 v_mov_b32_e32 v64, v83
27 v_mov_b32_e32 v35, v83
28 v_mov_b32_e32 v33, v83
29 v_readlane_b32 s10, v135, 2
30 v_readlane_b32 s11, v135, 3
31 s_waitcnt vmcnt(12)
32 v_mad_i64_i32 v[2:3], s[6:7], v66, s8, v[0:1]
33 v_cmp_ne_u64_e32 vcc, v[2:3], v[4:5]
34 s_and_saveexec_b64 s[6:7], vcc
35 s_cbranch_execz BB0_6
36 flat_load_dwordx4 v[32:35], v[2:3]
37 v_mad_i64_i32 v[2:3], s[10:11], v66, s8, v[0:1]
38 s_waitcnt vmcnt(0) lgkmcnt(0)
Tracking internally as [SWDEV-448069].
https://github.com/llvm/llvm-project/pull/79236 overlaps with this. The only difference, as far as I can tell, is that +precise-memory will emit zero waitcnts instead of non-zero values.
@bbiiggppiigg For the 3 examples provided, could you please (1) list the desired code with s_waitcnt inserted at the desired places with the appropriate counters, and (2) list the bitcode (.ll)? Thanks!
@jwanggit86 An example for Example 1 would be
1 <_Z22bpnn_layerforward_CUDAPfS_S_S_ii>:
2 s_load_dwordx4 s[8:11], s[4:5], 0x10
3 s_load_dword s6, s[4:5], 0x24
4 v_mov_b32_e32 v3, 0x400
5 v_lshl_add_u32 v2, s7, 4, v1
6 v_cmp_eq_u32_e32 vcc, 0, v0
7 v_lshl_add_u32 v4, v1, 2, v3
8 s_and_saveexec_b64 s[2:3], vcc
9 s_cbranch_execz 17
10 s_load_dwordx2 s[0:1], s[4:5], 0x0
11 v_add_u32_e32 v5, 1, v2
12 v_ashrrev_i32_e32 v6, 31, v5
13 v_lshlrev_b64 v[5:6], 2, v[5:6]
**14 s_waitcnt lgkmcnt(2)
15 s_waitcnt lgkmcnt(1)
16 s_waitcnt lgkmcnt(0)**
An example for Case 2 would be
1 s_add_u32 s8, s2, s6
2 v_lshlrev_b64 v[0:1], 2, v[0:1]
3 s_addc_u32 s9, s3, s7
4 v_mov_b32_e32 v3, s9
5 v_add_co_u32_e32 v2, vcc, s8, v0
6 v_addc_co_u32_e32 v3, vcc, v3, v1, vcc
7 global_load_dword v2, v[2:3], off
8 s_ashr_i32 s1, s0, 31
9 s_lshl_b64 s[0:1], s[0:1], 2
10 s_add_u32 s0, s6, s0
11 s_addc_u32 s1, s7, s1
12 s_add_u32 s0, s0, s2
13 s_addc_u32 s1, s1, s3
14 s_load_dword s6, s[0:1], 0x0
**15 s_waitcnt vmcnt(0)
16 s_waitcnt lgkmcnt(0)**
For case 3, as we've mentioned, we are not sure what the right expansion order is. One possibility is the following. There are 12 outstanding loads/stores that affect vmcnt and 1 load that affects lgkmcnt at the final wait, so we can expand it into 12 waits on vmcnt (counting down from 11 to 0) and 1 wait on lgkmcnt; probably any partial order is fine.
1 global_load_dwordx2 v[0:1], v2, s[8:9]
2 s_movk_i32 s8, 0x88
3 buffer_store_dword v2, off, s[0:3], 0 offset:20
4 buffer_store_dword v2, off, s[0:3], 0 offset:16
5 buffer_store_dword v2, off, s[0:3], 0 offset:28
6 buffer_store_dword v2, off, s[0:3], 0 offset:24
7 buffer_store_dword v2, off, s[0:3], 0 offset:36
8 buffer_store_dword v2, off, s[0:3], 0 offset:32
9 buffer_store_dword v2, off, s[0:3], 0 offset:44
10 buffer_store_dword v2, off, s[0:3], 0 offset:40
11 buffer_store_dword v2, off, s[0:3], 0 offset:52
12 buffer_store_dword v2, off, s[0:3], 0 offset:48
13 buffer_store_dword v2, off, s[0:3], 0 offset:60
14 buffer_store_dword v2, off, s[0:3], 0 offset:56
15 v_mov_b32_e32 v82, 0
16 v_cndmask_b32_e32 v5, 0, v5, vcc
17 v_cndmask_b32_e32 v4, 0, v4, vcc
18 v_mov_b32_e32 v83, 0
19 v_mov_b32_e32 v54, v82
20 v_mov_b32_e32 v52, v82
21 v_mov_b32_e32 v63, v82
22 v_mov_b32_e32 v34, v82
23 v_mov_b32_e32 v32, v82
24 v_mov_b32_e32 v55, v83
25 v_mov_b32_e32 v53, v83
26 v_mov_b32_e32 v64, v83
27 v_mov_b32_e32 v35, v83
28 v_mov_b32_e32 v33, v83
29 v_readlane_b32 s10, v135, 2
30 v_readlane_b32 s11, v135, 3
31 s_waitcnt vmcnt(12)
32 v_mad_i64_i32 v[2:3], s[6:7], v66, s8, v[0:1]
33 v_cmp_ne_u64_e32 vcc, v[2:3], v[4:5]
34 s_and_saveexec_b64 s[6:7], vcc
35 s_cbranch_execz BB0_6
36 flat_load_dwordx4 v[32:35], v[2:3]
37 v_mad_i64_i32 v[2:3], s[10:11], v66, s8, v[0:1]
**38 s_waitcnt vmcnt(11)
39 s_waitcnt vmcnt(10)
...
48 s_waitcnt vmcnt(1)
49 s_waitcnt vmcnt(0)
50 s_waitcnt lgkmcnt(0)**
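One possible expansion order for case 3 can be sketched as follows (a hypothetical helper; draining vmcnt before lgkmcnt is an assumption, and any interleaving that separates the two groups would serve the same purpose):

```python
# Hypothetical sketch: expand a combined s_waitcnt vmcnt(0) lgkmcnt(0) into
# single-step waits, draining vmcnt first and then lgkmcnt.
def expand_wait(vm_pending: int, lgkm_pending: int) -> list[str]:
    waits = [f"s_waitcnt vmcnt({n})" for n in range(vm_pending - 1, -1, -1)]
    waits += [f"s_waitcnt lgkmcnt({n})" for n in range(lgkm_pending - 1, -1, -1)]
    return waits

# Case 3: 12 outstanding vm events and 1 lgkm event at the final wait.
seq = expand_wait(12, 1)
print(seq[0], "...", seq[-1], f"({len(seq)} waits)")
# s_waitcnt vmcnt(11) ... s_waitcnt lgkmcnt(0) (13 waits)
```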
As for the bitcode, I tried -emit-llvm to generate LLVM bitcode and llvm-dis to disassemble it into LLVM IR, but I don't see any waitcnt instructions in it. (That is expected: waitcnt instructions are inserted by the backend during code generation, so they never appear at the IR level.)
Suggestion Description
From the kernels we have studied, it is not uncommon to see a single waitcnt instruction used to wait for multiple load instructions to finish loading their values.
Once the PC-sampling feature is enabled on these kernels, we expect to see a non-negligible number of PC samples reported at the waitcnt instructions.
In order to figure out which load instruction might be the bottleneck, it would be nice to have a compiler option that expands a single waitcnt instruction for value N into a series of waitcnt instructions with values decreasing from N+k-1 down to N, when the waitcnt instruction is waiting for k loads to complete.
Please note that we are NOT looking for the existing compiler option -amdgpu-waitcnt-forcezero, which adds an s_waitcnt(0) after every instruction, as we still want to hide memory load latency with compute instructions as much as possible.
Operating System
No response
GPU
MI200 / MI250 / MI300
ROCm Component
No response