bbiiggppiigg opened this issue 3 months ago
The following code snippet is extracted from a kernel in the Rodinia benchmark suite. Three loads happen at lines 2, 3, and 10, and an s_waitcnt instruction sits at line 14.
A PC sample at line 14 tells us that we are waiting for instructions that increment lgkmcnt, but not which load we are waiting on.
If we expand the single s_waitcnt instruction into 3 s_waitcnt instructions waiting for lgkmcnt values 2, 1, and 0 respectively, the PC sample should tell us which of these loads might be the bottleneck.
Note that if the load on line 2 has high latency, its exposed latency will shadow some of the latency of the subsequent loads. If we observe waiting on the load from line 2 and no waiting on the loads from lines 3 and 10, significantly reducing the latency of the first load might just expose latency waiting on the other two.
1 <_Z22bpnn_layerforward_CUDAPfS_S_S_ii>:
2 s_load_dwordx4 s[8:11], s[4:5], 0x10
3 s_load_dword s6, s[4:5], 0x24
4 v_mov_b32_e32 v3, 0x400
5 v_lshl_add_u32 v2, s7, 4, v1
6 v_cmp_eq_u32_e32 vcc, 0, v0
7 v_lshl_add_u32 v4, v1, 2, v3
8 s_and_saveexec_b64 s[2:3], vcc
9 s_cbranch_execz 17
10 s_load_dwordx2 s[0:1], s[4:5], 0x0
11 v_add_u32_e32 v5, 1, v2
12 v_ashrrev_i32_e32 v6, 31, v5
13 v_lshlrev_b64 v[5:6], 2, v[5:6]
14 s_waitcnt lgkmcnt(0)
…
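The expansion proposed above can be sketched as a small post-processing step over the disassembly. This is a hypothetical helper for illustration only, not an actual LLVM pass or flag; it just produces the descending-counter chain that replaces a single wait covering several outstanding scalar loads:

```python
# Hypothetical sketch: expand one s_waitcnt covering `pending` outstanding
# lgkmcnt events into a chain of single-step waits, so that a PC sample
# lands on the wait for the specific load that is still in flight.
def expand_lgkm_wait(pending: int, target: int = 0) -> list[str]:
    return [f"s_waitcnt lgkmcnt({n})" for n in range(pending - 1, target - 1, -1)]

# Example 1 has 3 outstanding scalar loads at the wait site:
for w in expand_lgkm_wait(3):
    print(w)
# s_waitcnt lgkmcnt(2)
# s_waitcnt lgkmcnt(1)
# s_waitcnt lgkmcnt(0)
```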
The next example is a little more complicated. Lines 7 and 14 contain two loads of different types: the global load on line 7 increments vmcnt, while the scalar load on line 14 increments lgkmcnt. In this case we can't really know which load will finish first, but it doesn't hurt to expand the wait into s_waitcnt vmcnt(0) followed by s_waitcnt lgkmcnt(0). Either 1) we only hit 15-1, meaning line 7 is causing some delay and the delay for line 14 is no longer observable after we waited for line 7, or 2) we hit both 15-1 and 15-2, meaning both loads are potential bottlenecks.
1 s_add_u32 s8, s2, s6
2 v_lshlrev_b64 v[0:1], 2, v[0:1]
3 s_addc_u32 s9, s3, s7
4 v_mov_b32_e32 v3, s9
5 v_add_co_u32_e32 v2, vcc, s8, v0
6 v_addc_co_u32_e32 v3, vcc, v3, v1, vcc
7 global_load_dword v2, v[2:3], off
8 s_ashr_i32 s1, s0, 31
9 s_lshl_b64 s[0:1], s[0:1], 2
10 s_add_u32 s0, s6, s0
11 s_addc_u32 s1, s7, s1
12 s_add_u32 s0, s0, s2
13 s_addc_u32 s1, s1, s3
14 s_load_dword s6, s[0:1], 0x0
15 s_waitcnt vmcnt(0) lgkmcnt(0)
...
15-1 s_waitcnt vmcnt(0) ; pass
15-2 s_waitcnt lgkmcnt(0) ; if hit, waiting on line 14
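The two-way split above can be sketched the same way (a hypothetical helper; the counter names and the vmcnt-first order are assumptions, not a prescribed implementation):

```python
# Hypothetical sketch: split a combined s_waitcnt into one wait per counter,
# so a sample on the second wait implicates the lgkm (scalar) load alone.
def split_mixed_wait(counters: dict[str, int]) -> list[str]:
    # dict insertion order chooses which counter drains first (vmcnt here)
    return [f"s_waitcnt {name}({val})" for name, val in counters.items()]

print(split_mixed_wait({"vmcnt": 0, "lgkmcnt": 0}))
# ['s_waitcnt vmcnt(0)', 's_waitcnt lgkmcnt(0)']
```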
The last example is extracted from the QuickSilver application. Here the stores on lines 3 through 14 increment vmcnt (as does the global load on line 1), and the flat load instruction on line 36 increments both lgkmcnt and vmcnt. We are not entirely sure what the right way to expand the s_waitcnt instruction on line 38 is, but it would be nice to do it in some way that lets us tell apart whether we are waiting for an instruction from lines 3–14 or for line 36.
According to the document of
1 global_load_dwordx2 v[0:1], v2, s[8:9]
2 s_movk_i32 s8, 0x88
3 buffer_store_dword v2, off, s[0:3], 0 offset:20
4 buffer_store_dword v2, off, s[0:3], 0 offset:16
5 buffer_store_dword v2, off, s[0:3], 0 offset:28
6 buffer_store_dword v2, off, s[0:3], 0 offset:24
7 buffer_store_dword v2, off, s[0:3], 0 offset:36
8 buffer_store_dword v2, off, s[0:3], 0 offset:32
9 buffer_store_dword v2, off, s[0:3], 0 offset:44
10 buffer_store_dword v2, off, s[0:3], 0 offset:40
11 buffer_store_dword v2, off, s[0:3], 0 offset:52
12 buffer_store_dword v2, off, s[0:3], 0 offset:48
13 buffer_store_dword v2, off, s[0:3], 0 offset:60
14 buffer_store_dword v2, off, s[0:3], 0 offset:56
15 v_mov_b32_e32 v82, 0
16 v_cndmask_b32_e32 v5, 0, v5, vcc
17 v_cndmask_b32_e32 v4, 0, v4, vcc
18 v_mov_b32_e32 v83, 0
19 v_mov_b32_e32 v54, v82
20 v_mov_b32_e32 v52, v82
21 v_mov_b32_e32 v63, v82
22 v_mov_b32_e32 v34, v82
23 v_mov_b32_e32 v32, v82
24 v_mov_b32_e32 v55, v83
25 v_mov_b32_e32 v53, v83
26 v_mov_b32_e32 v64, v83
27 v_mov_b32_e32 v35, v83
28 v_mov_b32_e32 v33, v83
29 v_readlane_b32 s10, v135, 2
30 v_readlane_b32 s11, v135, 3
31 s_waitcnt vmcnt(12)
32 v_mad_i64_i32 v[2:3], s[6:7], v66, s8, v[0:1]
33 v_cmp_ne_u64_e32 vcc, v[2:3], v[4:5]
34 s_and_saveexec_b64 s[6:7], vcc
35 s_cbranch_execz BB0_6
36 flat_load_dwordx4 v[32:35], v[2:3]
37 v_mad_i64_i32 v[2:3], s[10:11], v66, s8, v[0:1]
38 s_waitcnt vmcnt(0) lgkmcnt(0)
Tracking internally as [SWDEV-448069].
https://github.com/llvm/llvm-project/pull/79236 overlaps with this. The only difference, as far as I can tell, is that +precise-memory will emit zero waitcnts instead of non-zero values.
@bbiiggppiigg For the 3 examples provided, could you please (1) list the desired code with s_waitcnt inserted at the desired places with the appropriate counters, and (2) list the bitcode (.ll)? Thanks!
@jwanggit86 An example for Example 1 would be
1 <_Z22bpnn_layerforward_CUDAPfS_S_S_ii>:
2 s_load_dwordx4 s[8:11], s[4:5], 0x10
3 s_load_dword s6, s[4:5], 0x24
4 v_mov_b32_e32 v3, 0x400
5 v_lshl_add_u32 v2, s7, 4, v1
6 v_cmp_eq_u32_e32 vcc, 0, v0
7 v_lshl_add_u32 v4, v1, 2, v3
8 s_and_saveexec_b64 s[2:3], vcc
9 s_cbranch_execz 17
10 s_load_dwordx2 s[0:1], s[4:5], 0x0
11 v_add_u32_e32 v5, 1, v2
12 v_ashrrev_i32_e32 v6, 31, v5
13 v_lshlrev_b64 v[5:6], 2, v[5:6]
**14 s_waitcnt lgkmcnt(2)
15 s_waitcnt lgkmcnt(1)
16 s_waitcnt lgkmcnt(0)**
An example for Case 2 would be
1 s_add_u32 s8, s2, s6
2 v_lshlrev_b64 v[0:1], 2, v[0:1]
3 s_addc_u32 s9, s3, s7
4 v_mov_b32_e32 v3, s9
5 v_add_co_u32_e32 v2, vcc, s8, v0
6 v_addc_co_u32_e32 v3, vcc, v3, v1, vcc
7 global_load_dword v2, v[2:3], off
8 s_ashr_i32 s1, s0, 31
9 s_lshl_b64 s[0:1], s[0:1], 2
10 s_add_u32 s0, s6, s0
11 s_addc_u32 s1, s7, s1
12 s_add_u32 s0, s0, s2
13 s_addc_u32 s1, s1, s3
14 s_load_dword s6, s[0:1], 0x0
**15 s_waitcnt vmcnt(0)
16 s_waitcnt lgkmcnt(0)**
For case 3, as we've mentioned, we are not sure what the right expansion order is. One possibility is the following. There are 12 outstanding loads/stores that affect vmcnt and 1 load that affects lgkmcnt at the final wait, so we can expand it into 12 waits on vmcnt (counting down from 11 to 0) and 1 wait on lgkmcnt; probably any partial order is fine.
1 global_load_dwordx2 v[0:1], v2, s[8:9]
2 s_movk_i32 s8, 0x88
3 buffer_store_dword v2, off, s[0:3], 0 offset:20
4 buffer_store_dword v2, off, s[0:3], 0 offset:16
5 buffer_store_dword v2, off, s[0:3], 0 offset:28
6 buffer_store_dword v2, off, s[0:3], 0 offset:24
7 buffer_store_dword v2, off, s[0:3], 0 offset:36
8 buffer_store_dword v2, off, s[0:3], 0 offset:32
9 buffer_store_dword v2, off, s[0:3], 0 offset:44
10 buffer_store_dword v2, off, s[0:3], 0 offset:40
11 buffer_store_dword v2, off, s[0:3], 0 offset:52
12 buffer_store_dword v2, off, s[0:3], 0 offset:48
13 buffer_store_dword v2, off, s[0:3], 0 offset:60
14 buffer_store_dword v2, off, s[0:3], 0 offset:56
15 v_mov_b32_e32 v82, 0
16 v_cndmask_b32_e32 v5, 0, v5, vcc
17 v_cndmask_b32_e32 v4, 0, v4, vcc
18 v_mov_b32_e32 v83, 0
19 v_mov_b32_e32 v54, v82
20 v_mov_b32_e32 v52, v82
21 v_mov_b32_e32 v63, v82
22 v_mov_b32_e32 v34, v82
23 v_mov_b32_e32 v32, v82
24 v_mov_b32_e32 v55, v83
25 v_mov_b32_e32 v53, v83
26 v_mov_b32_e32 v64, v83
27 v_mov_b32_e32 v35, v83
28 v_mov_b32_e32 v33, v83
29 v_readlane_b32 s10, v135, 2
30 v_readlane_b32 s11, v135, 3
31 s_waitcnt vmcnt(12)
32 v_mad_i64_i32 v[2:3], s[6:7], v66, s8, v[0:1]
33 v_cmp_ne_u64_e32 vcc, v[2:3], v[4:5]
34 s_and_saveexec_b64 s[6:7], vcc
35 s_cbranch_execz BB0_6
36 flat_load_dwordx4 v[32:35], v[2:3]
37 v_mad_i64_i32 v[2:3], s[10:11], v66, s8, v[0:1]
**38 s_waitcnt vmcnt(11)
39 s_waitcnt vmcnt(10)
...
48 s_waitcnt vmcnt(1)
49 s_waitcnt vmcnt(0)
50 s_waitcnt lgkmcnt(0)**
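One possible expansion order for case 3 can be sketched as follows (a hypothetical helper; draining vmcnt before lgkmcnt is an assumption, and any interleaving that separates the two groups would serve the same purpose):

```python
# Hypothetical sketch: expand a combined s_waitcnt vmcnt(0) lgkmcnt(0) into
# single-step waits, draining vmcnt first and then lgkmcnt.
def expand_wait(vm_pending: int, lgkm_pending: int) -> list[str]:
    waits = [f"s_waitcnt vmcnt({n})" for n in range(vm_pending - 1, -1, -1)]
    waits += [f"s_waitcnt lgkmcnt({n})" for n in range(lgkm_pending - 1, -1, -1)]
    return waits

# Case 3: 12 outstanding vm events and 1 lgkm event at the final wait.
seq = expand_wait(12, 1)
print(seq[0], "...", seq[-1], f"({len(seq)} waits)")
# s_waitcnt vmcnt(11) ... s_waitcnt lgkmcnt(0) (13 waits)
```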
As for the bitcode, I tried -emit-llvm to generate LLVM bitcode and llvm-dis to disassemble it into LLVM IR, but I don't see any waitcnt instructions in it. (That is expected: waitcnt instructions are inserted by the backend during code generation, so they never appear at the IR level.)
Suggestion Description
From the kernels we have studied, it is not uncommon to see a single waitcnt instruction used to wait for multiple load instructions to finish loading their values.
Once the PC-sampling feature is enabled on these kernels, we expect to see a non-negligible number of PC samples reported at the waitcnt instructions.
In order to figure out which load instruction might be the bottleneck, it would be nice to have a compiler option that expands a single waitcnt instruction for value N into a series of waitcnt instructions with values decreasing from N+k-1 down to N, when the waitcnt instruction is waiting for k loads to complete.
Please note that we are NOT looking for the existing compiler option -amdgpu-waitcnt-forcezero, which adds an s_waitcnt(0) after every instruction, as we still want to hide memory load latency with compute instructions as much as possible.
Operating System
No response
GPU
MI200 / MI250 / MI300
ROCm Component
No response