cloudcores / CuAssembler

An unofficial cuda assembler, for all generations of SASS, hopefully :)
MIT License

Some confusions about stall count #4

Closed AndrewBoWen666 closed 2 years ago

AndrewBoWen666 commented 3 years ago

I am trying to interleave LD and FFMA instructions for maximum throughput. For the LDG instruction, I notice the stall count is always 4, for example:

[----:B------:R-:W2:-:S04] /0360/ LDG.E.SYS R48, [R2] ;
[----:B------:R-:W2:-:S04] /0370/ LDG.E.SYS R50, [R4] ;
[----:B------:R-:W3:-:S04] /0380/ LDG.E.SYS R52, [R2+0x4] ;
[----:B------:R-:W4:-:S04] /0390/ LDG.E.SYS R54, [R2+0x8] ;
[----:B------:R-:W5:-:S04] /03a0/ LDG.E.SYS R56, [R2+0xc] ;
[----:B------:R-:W5:-:S04] /03b0/ LDG.E.SYS R58, [R2+0x10] ;

However, a global load instruction actually has variable latency. An instruction that uses the operand loaded by LDG has to be synchronized with a write barrier rather than a stall count. So why is the stall count of LDG set to 4?

Another confusion is about the stall count for fixed-latency instructions, such as:

[-R--:B------:R-:W-:-:S02] /0540/ FFMA R22, R54, R66.reuse, R22 ;
[-R--:B------:R-:W-:-:S02] /0550/ FFMA R20, R56, R66.reuse, R20 ;
[-R--:B------:R-:W-:-:S02] /0560/ FFMA R18, R58, R66.reuse, R18 ;
[-R--:B------:R-:W-:-:S02] /0570/ FFMA R16, R60, R66.reuse, R16 ;
[-R--:B------:R-:W-:-:S02] /0580/ FFMA R14, R62, R66.reuse, R14 ;

As we can see, the stall count is set to 2. The author of Maxas states that the pipeline of an arithmetic instruction is 6 cycles on the Maxwell architecture. However, the author implements dual issue as follows (cited from Maxas):

--:-:-:-:0 FFMA cx02y00, j0Ax02, j0By00, cx02y00;
--:-:-:-:1 LDS.U.128 j1Ax00, [readAs + 4x<1*128 + 00>];
--:-:-:-:1 FFMA cx02y01, j0Ax02, j0By01, cx02y01;
--:-:-:-:0 FFMA cx00y01, j0Ax00, j0By01, cx00y01;
--:-:-:-:1 LDS.U.128 j1By00, [readBs + 4x<1*128 + 00>];
--:-:-:-:1 FFMA cx00y00, j0Ax00, j0By00, cx00y00;
--:-:-:-:0 FFMA cx03y00, j0Ax03, j0By00, cx03y00;
--:-:-:-:1 LDS.U.128 j1Ax64, [readAs + 4x<1*128 + 64>];
--:-:-:-:1 FFMA cx03y01, j0Ax03, j0By01, cx03y01;
--:-:-:-:0 FFMA cx01y01, j0Ax01, j0By01, cx01y01;
--:2:1:-:2 LDS.U.128 j1By64, [readBs + 4x<1*128 + 64>]; // Set Dep 1,2 Stall 2

02:-:-:-:1 IADD readAs, readAs, 16; // Wait Dep 2
--:-:-:-:1 IADD readBs, readBs, 16;

01:-:-:-:1 FFMA cx02y00, j1Ax02, j1By00, cx02y00; // Wait Dep 1

I understand that the first stall count of 0 is for dual issue. After a dual issue, there is an FFMA instruction (the 3rd instruction) following the LDS instruction, and its stall count is 1; after this FFMA, another dual issue is triggered. My understanding is that this stall count of 1 is only for issuing the instruction, and that all the instructions shown in this code are stored in the instruction buffer without waiting for them to finish. My question is: in my assembly code I don't use such a strategy, so why is the stall count set to 2? Does this imply that the latency of FFMA is 2 clock cycles? If that is not the case, how does a program ensure that the FFMA computation has finished and its result is safe to use in the following instruction?

In some code, like:

[----:B------:R-:W-:-:S02] /0470/ IADD3 R44, R44, 0x8, RZ ;
[----:B------:R-:W-:-:S02] /0480/ ISETP.GE.AND P0, PT, R86, 0x80, PT ;
[----:B------:R-:W-:-:S01] /0490/ IADD3 R46, R46, 0x8, RZ ;
[----:B--2---:R-:W-:-:S02] /04a0/ FFMA R42, R48, R50, R42 ;

you can see that for a fixed-latency instruction, such as IADD3 in this code, the stall count also varies between scenarios. For a stall count of 1, I would assume the difference is there to interleave instruction issue. But again, how should a program make sure the operand is ready for the following instruction?

cloudcores commented 3 years ago

Some knowledge on the microarchitecture may be needed to understand the control codes.

  1. Maxwell/Pascal has dual-issue capability (2 dispatch units per warp scheduler), thus stall 0 means the instruction can be dual issued with the next one, provided they use different dispatch ports. Volta/Turing/Ampere has no dual issue (1 dispatch unit per warp scheduler), thus a stall of at least 1 is required for every instruction (except padding NOPs). You may refer to this for more details.
  2. Every instruction has its maximum throughput. For LDG, it is four cycles per instruction, thus it should be stalled at least 4 cycles before issuing another LDG/LDS/LDL (they share the same dispatch port!). In Turing, FFMA is 2 cycles per instruction (32 lanes per warp, but only 16 FFMA function units per scheduler), thus an FFMA should be stalled 2 cycles before issuing another FFMA. For SM8.6 and Maxwell/Pascal, it's 1 cycle per instruction. LDG and FFMA use different dispatch ports, thus for Turing/Ampere, an FFMA following an LDG can be issued in the very next cycle (stall 1), and for Maxwell/Pascal they can be dual issued (stall 0).
  3. The stall count is only for issuing; it does not mean the result is ready. Variable-latency instructions rely on the scoreboard (called dependency barrier in Maxas) to prevent data hazards. Fixed-latency instructions should be stalled enough cycles before the result can be used (some other instructions can be inserted before the result is used, for better ILP).

For example, in Turing, FFMA has a latency of 4 cycles, thus the result can be used safely in this way:

:S02 FFMA R0, R1, R2, R3;
:S02 FFMA R4, R5, R6, R7; // R0 not ready
:S** FFMA R8, R0, ... // R0 ready, since at least 4 cycles are stalled (S02 + S02)
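The rule above (stall counts accumulate until they cover the producer's latency) can be checked with a small Python sketch. This is only a simplified model, not a description of the real hardware: it assumes each instruction issues exactly `stall` cycles after the previous one, and the helper names `issue_cycles` and `is_ready` are made up for illustration.

```python
# Simplified model (assumption): an instruction issues `stall` cycles after the
# previous one, and a fixed-latency result is usable `latency` cycles after issue.
FFMA_LATENCY = 4  # Turing figure quoted in the reply above


def issue_cycles(stalls):
    """Return the issue cycle of every instruction, given each one's stall count."""
    cycles, now = [], 0
    for stall in stalls:
        cycles.append(now)
        now += stall
    return cycles


def is_ready(stalls, producer, consumer, latency=FFMA_LATENCY):
    """True if `consumer` issues at least `latency` cycles after `producer`."""
    cycles = issue_cycles(stalls)
    return cycles[consumer] - cycles[producer] >= latency


# S02, S02, and a placeholder stall for the third FFMA (written S** above).
stalls = [2, 2, 2]
print(issue_cycles(stalls))    # [0, 2, 4]
print(is_ready(stalls, 0, 1))  # False: only 2 cycles after the first FFMA
print(is_ready(stalls, 0, 2))  # True: 2 + 2 = 4 cycles
```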

In your final example:

[----:B------:R-:W-:-:S02] /0470/ IADD3 R44, R44, 0x8, RZ ;
[----:B------:R-:W-:-:S02] /0480/ ISETP.GE.AND P0, PT, R86, 0x80, PT ;
[----:B------:R-:W-:-:S01] /0490/ IADD3 R46, R46, 0x8, RZ ;
[----:B--2---:R-:W-:-:S02] /04a0/ FFMA R42, R48, R50, R42 ;

IADD3/ISETP share the same dispatch port, but FFMA uses another one, thus the last IADD3 only needs to be stalled 1 cycle before the FFMA is issued.
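As a rough sketch of the issue-rate rule described in this reply, the minimum stall between two consecutive instructions on Turing could be looked up as follows. The port names are hypothetical labels, and the value of 2 cycles for IADD3/ISETP is an assumption inferred from the S02 in the listing above, not a documented figure. The sketch only covers issue throughput; it says nothing about when a result is ready.

```python
# Hypothetical port labels; only the grouping matters.
PORT = {"LDG": "mem", "FFMA": "fp32", "IADD3": "int", "ISETP": "int"}

# Cycles per instruction on Turing. LDG and FFMA are quoted in the reply;
# the integer value is inferred from the S02 in the listing (assumption).
CYCLES_PER_INST = {"LDG": 4, "FFMA": 2, "IADD3": 2, "ISETP": 2}


def min_stall(current, following):
    """Smallest stall on `current` so that `following` can issue (no dual issue)."""
    if PORT[current] == PORT[following]:
        return CYCLES_PER_INST[current]
    return 1  # different dispatch ports: the next cycle is enough


ops = ["IADD3", "ISETP", "IADD3", "FFMA"]
print([min_stall(a, b) for a, b in zip(ops, ops[1:])])  # [2, 2, 1]
```

Under these assumptions the result matches the S02, S02, S01 stall counts of the first three instructions in the listing.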

AndrewBoWen666 commented 3 years ago

@cloudcores Thanks for the reply. In the Maxas code I posted in my issue, the last 4 instructions:

--:2:1:-:2 LDS.U.128 j1By64, [readBs + 4x<1*128 + 64>]; // Set Dep 1,2 Stall 2
02:-:-:-:1 IADD readAs, readAs, 16; // Wait Dep 2
--:-:-:-:1 IADD readBs, readBs, 16;
01:-:-:-:1 FFMA cx02y00, j1Ax02, j1By00, cx02y00; // Wait Dep 1

use the read and write barriers. Does this work as a synchronization that guarantees the previous instructions have finished?

I read the Zhihu article linked in your comment. Regarding the discussion about dependency barriers in the comments of your article, I also have some questions I hope to clear up. In my assembly code, the dependency barriers are declared like this:

[----:B------:R-:W2:-:S04] /0360/ LDG.E.SYS R48, [R2] ;
[----:B------:R-:W2:-:S04] /0370/ LDG.E.SYS R50, [R4] ;
[----:B------:R-:W3:-:S04] /0380/ LDG.E.SYS R52, [R2+0x4] ;
[----:B------:R-:W4:-:S04] /0390/ LDG.E.SYS R54, [R2+0x8] ;
[----:B------:R-:W5:-:S04] /03a0/ LDG.E.SYS R56, [R2+0xc] ;
[----:B------:R-:W5:-:S04] /03b0/ LDG.E.SYS R58, [R2+0x10] ;
[----:B------:R-:W5:-:S04] /03c0/ LDG.E.SYS R60, [R2+0x14] ;
[----:B------:R-:W5:-:S04] /03d0/ LDG.E.SYS R62, [R2+0x18] ;
[----:B------:R-:W5:-:S04] /03e0/ LDG.E.SYS R64, [R2+0x1c] ;
[----:B------:R-:W5:-:S04] /03f0/ LDG.E.SYS R66, [R4+0x4] ;
[----:B------:R-:W5:-:S04] /0400/ LDG.E.SYS R68, [R4+0x8] ;
[----:B------:R-:W5:-:S04] /0410/ LDG.E.SYS R70, [R4+0xc] ;
[----:B------:R-:W5:-:S04] /0420/ LDG.E.SYS R72, [R4+0x10] ;

The last few write barriers are all set to 5. If I write an instruction like this:

[----:B-----5:R-:W-:Y:S02] /04d0/ FFMA R36, R50, R56, R36 ;

does this mean the FFMA instruction at 04d0 cannot be computed until all the instructions that set SB5 have finished, or can this FFMA be computed as soon as the instruction

[----:B------:R-:W5:-:S04] /03a0/ LDG.E.SYS R56, [R2+0xc] ;

has finished? Also, why don't these instructions just use one scoreboard, since they are all the same type of instruction?

You also mention an instruction like:

DEPBAR.LE SB0, 0x0, {2,1} ;

Is this instruction complete? If DEPBAR.LE SB0, 0x0 waits for the SB0 count to reach 0, what does {2,1} mean?

cloudcores commented 2 years ago

I'm afraid some statements in the maxas wiki may be a little misleading.

For a read/write scoreboard (or read/write dependency barrier in maxas), such as:

[B------:R0:W1] LDG R4, [R0]; // set read scoreboard 0, and write scoreboard 1
[B0-----:R-:W-] IADD3 R0, R0, 0x1, RZ; // wait SB0, which means R0 can be safely modified
[B-1----:R-:W-] IADD3 R4, R4, 0x1, RZ; // wait SB1, which means R4 already got the loaded value

The LDG may not cache R0 when it is issued, thus R0 cannot be modified before its value has actually been sent to the load/store unit. The read scoreboard guarantees that R0 has already been read by the LDG and can be overwritten. The write scoreboard guarantees that the loaded result is actually ready in R4 and can be used.

A scoreboard is not a boolean value; it holds an integer count. The count increments for every instruction that sets the scoreboard and decrements, down to 0, for every operation that completes. Some instructions do not modify the scoreboard.
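The counting behaviour just described can be pictured with a toy Python model. The class and method names are invented for illustration, and the real hardware behaviour is not officially documented, so treat this strictly as a sketch of the mental model.

```python
class Scoreboard:
    """Toy model of one scoreboard as a counter of outstanding operations."""

    def __init__(self):
        self.count = 0

    def issue(self):
        """An instruction that sets this scoreboard has been issued."""
        self.count += 1

    def complete(self):
        """One outstanding operation finished; the count never goes below 0."""
        if self.count > 0:
            self.count -= 1

    def resolved(self, threshold=0):
        """True once the count is at or below `threshold` (0 for an inline wait)."""
        return self.count <= threshold


sb = Scoreboard()
for _ in range(3):    # three loads that set the same scoreboard
    sb.issue()
print(sb.count)       # 3
sb.complete()
print(sb.resolved())  # False: two loads are still outstanding
sb.complete()
sb.complete()
print(sb.resolved())  # True: the count is back to 0
```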

Thus for a sequence of LDG like this:

[----:B------:R-:W5:-:S04] /03a0/ LDG.E.SYS R56, [R2+0xc] ; // SB5->1
[----:B------:R-:W5:-:S04] /03b0/ LDG.E.SYS R58, [R2+0x10] ; // SB5->2
[----:B------:R-:W5:-:S04] /03c0/ LDG.E.SYS R60, [R2+0x14] ; // SB5->3

DEPBAR.LE SB5, 0x2; // R56 ready, SB5->2
IADD3 R0, R56, ...

DEPBAR.LE SB5, 0x1; // R58 ready, SB5->1
...

DEPBAR.LE SB5, 0x0; // R60 ready, SB5->0
...

Then if the first LDG finishes (which means R56 is ready), the count of SB5 will decrement back to 2. Thus DEPBAR.LE SB5, 0x2; will guarantee the completion of the first load.

If you need to wait on a set of scoreboards, you may append the set at the end, such as:

DEPBAR.LE SB0, 0x0, {2,1} ;

This means you wait on scoreboards SB0, SB1 and SB2 until all of their counts are less than or equal to 0x0. Actually, waiting for a count less than or equal to 0 is exactly the semantics of the inline scoreboard wait, such as B012---:... Thus DEPBAR is usually used when you need to wait for a non-zero scoreboard count.
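A minimal Python sketch of this reading of DEPBAR.LE follows. It assumes, as stated above, that every listed scoreboard is compared against the same threshold; the function name `depbar_le_passes` and the dictionary of counts are illustrative only, and the actual semantics are not officially disclosed.

```python
def depbar_le_passes(counts, sb, threshold, extra=()):
    """Model of `DEPBAR.LE SB<sb>, threshold, {extra}`.

    Assumption: the wait passes once `sb` and every scoreboard listed in
    `extra` have a count less than or equal to `threshold`.
    """
    return all(counts[i] <= threshold for i in (sb, *extra))


counts = {0: 0, 1: 1, 2: 0, 5: 3}

print(depbar_le_passes(counts, 5, 2))          # False: SB5 is still 3
counts[5] = 2                                  # one load on SB5 completed
print(depbar_le_passes(counts, 5, 2))          # True

print(depbar_le_passes(counts, 0, 0, (2, 1)))  # False: SB1 is still 1
counts[1] = 0
print(depbar_le_passes(counts, 0, 0, (2, 1)))  # True, same as an inline B012--- wait
```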

Well, some details are not disclosed officially, those are all my personal opinions~

AndrewBoWen666 commented 2 years ago

@cloudcores So, based on your statement, if the code is like this:

[----:B------:R-:W5:-:S04] /03a0/ LDG.E.SYS R56, [R2+0xc] ; // SB5->1
[----:B------:R-:W5:-:S04] /03b0/ LDG.E.SYS R58, [R2+0x10] ; // SB5->2
[----:B------:R-:W5:-:S04] /03c0/ LDG.E.SYS R60, [R2+0x14] ; // SB5->3
[----:B-----5:R-:W-:-:S02] /03d0/ FFMA R36, R50, R56, R36 ; //???Wait until SB5->0???

can the FFMA not be computed until scoreboard 5 is down to 0?

cloudcores commented 2 years ago

Yes, I think so.

Since the scoreboard will not increment unless another relevant instruction is issued, resolving it before the result is actually used is also valid. That's why I said some of the maxas statements are a little misleading (depending on how you interpret them).

[----:B------:R-:W5:-:S04] /03a0/ LDG.E.SYS R56, [R2+0xc] ; // SB5->1
[----:B------:R-:W5:-:S04] /03b0/ LDG.E.SYS R58, [R2+0x10] ; // SB5->2
[----:B------:R-:W5:-:S04] /03c0/ LDG.E.SYS R60, [R2+0x14] ; // SB5->3
[----:B-----5:R-:W-:-:S02]        FFMA ....                  // Wait until SB5->0, but the result may not be used
[----:B-----5:R-:W-:-:S02] /03d0/ FFMA R36, R50, R56, R36 ; // SB5 is already 0, safe to use

Thus in maxas code:

--:2:1:-:2      LDS.U.128 j1By64, [readBs + 4x<1*128 + 64>]; // Set Dep 1,2  Stall 2 

02:-:-:-:1      IADD readAs, readAs, 16; // Wait Dep 2
--:-:-:-:1      IADD readBs, readBs, 16;

01:-:-:-:1      FFMA cx02y00, j1Ax02, j1By00, cx02y00; // Wait Dep 1

The first IADD does not use readBs at all, and FFMA also does not use j1By64, but it's still valid code.

I don't know why it's done this way, but I think resolving the scoreboard just before it matters will give it more time to complete, and thus make the current warp less likely to stall.

dongxiao92 commented 2 years ago

DEPBAR.LE SB5, 0x2; // R56 ready, SB5->2

Hi, thanks for your demonstration. I'm a little confused about how the scoreboard changes here. After the 3 LDG instructions have been issued, SB5 will be 3. If the last LDG finishes first because it hits in the L1/L2 cache while the other two LDGs miss, then SB5 will decrement to 2. Thus, the first barrier will be resolved and the next IADD3 instruction can be executed. But actually, the source operand R56 is not ready yet. If this situation cannot happen, does this indicate that LDGs finish in the order in which they are issued?

cloudcores commented 2 years ago

Since none of these issues are officially disclosed, there is no way to guarantee the correctness unless you try it by yourself...

As far as I know, many of the load/store instructions (of the same kind) complete in order, just for memory consistency. Instructions of different kinds are not likely to return in order. But as stated in maxas, S2R loading different registers may also return out of order. This may also be subject to change from hardware generation to generation.

According to my observation, NVIDIA official libraries now very rarely use DEPBAR with non-zero counts. Thus it seems that waiting for a scoreboard to drop to zero is always a right and safe choice, although not always a good one for performance.
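The ordering concern can be illustrated with a small Python sketch. It is purely a thought experiment about the counter model, not a claim about how any GPU actually returns loads; `safe_to_read` is a made-up helper.

```python
def safe_to_read(pending, wanted, threshold):
    """`pending` lists destination registers of loads still in flight on one scoreboard.

    Returns (wait_passes, register_ready) for a wait with the given threshold.
    """
    return len(pending) <= threshold, wanted not in pending


issued = ["R56", "R58", "R60"]  # three loads sharing one scoreboard

# In-order return: the oldest load (R56) completes first.
print(safe_to_read(issued[1:], "R56", 2))  # (True, True)

# Out-of-order return: the newest load (R60) completes first.
print(safe_to_read(issued[:2], "R56", 2))  # (True, False): the wait passes too early

# Waiting for the count to reach zero is safe in either case.
print(safe_to_read([], "R56", 0))          # (True, True)
```

In this model a non-zero threshold is only correct if completions arrive in issue order, while a wait for zero does not depend on the order at all.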

dongxiao92 commented 2 years ago

Thanks a lot