cloudcores / CuAssembler

An unofficial cuda assembler, for all generations of SASS, hopefully :)
MIT License

Assembling failed (NewModi): Unknown modifiers: ({'2_R.64'}) #18

Open Shaquille-Wu opened 9 months ago

Shaquille-Wu commented 9 months ago

Hi, CuAssembler raises an exception when I run the "TestData" tests. My nvcc is 11.3 and my arch is sm_86. It throws the following exception when I execute "make hack":

`2023-09-16 23:27:38,221 - ERROR - Assertion failed in:

File hack.cudatest.sm_86.cuasm:2876 :

    [B------:R-:W2:-:S01]         /*0050*/                   LDG.E R5, [R22.64] ;

Error when assembling instruction "[B------:R-:W2:-:S01] LDG.E R5, [R22.64] ;":

    Assembling failed (NewModi): Unknown modifiers: ({'2_R.64'})

Known Records:

    LDG.E R6, [R4] ;

    LDG.E R0, [R4] ;

    LDG.E R25, [R18] ;

    @P1 LDG.E R58, [R58] ;

    @P2 LDG.E.128 R36, [R48] ;

    @P2 LDG.E.128 R68, [R76+0x80] ;

    @P1 LDG.E.64 R96, [R96] ;

    @P0 LDG.E.U16 R38, [R38] ;

    @P0 LDG.E.LTC128B R42, [R88] ;

    @!P0 LDG.E.STRONG.GPU R40, [R34] ;

    LDG.U16.CONSTANT R17, [R17] ;

    LDG.U16.CONSTANT R9, [R9+-0x40] ;

    LDG.U8.CONSTANT R17, [R17] ;

    LDG.U16 R13, [R6] ;

    @P2 LDG.E.EL.LTC128B.STRONG.GPU R190, [R188] ;

    @P2 LDG.E.EL.LTC128B.STRONG.GPU R190, [R188] ;`

The code around line 2876 in "hack.cudatest.sm_86.cuasm" is as follows:

`  2871      [B------:R-:W-:Y:S02]         /*0000*/                   IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28] ;

  2872      [B------:R-:W-:-:S01]         /*0010*/                   IMAD.MOV.U32 R22, RZ, RZ, c[0x0][0x160] ;

  2873      [B------:R-:W-:-:S01]         /*0020*/                   ULDC.64 UR36, c[0x0][0x118] ;

  2874      [B------:R-:W-:-:S01]         /*0030*/                   IMAD.MOV.U32 R23, RZ, RZ, c[0x0][0x164] ;

  2875      [B------:R-:W-:Y:S04]         /*0040*/                   IADD3 R1, R1, -0x28, RZ ;

  2876      [B------:R-:W2:-:S01]         /*0050*/                   LDG.E R5, [R22.64] ;

  2877      [B------:R-:W-:-:S02]         /*0060*/                   MOV R2, 32@lo(flist) ;

  2878      [B------:R-:W-:-:S01]         /*0070*/                   MOV R3, 32@hi(flist) ;

  2879      [B------:R-:W0:-:S04]         /*0080*/                   S2R R17, SR_CTAID.X ;

  2880      [B------:R-:W0:-:S01]         /*0090*/                   S2R R0, SR_TID.X ;

  2881      [B------:R-:W-:-:S02]         /*00a0*/                   IMAD.MOV.U32 R18, RZ, RZ, 0x4 ;`

Line 2876 in "hack.cudatest.sm_86.cuasm" is: 2876 [B------:R-:W2:-:S01] /*0050*/ LDG.E R5, [R22.64] ;

It works when I change the arch to sm_60 or sm_75, but it throws the exception above when I change the arch to sm_80 or sm_86. My real hardware is sm_86, so I cannot get past this. I don't know how to fix it. Could you help me fix it, or tell me the reason?

cloudcores commented 9 months ago

You should run make d2h first to generate the cuasm for the corresponding arch, especially for sm8x.

For Ampere (sm8x), the NVIDIA compiler associates every load instruction with a default cache policy (held in a uniform register pair), but this is not displayed in the assembly. To keep semantic equivalence, we need to hack the cubin to show this register pair. The hacked form of LDG on Ampere looks like LDG.E R5, desc[UR4][R22.64] ;. The original LDG.E R5, [R22.64] ; cannot be assembled for sm8x.

Another way around this is to simply remove the .64 here. LDG.E R5, [R22] ; will also work, without using the cache policy at all. But be cautious, since this instruction may not have the same cache performance.
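As a pre-flight check, the loads that trigger this error can be spotted before assembling. The following is only a minimal sketch, not part of CuAssembler: the function name `find_unhacked_loads` and the regular expressions are hypothetical, and they assume the cuasm text uses the operand spellings quoted in this thread. Only LDG is covered, since that is the instruction discussed here.

```python
import re

# An LDG mnemonic with optional modifiers, e.g. "LDG.E" or "LDG.E.128".
_LDG = re.compile(r"\bLDG(\.[\w.]+)?\s")

# A bare [R#.64] or [R#.64+0x..] address that is NOT preceded by desc[UR#]
# and has no uniform-register base (so [R22.64+UR4] is not matched).
_BARE_64 = re.compile(r"(?<!\])\[R\d+\.64(?:\+-?0x[0-9a-fA-F]+)?\]")


def find_unhacked_loads(cuasm_text):
    """Return (line_number, stripped_line) for LDG lines using a bare [R#.64] address."""
    hits = []
    for lineno, line in enumerate(cuasm_text.splitlines(), start=1):
        if _LDG.search(line) and _BARE_64.search(line):
            hits.append((lineno, line.strip()))
    return hits


if __name__ == "__main__":
    sample = "\n".join([
        "[B------:R-:W-:-:S01] /*0020*/ ULDC.64 UR36, c[0x0][0x118] ;",
        "[B------:R-:W2:-:S01] /*0050*/ LDG.E R5, [R22.64] ;",
        "[B------:R-:W2:-:S01] /*0050*/ LDG.E R5, desc[UR36][R22.64] ;",
        "[B------:R-:W2:-:S01] /*0050*/ LDG.E R5, [R22] ;",
    ])
    for lineno, text in find_unhacked_loads(sample):
        print(lineno, text)
```

If this reports any lines, the cuasm was probably produced without the cubin hack, and regenerating it with make d2h is the safer fix than editing operands by hand.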

Shaquille-Wu commented 9 months ago

OK, thank you. You mean we can ignore the ".64" if the load instruction is "LDG.E", is that right? But then why is R22 written as R22.64 in the official CUDA SASS? You mean "LDG.E R5, [R22]" is equal to "LDG.E R5, [R22.64]", is that right? So is "STG.E [R22], R5" equal to "STG.E [R22.64], R5"?

I also have another example: LDG.E R5, [R22.64+0x4]. I think it cannot be equal to LDG.E R5, [R22+0x4]: R22.64 means R22:R23, so "R22 + 0x4" cannot equal "R22.64 + 0x4". How should this case be handled? Could you help me?

cloudcores commented 9 months ago

The 64-bit address is specified by the E in LDG.E, not by R#.64. Actually, I think LDG.E R5, [R22] works the same on sm8x as on sm7x (both with opcode 0x381). The .64 modifier also appears on sm75 with opcode 0x981, as in LDG.E R5, [R22.64+UR4]. This addressing mode is also supported by sm8x.

However, for opcode 0x981, if certain bits are turned on, sm8x may instead use the UR pair for the cache policy (a new feature in sm8x), not as the address base. Thus the instruction is actually LDG.E R5, desc[UR4][R22.64]. Since the default policy UR# is not displayed, normal nvdisasm output just shows LDG.E R5, [R22.64]. This is a disassembly issue in nvdisasm, but since its output is not meant to be assembled back, the NVIDIA team is not willing to fix it.

In short, the options on sm8x are:

LDG.E R5, [R22.64]           // illegal for sm8x
LDG.E R5, [R22.64+URZ]       // opcode 0x981, same as sm7x, no cache policy used
LDG.E R5, [R22]              // opcode 0x381, same as sm7x, no cache policy used
LDG.E R5, desc[UR4][R22.64]  // opcode 0x981, new in sm8x, utilizing cache policy UR pair

Since CuAssembler always tries to follow the original semantics of the compiler output, the last form is recommended. You can run make d2h and check the index of the cache policy UR in the resulting cuasm. CuAsm will check the version of the cubin and hack it if necessary, restoring the missing desc[UR#].
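The four operand forms listed above can be told apart mechanically. The sketch below is an illustration only: `classify_ldg` and its labels are hypothetical names, not CuAssembler API, and the patterns assume the exact operand spellings shown in this thread.

```python
import re

# desc[UR#][R#.64...]: cache-policy UR pair made visible by the cubin hack.
_DESC = re.compile(r"desc\[(UR\d+)\]\[R\d+\.64")
# [R#.64+UR#], [R#.U32+UR#] or [R#.64+URZ]: uniform register used as address base.
_UR_BASE = re.compile(r"(?<!\])\[R\d+\.(?:64|U32)\+UR(?:Z|\d+)\]")
# [R#.64] or [R#.64+0x..]: the form reported as illegal for sm8x.
_BARE_64 = re.compile(r"(?<!\])\[R\d+\.64(?:\+-?0x[0-9a-fA-F]+)?\]")
# [R#] or [R#+0x..]: plain register address with no UR involved.
_PLAIN = re.compile(r"(?<!\])\[R\d+(?:\+-?0x[0-9a-fA-F]+)?\]")


def classify_ldg(instr):
    """Label the address form of one LDG instruction string (labels are made up here)."""
    if _DESC.search(instr):
        return "desc"             # opcode 0x981 with cache policy, per the list above
    if _UR_BASE.search(instr):
        return "ur_base"          # opcode 0x981, UR as address base
    if _BARE_64.search(instr):
        return "illegal_on_sm8x"  # needs desc[UR#] restored before assembling
    if _PLAIN.search(instr):
        return "plain"            # opcode 0x381
    return "unknown"


if __name__ == "__main__":
    for text in (
        "LDG.E R5, [R22.64] ;",
        "LDG.E R5, [R22.64+URZ] ;",
        "LDG.E R5, [R22] ;",
        "LDG.E R5, desc[UR4][R22.64] ;",
    ):
        print(classify_ldg(text), "<-", text)
```

The checks run from most specific to least specific, so a desc[UR#] operand is never mistaken for a bare [R#.64] address.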

cloudcores commented 9 months ago

One more comment: for opcode 0x981, the .64 in R#.64 specifies that 2 regs are used for addressing, while .U32 means 1 reg. This does not apply to LDG.E R5, [R22], where no UR base is used.

LDG.E.SYS R6, [R6.64+UR4] ;
LDG.E.SYS R15, [R15.U32+UR4] ;

Shaquille-Wu commented 9 months ago

Thank you for your clear explanation, but I am still confused about how to check "the index of cache policy UR in the result cuasm":

1) I found this line in "dump.cudatest.sm_86.cuasm" after "make d2h": 3148 [B------:R-:W-:Y:S04] /*0120*/ ULDC.64 UR4, c[0x0][0x118] ; So I think UR4 will be used by something else.
2) Where can I find "the index of cache policy UR"? Which section is it in?
3) I didn't find the keyword "desc" in dump.cudatest.sm_86.cuasm, so what is "desc"?

cloudcores commented 9 months ago

Emmm... Maybe it's a version-related feature? c[0x0][0x118] is just the default cache policy for sm8x. Here is an example dumped by CUDA 11.5:

_Z7argtestPiS_S_:
  .text._Z7argtestPiS_S_:
      [B------:R-:W-:Y:S02]         /*0000*/                   IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28] ;
      [B------:R-:W-:-:S01]         /*0010*/                   IMAD.MOV.U32 R22, RZ, RZ, c[0x0][0x160] ;
      [B------:R-:W-:-:S01]         /*0020*/                   ULDC.64 UR36, c[0x0][0x118] ;
      [B------:R-:W-:-:S01]         /*0030*/                   IMAD.MOV.U32 R23, RZ, RZ, c[0x0][0x164] ;
      [B------:R-:W-:Y:S04]         /*0040*/                   IADD3 R1, R1, -0x28, RZ ;
      [B------:R-:W2:-:S01]         /*0050*/                   LDG.E R5, desc[UR36][R22.64] ;
      [B------:R-:W-:-:S02]         /*0060*/                   MOV R2, 32@lo(flist) ;
      [B------:R-:W-:-:S01]         /*0070*/                   MOV R3, 32@hi(flist) ;

You probably need to update CuAssembler or CUDA?

Shaquille-Wu commented 9 months ago

Yes, I've switched my nvcc from 11.3 to 11.8 and everything is OK now. Thanks for your patience. Good night.