Open Shaquille-Wu opened 9 months ago
You should first run make d2h to generate the cuasm for the corresponding arch, especially for sm8x.
For Ampere (sm8x), the NVIDIA compiler associates a default cache policy (held in a uniform register pair) with every load instruction, but this pair is not displayed in the assembly. To keep semantic equivalence, we need to hack the cubin so that the register pair is shown. The hacked form of LDG on Ampere looks like LDG.E R5, desc[UR4][R22.64] ; while the original LDG.E R5, [R22.64] ; cannot be assembled for sm8x.
Another way to get around this is to simply remove the .64 here. LDG.E R5, [R22] ; will also work, without using the cache policy at all. But be cautious, since this instruction may not perform the same with respect to caching.
OK, thank you. You mean we can drop the ".64" if the load instruction is "LDG.E", is that right? But why is R22 written as R22.64 in the official CUDA SASS? Do you mean "LDG.E R5, [R22]" is equal to "LDG.E R5, [R22.64]"? And is "STG.E [R22], R5" then equal to "STG.E [R22.64], R5"?
I also have another example:
LDG.E R5, [R22.64+0x4]
I think it cannot be equal to LDG.E R5, [R22+0x4], since R22.64 means R22:R23, so "R22 + 0x4" cannot equal "R22.64 + 0x4".
How can I handle this example? Could you help me?
The 64-bit address is specified by the E in LDG.E, not by R#.64. Actually, I think LDG.E R5, [R22] behaves the same in sm8x as in sm7x (both use opcode 0x381). The .64 modifier also appears in sm75 with opcode 0x981, for example LDG.E R5, [R22.64+UR4]. This addressing mode is also supported by sm8x.
However, for opcode 0x981, if certain bits are set, sm8x may instead use the UR pair for the cache policy (a new feature in sm8x) rather than as the address base. In that case the instruction is actually LDG.E R5, desc[UR4][R22.64]. Since the default policy UR# is not displayed, normal nvdisasm output just shows LDG.E R5, [R22.64]. This is a disassembly issue in nvdisasm, but since its output is not meant to be assembled back, the NVIDIA team is not willing to fix it.
In short, the options for sm8x are:
LDG.E R5, [R22.64] // illegal for sm8x
LDG.E R5, [R22.64+URZ] // opcode 0x981, same as sm7x, no cache policy used
LDG.E R5, [R22] // opcode 0x381, same as sm7x, no cache policy used
LDG.E R5, desc[UR4][R22.64] // opcode 0x981, new in sm8x, utilizing cache policy UR pair
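The four forms listed here can be told apart mechanically. The following is a minimal Python sketch, not part of CuAssembler: the function name classify_ldg_address and the returned labels are hypothetical, and the regular expressions only cover the operand shapes quoted in this thread.

```python
import re

# Hypothetical helper, not a CuAssembler API. It only inspects the text of
# the address operand and does not decode any opcode bits.
_DESC = re.compile(r"desc\[UR\d+\]\[R\d+\.64(?:\+[^\]]+)?\]")
_UR_BASE = re.compile(r"\[R\d+\.(?:64|U32)\+(?:UR\d+|URZ)\]")
_BARE_64 = re.compile(r"\[R\d+\.64(?:\+0x[0-9a-fA-F]+)?\]")
_PLAIN = re.compile(r"\[R\d+(?:\+0x[0-9a-fA-F]+)?\]")


def classify_ldg_address(line: str) -> str:
    """Label the address form of a load line as discussed in this thread."""
    if _DESC.search(line):
        return "desc"      # opcode 0x981 with cache policy UR pair (sm8x)
    if _UR_BASE.search(line):
        return "ur_base"   # opcode 0x981 with a UR address base
    if _BARE_64.search(line):
        return "bare_64"   # nvdisasm display form, illegal for sm8x
    if _PLAIN.search(line):
        return "plain"     # opcode 0x381, no cache policy
    return "unknown"


for text in (
    "LDG.E R5, [R22.64] ;",
    "LDG.E R5, [R22.64+URZ] ;",
    "LDG.E R5, [R22] ;",
    "LDG.E R5, desc[UR4][R22.64] ;",
):
    print(text, "->", classify_ldg_address(text))
```

The desc pattern is tested first because the second bracket of desc[UR4][R22.64] would otherwise also match the bare .64 pattern.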
Since CuAssembler always tries to follow the original semantics of the compiler output, the last form is recommended. You can run make d2h to check the index of the cache policy UR in the resulting cuasm. CuAsm checks the version of the cubin and hacks it if necessary, putting the missing desc[UR#] back.
One more comment: in opcode 0x981, the .64 in R#.64 specifies that 2 registers are used for addressing, while .U32 means 1 register. This does not apply to LDG.E R5, [R22], where no UR base is used.
LDG.E.SYS R6, [R6.64+UR4] ;
LDG.E.SYS R15, [R15.U32+UR4] ;
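As an illustration of the modifier rule for the UR-based form, here is a small Python sketch that reports which general registers form the address. The helper name address_regs is hypothetical, and the pairing R6.64 -> R6, R7 follows the "R22.64 means R22:R23" reading used earlier in this thread.

```python
import re

# Matches only the UR-based addressing form, e.g. [R6.64+UR4] or [R15.U32+UR4].
_ADDR = re.compile(r"\[R(\d+)\.(64|U32)\+(UR\d+|URZ)\]")


def address_regs(line):
    """Return (general registers used for the address, UR base), or None.

    None means the line is not in the UR-based form, so the .64/.U32
    modifier rule described in the thread does not apply to it.
    """
    m = _ADDR.search(line)
    if m is None:
        return None
    base = int(m.group(1))
    if m.group(2) == "64":
        regs = [f"R{base}", f"R{base + 1}"]  # 2 registers used for addressing
    else:
        regs = [f"R{base}"]                  # .U32: 1 register
    return regs, m.group(3)


print(address_regs("LDG.E.SYS R6, [R6.64+UR4] ;"))
print(address_regs("LDG.E.SYS R15, [R15.U32+UR4] ;"))
print(address_regs("LDG.E R5, [R22] ;"))
```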
Thank you for your clear explanation.
But I am still confused about how to check "the index of cache policy UR in the result cuasm".
1). After "make d2h", I found this line in "dump.cudatest.sm_86.cuasm":
[B------:R-:W-:Y:S04] /*0120*/ ULDC.64 UR4, c[0x0][0x118] ;
So I think UR4 will be used by something else.
2). Where can I find "the index of cache policy UR"? Which section is it in?
3). I didn't find the keyword "desc" in dump.cudatest.sm_86.cuasm, so what is "desc"?
Hmm... Maybe it's a version-related feature? c[0x0][0x118] is just the default cache policy for sm8x. Here is an example dumped by CUDA 11.5:
_Z7argtestPiS_S_:
.text._Z7argtestPiS_S_:
[B------:R-:W-:Y:S02] /*0000*/ IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28] ;
[B------:R-:W-:-:S01] /*0010*/ IMAD.MOV.U32 R22, RZ, RZ, c[0x0][0x160] ;
[B------:R-:W-:-:S01] /*0020*/ ULDC.64 UR36, c[0x0][0x118] ;
[B------:R-:W-:-:S01] /*0030*/ IMAD.MOV.U32 R23, RZ, RZ, c[0x0][0x164] ;
[B------:R-:W-:Y:S04] /*0040*/ IADD3 R1, R1, -0x28, RZ ;
[B------:R-:W2:-:S01] /*0050*/ LDG.E R5, desc[UR36][R22.64] ;
[B------:R-:W-:-:S02] /*0060*/ MOV R2, 32@lo(flist) ;
[B------:R-:W-:-:S01] /*0070*/ MOV R3, 32@hi(flist) ;
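One way to check which UR pair holds the cache policy in a dump like this is to match each ULDC.64 that loads c[0x0][0x118] against the desc[UR#] operands in the same text. A rough Python sketch, assuming (as stated in this thread) that c[0x0][0x118] is the default cache policy on sm8x; the function name is hypothetical and it reads plain cuasm text rather than using any CuAssembler API:

```python
import re

# UR registers loaded from the default cache policy constant c[0x0][0x118].
ULDC_POLICY = re.compile(r"ULDC\.64\s+(UR\d+),\s*c\[0x0\]\[0x118\]")
# UR registers referenced as a cache policy descriptor.
DESC_USE = re.compile(r"desc\[(UR\d+)\]")


def find_cache_policy_urs(cuasm_text: str):
    """Return (policy_urs, desc_urs) as two sets of UR register names."""
    policy_urs = set(ULDC_POLICY.findall(cuasm_text))
    desc_urs = set(DESC_USE.findall(cuasm_text))
    return policy_urs, desc_urs


# Two lines copied from the CUDA 11.5 dump quoted in this thread.
sample = """
[B------:R-:W-:-:S01] /*0020*/ ULDC.64 UR36, c[0x0][0x118] ;
[B------:R-:W2:-:S01] /*0050*/ LDG.E R5, desc[UR36][R22.64] ;
"""
policy_urs, desc_urs = find_cache_policy_urs(sample)
print("loaded from c[0x0][0x118]:", policy_urs)
print("used as desc[UR#]:", desc_urs)
```

If the second set comes back empty for an sm8x dump, the desc[UR#] operands are probably not being restored, which would match the symptom reported here with the older toolchain.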
Maybe you need to update CuAssembler or CUDA?
Yes, I switched my nvcc from 11.3 to 11.8 and everything is OK now. Thanks for your patience, good night.
Hi, my CuAssembler raises an exception when I test the "TestData". My nvcc is 11.3 and my arch is sm_86. It throws the following exception when I execute "make hack":
2023-09-16 23:27:38,221 - ERROR - Assertion failed in:
The code around line 2876 in "hack.cudatest.sm_86.cuasm" is as follows:
2871 [B------:R-:W-:Y:S02] /*0000*/ IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28] ;
and line 2876 in "hack.cudatest.sm_86.cuasm" is:
2876 [B------:R-:W2:-:S01] /*0050*/ LDG.E R5, [R22.64] ;
It works when I change the arch to sm_60 or sm_75, but it throws the exception above when I change the arch to sm_80 or sm_86. My real hardware is sm_86, so I cannot get past this exception. I don't know how to fix it. Could you help me fix it, or tell me the reason?