No effect after adjusting cuasm

AndrewBoWen666 commented 3 years ago

I used vectorAdd from the CUDA samples for my experiment. Following the steps in the User Guide, I got my cuasm. Here is the critical part of it:

    [----:B------:R-:W-:-:S06]  /*0008*/  MOV R1, c[0x0][0x20] ;
    [----:B------:R-:W0:-:S01]  /*0010*/  S2R R0, SR_CTAID.X ;
    [----:B------:R-:W1:-:S15]  /*0018*/  S2R R2, SR_TID.X ;
    [R---:B0-----:R-:W-:-:S01]  /*0028*/  XMAD.MRG R3, R0.reuse, c[0x0][0x8].H1, RZ ;
    [R---:B-1----:R-:W-:-:S06]  /*0030*/  XMAD R2, R0.reuse, c[0x0][0x8], R2 ;
    [----:B------:R-:W-:-:S06]  /*0038*/  XMAD.PSL.CBCC R0, R0.H1, R3.H1, R2 ;
    [----:B------:R-:W-:Y:S13]  /*0048*/  ISETP.GE.AND P0, PT, R0, c[0x0][0x158], PT ;
    [----:B------:R-:W-:Y:S10]  /*0050*/  NOP ;
    [----:B------:R-:W-:-:S13]  /*0058*/  @P0 EXIT ;
    [R---:B------:R-:W-:-:S01]  /*0068*/  SHL R6, R0.reuse, 0x2 ;
    [----:B------:R-:W-:-:S05]  /*0070*/  SHR R0, R0, 0x1e ;
    [R---:B------:R-:W-:-:S06]  /*0078*/  IADD R4.CC, R6.reuse, c[0x0][0x140] ;
    [R---:B------:R-:W-:-:S02]  /*0088*/  IADD.X R5, R0.reuse, c[0x0][0x144] ;
    [----:B------:R-:W-:-:S00]  /*0090*/  { IADD R2.CC, R6, c[0x0][0x148] ;
    [----:B------:R-:W-:-:S06]  /*0098*/    LDG.E R4, [R4] }
    [----:B------:R-:W-:Y:S02]  /*00a8*/  IADD.X R3, R0, c[0x0][0x14c] ;
    [----:B------:R-:W5:-:S01]  /*00b0*/  LDG.E R2, [R2] ;
    [----:B------:R-:W-:-:S06]  /*00b8*/  IADD R6.CC, R6, c[0x0][0x150] ;
    [----:B------:R-:W-:-:S07]  /*00c8*/  IADD.X R7, R0, c[0x0][0x154] ;
    [----:B-----5:R-:W-:-:S06]  /*00d0*/  FADD R0, R2, R4 ;
    [----:B------:R-:W-:-:S02]  /*00d8*/  FADD R0, R0, 1 ;
    [----:B------:R-:W-:-:S01]  /*00e8*/  STG.E [R6], R0 ;
    [----:B------:R-:W-:Y:S04]  /*00f0*/  NOP ;
    [----:B------:R-:W-:-:S15]  /*00f8*/  EXIT ;

I simply changed [----:B------:R-:W-:-:S02] /*00d8*/ FADD R0, R0, 1 ; to [----:B------:R-:W-:-:S02] /*00d8*/ FADD R0, R0, 2 ; but the results didn't change. What could be the reason?

Here is my compile script:

fatbinary -64 --cicc-cmdline="-ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 " -no-asm "--image3=kind=elf,sm=50,file=new_cudatest.sm_50.cubin" "--image3=kind=ptx,sm=50,file=cudatest.ptx" --embedded-fatbin="cudatest.fatbin.c"

gcc -E -x c++ -D__CUDACC__ -D__NVCC__ -I"/usr/local/cuda-10.2/samples/common/inc" "-I/usr/local/cuda-10.2/bin/../targets/aarch64-linux/include" -D__CUDACC_VER_MAJOR__=10 -D__CUDACC_VER_MINOR__=2 -D__CUDACC_VER_BUILD__=89 -include "cuda_runtime.h" "cudatest.cu" -o "cudatest.cpp4.ii"

cudafe++ --c++14 --gnu_version=70500 --allow_managed --unsigned_chars --m64 --parse_templates --gen_c_file_name "cudatest.cudafe1.cpp" --stub_file_name "cudatest.cudafe1.stub.c" --module_id_file_name "cudatest.module_id" "cudatest.cpp4.ii"

gcc -D__CUDA_ARCH__=500 -c -x c++ -DCUDA_DOUBLE_MATH_FUNCTIONS -I"/usr/local/cuda-10.2/samples/common/inc" "-I/usr/local/cuda-10.2/bin/../targets/aarch64-linux/include" "cudatest.cudafe1.cpp" -o "cudatest.o"

nvlink --arch=sm_50 --register-link-binaries="cudatest_dlink.reg.c" -m64 "-L/usr/local/cuda-10.2/bin/../targets/aarch64-linux/lib/stubs" "-L/usr/local/cuda-10.2/bin/../targets/aarch64-linux/lib" -cpu-arch=AARCH64 "cudatest.o" -lcudadevrt -o "cudatest_dlink.sm_50.cubin"

fatbinary -64 --cicc-cmdline="-ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 " -no-asm -link "--image3=kind=elf,sm=50,file=cudatest_dlink.sm_50.cubin" --embedded-fatbin="cudatest_dlink.fatbin.c"

gcc -c -x c++ -DFATBINFILE="\"cudatest_dlink.fatbin.c\"" -DREGISTERLINKBINARYFILE="\"cudatest_dlink.reg.c\"" -I. -D__NV_EXTRA_INITIALIZATION= -D__NV_EXTRA_FINALIZATION= -D__CUDA_INCLUDE_COMPILER_INTERNAL_HEADERS__ -I"/usr/local/cuda-10.2/samples/common/inc" "-I/usr/local/cuda-10.2/bin/../targets/aarch64-linux/include" -D__CUDACC_VER_MAJOR__=10 -D__CUDACC_VER_MINOR__=2 -D__CUDACC_VER_BUILD__=89 "/usr/local/cuda-10.2/bin/crt/link.stub" -o "cudatest_dlink.o"

g++ -Wl,--start-group "cudatest_dlink.o" "cudatest.o" "-L/usr/local/cuda-10.2/bin/../targets/aarch64-linux/lib/stubs" "-L/usr/local/cuda-10.2/bin/../targets/aarch64-linux/lib" -lcudadevrt -lcudart_static -lrt -lpthread -ldl -Wl,--end-group -o "cudatest"

cloudcores commented 3 years ago

What's the compute capability of your GPU?

The above code only works for SM50. A CUDA program may embed anywhere from zero to several compiled cubins, each for a different SM version. If the running GPU matches one of them, that cubin is loaded and executed. If none matches, the driver may JIT-compile a temporary cubin from the embedded PTX. If no PTX is available, the program simply fails.
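The selection order described above can be sketched as a tiny model. This is only an illustration: `select_image` and its arguments are hypothetical names, and the real driver is more lenient than an exact-match check (for example, it also considers binary-compatible cubins and the PTX version), which this sketch deliberately ignores.

```python
from typing import Optional, Sequence, Tuple


def select_image(device_sm: int,
                 cubin_sms: Sequence[int],
                 has_ptx: bool) -> Tuple[str, Optional[int]]:
    """Simplified model of which embedded image ends up running.

    device_sm -- SM version of the running GPU, e.g. 72 for SM 7.2
    cubin_sms -- SM versions of the cubins embedded in the program
    has_ptx   -- whether PTX is embedded as well

    Assumption: only an exact SM match counts as a cubin match.
    """
    if device_sm in cubin_sms:
        # A matching cubin exists, so it is loaded as-is.
        # Only in this case does an edited cubin take effect.
        return ("cubin", device_sm)
    if has_ptx:
        # No matching cubin: a temporary cubin is JIT-compiled from PTX,
        # so any edits made to the embedded cubin are bypassed.
        return ("jit_ptx", None)
    # Neither a matching cubin nor PTX is available.
    return ("fail", None)


if __name__ == "__main__":
    # The situation in this thread: an sm_50 cubin plus PTX, run on SM 7.2.
    print(select_image(72, [50], True))
```

In the case from this issue the model returns the PTX path, which would explain why editing the sm_50 cubin had no visible effect.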

CuAssembler can only modify the embedded cubin (or an offline cubin that is loaded through the driver API). So if no cubin for your compute capability is embedded in the program, there is no way to hack it offline (unless you can hack the JIT PTX compiler!).

You can use the -gencode option of NVCC to embed a cubin for your SM version. Check the NVCC documentation for more details.

AndrewBoWen666 commented 3 years ago

My GPU's compute capability is 7.2 (it is a Jetson Xavier). I guess this is the reason, as you point out. I'm going to switch to an SM 50 GPU and try again. Many thanks.

cloudcores commented 3 years ago

You don't really need to focus on SM50.

Actually, SM50/SM61/SM75/SM86 are all expected to work (hopefully)~ The others don't, because I haven't generated the instruction dict for them yet. SM50 is rather ancient and now officially deprecated, so you may want to try a newer one...

SM7.2 is quite close to SM7.5, so copying "CuAsm/InsAsmRepos/DefaultInsAsmRepos.sm_75.txt" to "CuAsm/InsAsmRepos/DefaultInsAsmRepos.sm_72.txt" may also work (again...hopefully...).
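If you prefer to script that copy, a minimal sketch could look like the following. The helper name `clone_ins_repo` and the `cuasm_root` argument (the directory that contains the "CuAsm" folder) are assumptions for illustration, not part of CuAssembler, and whether the copied repository actually assembles SM 7.2 code correctly is still untested.

```python
import shutil
from pathlib import Path


def clone_ins_repo(cuasm_root, src_sm="75", dst_sm="72"):
    """Copy DefaultInsAsmRepos.sm_<src>.txt to DefaultInsAsmRepos.sm_<dst>.txt.

    cuasm_root is assumed to be the directory containing "CuAsm".
    Returns the path of the destination file.
    """
    repo_dir = Path(cuasm_root) / "CuAsm" / "InsAsmRepos"
    src = repo_dir / f"DefaultInsAsmRepos.sm_{src_sm}.txt"
    dst = repo_dir / f"DefaultInsAsmRepos.sm_{dst_sm}.txt"

    if not src.is_file():
        raise FileNotFoundError(f"source repository not found: {src}")
    if dst.exists():
        # Never overwrite an existing repository file.
        return dst

    shutil.copyfile(src, dst)
    return dst
```

Usage would be something like `clone_ins_repo("/path/to/CuAssembler")`, where the path is a placeholder for your local checkout.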

Another modification you should make is to pass -gencode=arch=compute_72,code="sm_72,compute_72" to NVCC. Then modify the above script accordingly.
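To avoid typos when switching SM versions, the flag and the matching fatbinary image argument can be generated from one number. This is just a sketch: the function names are hypothetical, the image argument simply follows the pattern used in the script earlier in this thread, and the strings are meant to be pasted into a shell script (the inner quotes are shell quoting).

```python
def gencode_flag(cc: int) -> str:
    """Build the NVCC flag that embeds both a cubin and PTX for SM <cc>."""
    return f'-gencode=arch=compute_{cc},code="sm_{cc},compute_{cc}"'


def fatbinary_elf_image_arg(cc: int, cubin_path: str) -> str:
    """Build the fatbinary image argument for a (possibly edited) cubin.

    Mirrors the "--image3=kind=elf,sm=50,file=..." pattern from the
    script above, with the SM version swapped in.
    """
    return f'"--image3=kind=elf,sm={cc},file={cubin_path}"'


if __name__ == "__main__":
    print(gencode_flag(72))
    # The cubin file name below is a placeholder.
    print(fatbinary_elf_image_arg(72, "new_cudatest.sm_72.cubin"))
```

Other places in the script that mention the SM version (for example the nvlink --arch option) would need the same change.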

AndrewBoWen666 commented 3 years ago

Thanks for your quick response and suggestions. Actually, the only other card I have on hand is a GTX 960M, which is SM 50. Hopefully I can get an 8.x GPU in the near future. :-)

cloudcores / CuAssembler

No effect after adjusting cuasm #2