About instruction count tool

mahmoodn commented 2 years ago

Hi With the histogram tool, I see the following output

----------------------------------------------------------------------------------------------------
inspecting void at::native::reduce_kernel<512, 1, at::native::ReduceOp<float, at::native::func_wrapper_t<float, at::native::MinNanFunctor<float> >, unsigned int, float, 4> >(at::native::ReduceOp<float, at::native::func_wrapper_t<float, at::native::MinNanFunctor<float> >, unsigned int, float, 4>) - num instrs 3008
Instr 0 @ 0x0 (0) - MOV R1, c[0x0][0x28] ;
Instr 1 @ 0x10 (16) - S2R R7, SR_TID.X ;
Instr 2 @ 0x20 (32) - ISETP.NE.AND P0, PT, RZ, c[0x0][0x33c], PT ;
...
Instr 3005 @ 0xbbd0 (48080) - NOP;
Instr 3006 @ 0xbbe0 (48096) - NOP;
Instr 3007 @ 0xbbf0 (48112) - NOP;

kernel 0 - _ZN2at6native13reduce_kernelILi512ELi1ENS0_8ReduceOpIfNS0_14func_wrapper_tIfNS0_13MinNanFunctorIfEEEEjfLi4EEEEEvT1_ - #thread-blocks 96,  kernel instructions 450190, total instructions 450190
inspecting void at::native::reduce_kernel<512, 1, at::native::ReduceOp<float, at::native::func_wrapper_t<float, at::native::MaxNanFunctor<float> >, unsigned int, float, 4> >(at::native::ReduceOp<float, at::native::func_wrapper_t<float, at::native::MaxNanFunctor<float> >, unsigned int, float, 4>) - num instrs 3008
Instr 0 @ 0x0 (0) - MOV R1, c[0x0][0x28] ;
Instr 1 @ 0x10 (16) - S2R R7, SR_TID.X ;
...

I am confused with the order to inspecting and kernel. There is a new line after the last instruction and before kernel 0. So, I wonder which one is kernel 0? _ZN2at6native13... or void at::native::reduce_kernel?

x-y-z commented 2 years ago

Both are kernel 0. inspecting is static instruction info, kernel is dynamic instruction info. inspecting is printed since you enabled verbose mode of the tool.

mahmoodn commented 2 years ago

OK I understand that. However, I want to know why "kernel" is not written for the first one. As you can see below ----- line, it starts with inspecting.

x-y-z commented 2 years ago

If you read the source code of the tool, you will find that inspecting happens before kernel launch and only once for each kernel.

mahmoodn commented 2 years ago

That is correct. According to the source code, the "kernel" statement is written at the end of the kernel instrumentation. Back to my question, it seems that the "empty new line" is not in a good position. I think there should be an empty line after "kernel" and before inspecting. Like this:

Instr 3006 @ 0xbbe0 (48096) - NOP;
Instr 3007 @ 0xbbf0 (48112) - NOP;
kernel 0 - _ZN2at6native13reduce_kernelILi512ELi1ENS0_8ReduceOpIfNS0_14func_wrapper_tIfNS0_13MinNanFunctorIfEEEEjfLi4EEEEEvT1_ - #thread-blocks 96,  kernel instructions 450190, total instructions 450190

inspecting void at::native::reduce_kernel<512, 1, at::native::ReduceOp<float, at::native::func_wrapper_t<float, at::native::MaxNanFunctor<float> >, unsigned int, float, 4> >(at::native::ReduceOp<float, at::native::func_wrapper_t<float, at::native::MaxNanFunctor<float> >, unsigned int, float, 4>) - num instrs 3008
Instr 0 @ 0x0 (0) - MOV R1, c[0x0][0x28] ;

x-y-z commented 2 years ago

I am not sure how you got that new line. By running vectoradd with opcode_hist, I got:

inspecting vecAdd(double*, double*, double*, int) - num instrs 24
Instr 0 @ 0x0 (0) - MOV R1, c[0x0][0x28] ;
Instr 1 @ 0x10 (16) - @!PT SHFL.IDX PT, RZ, RZ, RZ, RZ ;
Instr 2 @ 0x20 (32) - S2R R8, SR_CTAID.X ;
Instr 3 @ 0x30 (48) - S2R R3, SR_TID.X ;
Instr 4 @ 0x40 (64) - IMAD R8, R8, c[0x0][0x0], R3 ;
Instr 5 @ 0x50 (80) - ISETP.GE.AND P0, PT, R8, c[0x0][0x178], PT ;
Instr 6 @ 0x60 (96) - @P0 EXIT ;
Instr 7 @ 0x70 (112) - MOV R9, 0x8 ;
Instr 8 @ 0x80 (128) - IMAD.WIDE R4, R8, R9, c[0x0][0x168] ;
Instr 9 @ 0x90 (144) - IMAD.WIDE R2, R8, R9, c[0x0][0x160] ;
Instr 10 @ 0xa0 (160) - LDG.E.64.SYS R4, [R4] ;
Instr 11 @ 0xb0 (176) - LDG.E.64.SYS R2, [R2] ;
Instr 12 @ 0xc0 (192) - IMAD.WIDE R8, R8, R9, c[0x0][0x170] ;
Instr 13 @ 0xd0 (208) - DADD R6, R4, R2 ;
Instr 14 @ 0xe0 (224) - STG.E.64.SYS [R8], R6 ;
Instr 15 @ 0xf0 (240) - EXIT ;
Instr 16 @ 0x100 (256) - BRA 0x100;
Instr 17 @ 0x110 (272) - NOP;
Instr 18 @ 0x120 (288) - NOP;
Instr 19 @ 0x130 (304) - NOP;
Instr 20 @ 0x140 (320) - NOP;
Instr 21 @ 0x150 (336) - NOP;
Instr 22 @ 0x160 (352) - NOP;
Instr 23 @ 0x170 (368) - NOP;
kernel 0 - vecAdd(double*, double*, double*, int) - #thread-blocks 98,  kernel instructions 50077, total instructions 50077
  DADD = 3125
  EXIT = 6261
  IMAD = 3136
  IMAD.WIDE = 9375
  ISETP.GE.AND = 3136
  LDG.E.64.SYS = 6250
  MOV = 6261
  S2R = 6272
  SHFL.IDX = 3136
  STG.E.64.SYS = 3125
Final sum of Sum_i C[i] = Sum_i (a[i] * b[i]) = 100000.000000

There is no new line between Instr ... and kernel .... It is possible that new line is coming from your application, nothing to do with the tool. And your runs do not have any opcode histogram output. I am not sure if your application was running as expected.

mahmoodn commented 2 years ago

I have to say that I ran instr_count tool and not the histogram. I edited the title.

$ LD_PRELOAD=/home/mahmood/nvbit_release/tools/instr_count/instr_count.so /usr/bin/python3 /home/mahmood/pt/main.py                                                 ------------- NVBit (NVidia Binary Instrumentation Tool v1.5.3) Loaded --------------
NVBit core environment variables (mostly for nvbit-devs):
            NVDISASM = nvdisasm - override default nvdisasm found in PATH
            NOBANNER = 0 - if set, does not print this banner
---------------------------------------------------------------------------------
         INSTR_BEGIN = 0 - Beginning of the instruction interval where to apply instrumentation
           INSTR_END = 4294967295 - End of the instruction interval where to apply instrumentation
      START_GRID_NUM = 0 - Beginning of the kernel gird launch interval where to apply instrumentation
        END_GRID_NUM = 4294967295 - End of the kernel launch interval where to apply instrumentation
    COUNT_WARP_LEVEL = 1 - Count warp level or thread level instructions
    EXCLUDE_PRED_OFF = 0 - Exclude predicated off instruction from count
   ACTIVE_FROM_START = 1 - Start instruction counting from start or wait for cuProfilerStart and cuProfilerStop
       MANGLED_NAMES = 1 - Print kernel names mangled or not
        TOOL_VERBOSE = 1 - Enable verbosity inside the tool
----------------------------------------------------------------------------------------------------
Random Seed:  999
cuda is available
inspecting void at::native::reduce_kernel<512, 1, at::native::ReduceOp<float, at::native::func_wrapper_t<float, at::native::MinNanFunctor<float> >, unsigned int, float, 4> >(at::native::ReduceOp<float, at::native::func_wrapper_t<float, at::native::MinNanFunctor<float> >, unsigned int, float, 4>) - num instrs 3008
Instr 0 @ 0x0 (0) - MOV R1, c[0x0][0x28] ;
Instr 1 @ 0x10 (16) - S2R R7, SR_TID.X ;
Instr 2 @ 0x20 (32) - ISETP.NE.AND P0, PT, RZ, c[0x0][0x33c], PT ;
Instr 3 @ 0x30 (48) - IMAD.MOV.U32 R12, RZ, RZ, c[0x0][0x164] ;
...
Instr 3004 @ 0xbbc0 (48064) - NOP;
Instr 3005 @ 0xbbd0 (48080) - NOP;
Instr 3006 @ 0xbbe0 (48096) - NOP;
Instr 3007 @ 0xbbf0 (48112) - NOP;

kernel 0 - _ZN2at6native13reduce_kernelILi512ELi1ENS0_8ReduceOpIfNS0_14func_wrapper_tIfNS0_13MinNanFunctorIfEEEEjfLi4EEEEEvT1_ - #thread-blocks 96,  kernel instructions 450190, total instructions 450190
inspecting void at::native::reduce_kernel<512, 1, at::native::ReduceOp<float, at::native::func_wrapper_t<float, at::native::MaxNanFunctor<float> >, unsigned int, float, 4> >(at::native::ReduceOp<float, at::native::func_wrapper_t<float, at::native::MaxNanFunctor<float> >, unsigned int, float, 4>) - num instrs 3008
Instr 0 @ 0x0 (0) - MOV R1, c[0x0][0x28] ;
Instr 1 @ 0x10 (16) - S2R R7, SR_TID.X ;
Instr 2 @ 0x20 (32) - ISETP.NE.AND P0, PT, RZ, c[0x0][0x33c], PT ;
...
Instr 3005 @ 0xbbd0 (48080) - NOP;
Instr 3006 @ 0xbbe0 (48096) - NOP;
Instr 3007 @ 0xbbf0 (48112) - NOP;

kernel 1 - _ZN2at6native13reduce_kernelILi512ELi1ENS0_8ReduceOpIfNS0_14func_wrapper_tIfNS0_13MaxNanFunctorIfEEEEjfLi4EEEEEvT1_ - #thread-blocks 96,  kernel instructions 450190, total instructions 900380
inspecting void at::native::vectorized_elementwise_kernel<4, at::native::clamp_kernel_cuda(at::TensorIterator&, c10::Scalar, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#8}::operator()() const::{lambda(float)#1}, at::detail::Array<char*, 2> >(int, at::native::clamp_kernel_cuda(at::TensorIterator&, c10::Scalar, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#8}::operator()() const::{lambda(float)#1}, at::detail::Array<char*, 2>) - num instrs 112
Instr 0 @ 0x0 (0) - IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28] ;
Instr 1 @ 0x10 (16) - S2R R0, SR_CTAID.X ;
Instr 2 @ 0x20 (32) - ULDC.64 UR4, c[0x0][0x118] ;
Instr 3 @ 0x30 (48) - S2R R15, SR_TID.X ;
...

It is not a critical thing though. I now understand that.

NVlabs / NVBit

About instruction count tool #51