Closed mahmoodn closed 2 years ago
Both are kernel 0. inspecting
is static instruction info, kernel
is dynamic instruction info. inspecting
is printed since you enabled verbose mode of the tool.
OK I understand that. However, I want to know why "kernel" is not written for the first one. As you can see below -----
line, it starts with inspecting.
If you read the source code of the tool, you will find that inspecting
happens before kernel launch and only once for each kernel.
That is correct. According to the source code, the "kernel" statement is written at the end of the kernel instrumentation. Back to my question, it seems that the "empty new line" is not in a good position. I think there should be an empty line after "kernel" and before inspecting. Like this:
Instr 3006 @ 0xbbe0 (48096) - NOP;
Instr 3007 @ 0xbbf0 (48112) - NOP;
kernel 0 - _ZN2at6native13reduce_kernelILi512ELi1ENS0_8ReduceOpIfNS0_14func_wrapper_tIfNS0_13MinNanFunctorIfEEEEjfLi4EEEEEvT1_ - #thread-blocks 96, kernel instructions 450190, total instructions 450190
inspecting void at::native::reduce_kernel<512, 1, at::native::ReduceOp<float, at::native::func_wrapper_t<float, at::native::MaxNanFunctor<float> >, unsigned int, float, 4> >(at::native::ReduceOp<float, at::native::func_wrapper_t<float, at::native::MaxNanFunctor<float> >, unsigned int, float, 4>) - num instrs 3008
Instr 0 @ 0x0 (0) - MOV R1, c[0x0][0x28] ;
I am not sure how you got that new line. By running vectoradd with opcode_hist, I got:
inspecting vecAdd(double*, double*, double*, int) - num instrs 24
Instr 0 @ 0x0 (0) - MOV R1, c[0x0][0x28] ;
Instr 1 @ 0x10 (16) - @!PT SHFL.IDX PT, RZ, RZ, RZ, RZ ;
Instr 2 @ 0x20 (32) - S2R R8, SR_CTAID.X ;
Instr 3 @ 0x30 (48) - S2R R3, SR_TID.X ;
Instr 4 @ 0x40 (64) - IMAD R8, R8, c[0x0][0x0], R3 ;
Instr 5 @ 0x50 (80) - ISETP.GE.AND P0, PT, R8, c[0x0][0x178], PT ;
Instr 6 @ 0x60 (96) - @P0 EXIT ;
Instr 7 @ 0x70 (112) - MOV R9, 0x8 ;
Instr 8 @ 0x80 (128) - IMAD.WIDE R4, R8, R9, c[0x0][0x168] ;
Instr 9 @ 0x90 (144) - IMAD.WIDE R2, R8, R9, c[0x0][0x160] ;
Instr 10 @ 0xa0 (160) - LDG.E.64.SYS R4, [R4] ;
Instr 11 @ 0xb0 (176) - LDG.E.64.SYS R2, [R2] ;
Instr 12 @ 0xc0 (192) - IMAD.WIDE R8, R8, R9, c[0x0][0x170] ;
Instr 13 @ 0xd0 (208) - DADD R6, R4, R2 ;
Instr 14 @ 0xe0 (224) - STG.E.64.SYS [R8], R6 ;
Instr 15 @ 0xf0 (240) - EXIT ;
Instr 16 @ 0x100 (256) - BRA 0x100;
Instr 17 @ 0x110 (272) - NOP;
Instr 18 @ 0x120 (288) - NOP;
Instr 19 @ 0x130 (304) - NOP;
Instr 20 @ 0x140 (320) - NOP;
Instr 21 @ 0x150 (336) - NOP;
Instr 22 @ 0x160 (352) - NOP;
Instr 23 @ 0x170 (368) - NOP;
kernel 0 - vecAdd(double*, double*, double*, int) - #thread-blocks 98, kernel instructions 50077, total instructions 50077
DADD = 3125
EXIT = 6261
IMAD = 3136
IMAD.WIDE = 9375
ISETP.GE.AND = 3136
LDG.E.64.SYS = 6250
MOV = 6261
S2R = 6272
SHFL.IDX = 3136
STG.E.64.SYS = 3125
Final sum of Sum_i C[i] = Sum_i (a[i] * b[i]) = 100000.000000
There is no new line between Instr ...
and kernel ...
. It is possible that new line is coming from your application, nothing to do with the tool. And your runs do not have any opcode histogram output. I am not sure if your application was running as expected.
I have to say that I ran instr_count tool and not the histogram. I edited the title.
$ LD_PRELOAD=/home/mahmood/nvbit_release/tools/instr_count/instr_count.so /usr/bin/python3 /home/mahmood/pt/main.py ------------- NVBit (NVidia Binary Instrumentation Tool v1.5.3) Loaded --------------
NVBit core environment variables (mostly for nvbit-devs):
NVDISASM = nvdisasm - override default nvdisasm found in PATH
NOBANNER = 0 - if set, does not print this banner
---------------------------------------------------------------------------------
INSTR_BEGIN = 0 - Beginning of the instruction interval where to apply instrumentation
INSTR_END = 4294967295 - End of the instruction interval where to apply instrumentation
START_GRID_NUM = 0 - Beginning of the kernel gird launch interval where to apply instrumentation
END_GRID_NUM = 4294967295 - End of the kernel launch interval where to apply instrumentation
COUNT_WARP_LEVEL = 1 - Count warp level or thread level instructions
EXCLUDE_PRED_OFF = 0 - Exclude predicated off instruction from count
ACTIVE_FROM_START = 1 - Start instruction counting from start or wait for cuProfilerStart and cuProfilerStop
MANGLED_NAMES = 1 - Print kernel names mangled or not
TOOL_VERBOSE = 1 - Enable verbosity inside the tool
----------------------------------------------------------------------------------------------------
Random Seed: 999
cuda is available
inspecting void at::native::reduce_kernel<512, 1, at::native::ReduceOp<float, at::native::func_wrapper_t<float, at::native::MinNanFunctor<float> >, unsigned int, float, 4> >(at::native::ReduceOp<float, at::native::func_wrapper_t<float, at::native::MinNanFunctor<float> >, unsigned int, float, 4>) - num instrs 3008
Instr 0 @ 0x0 (0) - MOV R1, c[0x0][0x28] ;
Instr 1 @ 0x10 (16) - S2R R7, SR_TID.X ;
Instr 2 @ 0x20 (32) - ISETP.NE.AND P0, PT, RZ, c[0x0][0x33c], PT ;
Instr 3 @ 0x30 (48) - IMAD.MOV.U32 R12, RZ, RZ, c[0x0][0x164] ;
...
Instr 3004 @ 0xbbc0 (48064) - NOP;
Instr 3005 @ 0xbbd0 (48080) - NOP;
Instr 3006 @ 0xbbe0 (48096) - NOP;
Instr 3007 @ 0xbbf0 (48112) - NOP;
kernel 0 - _ZN2at6native13reduce_kernelILi512ELi1ENS0_8ReduceOpIfNS0_14func_wrapper_tIfNS0_13MinNanFunctorIfEEEEjfLi4EEEEEvT1_ - #thread-blocks 96, kernel instructions 450190, total instructions 450190
inspecting void at::native::reduce_kernel<512, 1, at::native::ReduceOp<float, at::native::func_wrapper_t<float, at::native::MaxNanFunctor<float> >, unsigned int, float, 4> >(at::native::ReduceOp<float, at::native::func_wrapper_t<float, at::native::MaxNanFunctor<float> >, unsigned int, float, 4>) - num instrs 3008
Instr 0 @ 0x0 (0) - MOV R1, c[0x0][0x28] ;
Instr 1 @ 0x10 (16) - S2R R7, SR_TID.X ;
Instr 2 @ 0x20 (32) - ISETP.NE.AND P0, PT, RZ, c[0x0][0x33c], PT ;
...
Instr 3005 @ 0xbbd0 (48080) - NOP;
Instr 3006 @ 0xbbe0 (48096) - NOP;
Instr 3007 @ 0xbbf0 (48112) - NOP;
kernel 1 - _ZN2at6native13reduce_kernelILi512ELi1ENS0_8ReduceOpIfNS0_14func_wrapper_tIfNS0_13MaxNanFunctorIfEEEEjfLi4EEEEEvT1_ - #thread-blocks 96, kernel instructions 450190, total instructions 900380
inspecting void at::native::vectorized_elementwise_kernel<4, at::native::clamp_kernel_cuda(at::TensorIterator&, c10::Scalar, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#8}::operator()() const::{lambda(float)#1}, at::detail::Array<char*, 2> >(int, at::native::clamp_kernel_cuda(at::TensorIterator&, c10::Scalar, c10::Scalar)::{lambda()#1}::operator()() const::{lambda()#8}::operator()() const::{lambda(float)#1}, at::detail::Array<char*, 2>) - num instrs 112
Instr 0 @ 0x0 (0) - IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28] ;
Instr 1 @ 0x10 (16) - S2R R0, SR_CTAID.X ;
Instr 2 @ 0x20 (32) - ULDC.64 UR4, c[0x0][0x118] ;
Instr 3 @ 0x30 (48) - S2R R15, SR_TID.X ;
...
It is not a critical thing though. I now understand that.
Hi With the histogram tool, I see the following output
I am confused with the order to
inspecting
andkernel
. There is a new line after the last instruction and before kernel 0. So, I wonder which one is kernel 0?_ZN2at6native13...
orvoid at::native::reduce_kernel
?