NVlabs / NVBit

199 stars 18 forks source link

Mask question #63

Closed rodhuega closed 2 years ago

rodhuega commented 2 years ago

Hi, I'm a user of Nvbit with accel-sim. I have observed a behavior that I don't understand why happens. Here is the kernel code(first kernel of backprop Rodinia2):

__global__ void
bpnn_layerforward_CUDA(float *input_cuda,
                       float *output_hidden_cuda,
                       float *input_hidden_cuda,
                       float *hidden_partial_sum,
                       int in,
                       int hid) 
{
   int by = blockIdx.y;
   int tx = threadIdx.x;
   int ty = threadIdx.y;

   int index =  ( hid + 1 ) * HEIGHT * by + ( hid + 1 ) * ty + tx + 1 + ( hid + 1 ) ;  
   int index_in = HEIGHT * by + ty + 1;

   __shared__ float input_node[HEIGHT];
   __shared__ float weight_matrix[HEIGHT][WIDTH];

   if ( tx == 0 )
   input_node[ty] = input_cuda[index_in] ;

   __syncthreads();
   weight_matrix[ty][tx] = input_hidden_cuda[index];
   __syncthreads();
   weight_matrix[ty][tx] = weight_matrix[ty][tx] * input_node[ty];
   __syncthreads();   

   for ( int i = 1 ; i <= __log2f(HEIGHT) ; i++){
       int power_two = __powf(2, i);
       if( ty % power_two == 0 )
       weight_matrix[ty][tx] = weight_matrix[ty][tx] + weight_matrix[ty + power_two/2][tx];
       __syncthreads();
   }

   input_hidden_cuda[index] = weight_matrix[ty][tx];
   __syncthreads();
   if ( tx == 0 ) {
       hidden_partial_sum[by * hid + ty] = weight_matrix[tx][ty];
   }
}

Below I paste a small piece of the trace code generated by Nvbit:

#BEGIN_TB

thread block = 0,0,0

warp = 0
insts = 137
0000 ffffffff 1 R1 IMAD.MOV.U32 2 R255 R255 0 
0010 00000000 0 SHFL.IDX 4 R255 R255 R255 R255 0 
0020 ffffffff 1 R4 S2R 0 0 
0030 ffffffff 1 R0 S2R 0 0 
0040 ffffffff 1 R5 S2R 0 0 
0050 ffffffff 0 ISETP.NE.AND 2 R4 R255 0 
0060 ffffffff 1 R2 IMAD 2 R5 R0 0 
0070 00010001 1 R9 IMAD.MOV.U32 2 R255 R255 0 
0080 00010001 1 R8 IMAD.WIDE 2 R2 R9 0 
0090 00010001 1 R12 LDG.E.SYS 1 R8 4 2 0x7f7091700004 -140104273231868 0 0 0 0 0 0 0 0 0 0 0 0 0 0 140104273231872 -140104273231872 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
00a0 ffffffff 1 R7 IMAD.MOV.U32 2 R255 R255 0 
00b0 ffffffff 1 R3 IADD3 2 R4 R255 0 
00c0 ffffffff 1 R6 IMAD.MOV.U32 2 R255 R255 0 
00d0 ffffffff 1 R7 IADD3 2 R7 R255 0 
00e0 ffffffff 1 R3 IMAD 3 R2 R7 R3 0 
00f0 ffffffff 1 R7 IMAD.SHL.U32 2 R0 R255 0 
0100 ffffffff 1 R2 IMAD.WIDE 2 R3 R6 0 
0110 ffffffff 1 R13 I2F.U32.RP 0 0 
0120 ffffffff 1 R13 MUFU.RCP 1 R13 0 
0130 ffffffff 1 R15 IADD3 2 R13 R255 0 
0140 ffffffff 1 R9 F2I.FTZ.U32.TRUNC.NTZ 1 R15 0 
0150 00010001 0 STS 2 R7 R12 4 2 0x0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0160 ffffffff 0 NOP 0 0 
0170 ffffffff 0 BAR.SYNC 0 0 

My question is how it is possible that at PC 0010 (16) of the trace, there is a instruction with the mask of threads with full 0s? Before this instruction there is only an IMAD and there isn't any branch or something like this.

PD: It might be useful for you the line where the accelsim tracer prints the mask https://github.com/accel-sim/accel-sim-framework/blob/4c2bf09a79d6b57bb10fe1898700930a5dd5531f/util/tracer_nvbit/tracer_tool/tracer_tool.cu#L529

ovilla commented 2 years ago

Unless some bug in the tool, the only thing I can think of is that the "SHFL.IDX 4 R255 R255 R255 R255 0" is actually "@!PT SHFL.IDX 4 R255 R255 R255 R255 0", so in reality the instruction itself is encoded to be always predicated as false (and somehow there is a bug in nvbit when we return the string text of it). You could check that with "cuobjdump -sass" on your application to confirm it. I will try to take a closer look next time I get around that code, but don't know yet when. Thanks for pointing this out.

rodhuega commented 2 years ago

I don't have access to a V100 that is the GPU that has generated that trace. Do you know if "cuobjdump -sass" would be similar with a GTX1080TI or RTX2080TI? These cards are the one that I have access.

rodhuega commented 2 years ago

Hi, I have compiled to volta even that I don't have a volta card and then checked the cuobjdump as you said. I can't identify any instruction similar to that SHFL. I don't know if I did something wrong. I searched in both kernels (the one that this happens is the _Z22bpnn_layerforward_CUDAPfS_S_S_ii kernel). I attach here two files. The trace generated for nvbit and downloaded from their repo and the cuobjdump generated as you said. I hope that this two files will be helpful. cuobjdumpOut.txt kernel1backproptrace.txt

ovilla commented 2 years ago

Thanks for the added information, we will take a detailed look when possible (not sure when yet).

x-y-z commented 2 years ago

Just checked the sass code of that kernel and confirm that the instruction is @!PT SHFL.IDX 4 R255 R255 R255 R255 0, so the instruction is always predicated off. Close the issue. Please reopen it if you find something different.