NVlabs / NVBit


trt_ampere_h1688cudnn_128x128_ldg8_relu_exp_small_nhwc_linkable_tn_v1 trace error #43

Open leiwen83 opened 3 years ago

leiwen83 commented 3 years ago

Hi,

When I use NVBit to trace a program that contains a TensorRT kernel, it reports an illegal memory access with the sample tools such as instr_count or instr_count_bb.

The kernel name is trt_ampere_h1688cudnn_128x128_ldg8_relu_exp_small_nhwc_linkable_tn_v1. The TensorRT version is 7.2.1, and NVBit is the latest version.

Thx, Lei

x-y-z commented 3 years ago

Do you mind sharing the binary so that we can use it for debugging? Thanks.

leiwen83 commented 3 years ago

Hi,

The error can also be reproduced with onnx2trt. The ONNX file can be found at: https://media.githubusercontent.com/media/onnx/models/master/vision/classification/resnet/model/resnet50-v2-7.onnx

Then run onnx2trt with a tool such as instr_count.so preloaded:

LD_PRELOAD=./instr_count.so onnx2trt -b 1 -d 16 -w 20000000000 resnet50-v2-7.onnx -o 1.trt
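
For anyone trying to reproduce this, the two steps above could be collected into one script. This is only a sketch under stated assumptions: the `run` helper and the `DRY_RUN` and `TOOL` variables are hypothetical names introduced here for illustration and are not part of NVBit or onnx2trt, and it assumes `wget` and `onnx2trt` are on the `PATH`. The URL and the onnx2trt arguments are copied unchanged from this comment. By default the script only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Sketch of the reproduction steps from this thread.
# DRY_RUN, TOOL and run() are hypothetical helpers, not NVBit or onnx2trt features.
set -eu

MODEL_URL="https://media.githubusercontent.com/media/onnx/models/master/vision/classification/resnet/model/resnet50-v2-7.onnx"
MODEL="resnet50-v2-7.onnx"
TOOL="${TOOL:-./instr_count.so}"   # path to the NVBit tool shared object
DRY_RUN="${DRY_RUN:-1}"            # 1 (default): only print commands; 0: execute them

# Print a command, and execute it only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

# 1. Fetch the ONNX model unless it is already present.
[ -f "$MODEL" ] || run wget -O "$MODEL" "$MODEL_URL"

# 2. Build the engine with the NVBit tool preloaded (arguments as given in the thread).
run env LD_PRELOAD="$TOOL" onnx2trt -b 1 -d 16 -w 20000000000 "$MODEL" -o 1.trt
```

Setting `DRY_RUN=0` executes the commands instead of printing them, and `TOOL` can point at a different tool build if the crash should be checked against another sample tool.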

x-y-z commented 3 years ago

I am not able to reproduce it with my local binary.

  1. My binary does not contain trt_ampere_h1688cudnn_128x128_ldg8_relu_exp_small_nhwc_linkable_tn_v1. It only has trt_ampere_h1688cudnn_128x128_ldg8_relu_exp_small_nhwc_tn_v1, trt_ampere_h1688cudnn_128x128_ldg8_relu_exp_medium_nhwc_tn_v1 and trt_ampere_h1688cudnn_128x128_ldg8_relu_exp_large_nhwc_tn_v1.
  2. No crash occurred when instrumenting my trt_ampere_h1688cudnn_128x128_ldg8_relu_exp_* kernels.