NVlabs / nvbitfi

Architecture-level Fault Injection Tool for GPU Application Resilience Evaluation

Total instruction count = 0 while analyzing TensorRT binary #13

Closed zoythum closed 2 years ago

zoythum commented 2 years ago

NVBitFI not able to find instructions

Setup

I followed this guide on how to build an image recognition program in C++ using TensorRT (included in the imageNet library in the code). I can compile and run the program correctly. The apps dictionary in scripts/params.py has been modified like this:

    apps = {
        'recognition': [
            NVBITFI_HOME + '/test-apps/recognition',   # workload directory
            'recognition',                             # binary name
            NVBITFI_HOME + '/test-apps/recognition/',  # path to the binary file
            1,                                         # expected runtime
            ""                                         # additional parameters to the run.sh
        ],
    }

Inside the test-apps folder I created a new folder called recognition that contains the binary called recognition and a Makefile with a target called golden, built like this:

    golden:
    	./recognition $(ARGS) >golden_stdout.txt 2>golden_stderr.txt

It also contains a file called run.sh with the following content:

    #!/bin/bash
    eval ${PRELOAD_FLAG} ${BIN_DIR}/recognition polar_bear.jpg > stdout.txt 2> stderr.txt

and a file called sdc_check.sh, which I simply copied from the official repository.

Lastly, I modified the test.sh script so that it executes the right binary at the beginning, with the following code:

    printf "\nStep 0 (4): Run and collect output without instrumentation\n"
    cd test-apps/recognition/
    make golden ARGS=polar_bear.jpg
    cd $CWD
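A quick way to confirm that the workload directory matches the setup described above is to check for the four files it is said to contain. The following is a minimal sketch, not part of NVBitFI: the helper name `missing_workload_files` and the `EXPECTED_FILES` tuple are hypothetical, and the file list is taken only from this description.

```python
import os

# Files this setup places in the workload directory, besides the binary.
# (Hypothetical constant; the list comes from the description above.)
EXPECTED_FILES = ("Makefile", "run.sh", "sdc_check.sh")


def missing_workload_files(workload_dir, binary_name):
    """Return the expected file names that are absent from workload_dir."""
    expected = (binary_name,) + EXPECTED_FILES
    return [
        name
        for name in expected
        if not os.path.isfile(os.path.join(workload_dir, name))
    ]
```

For the layout in this report it could be called as `missing_workload_files(os.path.join(os.environ["NVBITFI_HOME"], "test-apps", "recognition"), "recognition")`; an empty list means all four files are present. It does not check that the files are executable or correct.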

Problem

During step 1 (2) of the execution script I encounter the following error:

    Creating list for recognition ...
    Something is not right. Total instruction count = 0

It seems that the tool is not able to find any instructions in the binary, and the execution stops.

ovilla commented 2 years ago

Hi,

Which CUDA driver and system are you using exactly?

Is there a chance you can try the same binary with a simpler tool in NVBit (https://github.com/NVlabs/NVBit), for instance the instruction count tool?

To inject the tool you can use LD_PRELOAD or CUDA_INJECTION64_PATH. Thanks.
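As a rough illustration of the two injection options mentioned above, the sketch below sets one of the two environment variables to the path of a tool library and then launches the binary. The helper names `injection_env` and `run_with_tool` are hypothetical, and the path to the tool's shared library is an assumption that depends on where NVBit was built.

```python
import os
import subprocess

SUPPORTED_VARIABLES = ("LD_PRELOAD", "CUDA_INJECTION64_PATH")


def injection_env(tool_path, variable="LD_PRELOAD", base_env=None):
    """Return a copy of the environment with `variable` set to the tool library."""
    if variable not in SUPPORTED_VARIABLES:
        raise ValueError("unsupported injection variable: " + variable)
    env = dict(os.environ if base_env is None else base_env)
    env[variable] = tool_path
    return env


def run_with_tool(binary, args, tool_path, variable="LD_PRELOAD"):
    """Run `binary` with the tool injected and return its exit code."""
    completed = subprocess.run(
        [binary] + list(args),
        env=injection_env(tool_path, variable),
    )
    return completed.returncode
```

For the binary in this thread, a call might look like `run_with_tool("./recognition", ["polar_bear.jpg"], "/path/to/instr_count.so")`, where `/path/to/instr_count.so` is a placeholder for the built tool library.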

zoythum commented 2 years ago

I am using a Jetson TX2 board. The output of nvcc --version is the following (which should identify the CUDA driver, if I'm not mistaken):

    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2021 NVIDIA Corporation
    Built on Sun_Feb_28_22:34:44_PST_2021
    Cuda compilation tools, release 10.2, V10.2.300
    Build cuda_10.2_r440.TC440_70.29663091_0

Running the binary with the instruction count tool makes it crash with the following error:

    recognition: arch/gm10x_hal.cpp:181: void set_imm_relative_control_flow(uint64_t*, int64_t): Assertion !IS_LARGER_THAN_24BIT(imm) failed.
    Aborted (core dumped)
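The assertion text suggests that an immediate value passed to set_imm_relative_control_flow did not fit in 24 bits. The definition of IS_LARGER_THAN_24BIT is not shown in this thread, so the following is only a sketch of what such a range check could look like, assuming a signed two's-complement field; `fits_signed` is a hypothetical helper, not NVBit code.

```python
def fits_signed(value, bits=24):
    """True if `value` is representable as a signed two's-complement
    integer of the given width (assumed interpretation, see above)."""
    low = -(1 << (bits - 1))       # -8388608 for 24 bits
    high = (1 << (bits - 1)) - 1   # 8388607 for 24 bits
    return low <= value <= high
```

Under that assumption, an offset of 8388607 would pass the check and 8388608 would not.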

ovilla commented 2 years ago

Thanks for the information, this narrows down the issue a lot. It is an NVBit problem. We will look into it.

ovilla commented 2 years ago

I forgot to mention, this issue should appear only in Maxwell/Pascal GPUs. If you have access to a Jetson Xavier you should not see this issue.

sivahari commented 2 years ago

The latest NVBit release (version 1.5.5) should fix this issue: https://github.com/NVlabs/NVBit/releases