NVBitFI: An Architecture-level Fault Injection Tool for GPU Application Resilience Evaluations

NVBitFI provides an automated framework for running error injection campaigns to evaluate GPU application resilience. NVBitFI builds on the NVIDIA Binary Instrumentation Tool (NVBit), a research prototype of a dynamic binary instrumentation library for NVIDIA GPUs. NVBitFI offers functionality similar to that of a prior tool called SASSIFI.

Please refer to our NVBitFI paper for additional details about the tool and some experimental results.

Summary of NVBitFI's capabilities

NVBitFI injects errors into the destination register values of a dynamic thread-instruction by instrumenting instructions after they are executed. A dynamic instruction is selected at random from all dynamic kernels of a program for error injection. Only one error is injected per run. This mode was referred to as IOV in SASSIFI. As of now (4/1/2020), NVBitFI allows us to select the following instruction groups to study how errors in them can propagate to the application output.

NVBitFI can be extended to include custom instruction groups. See below for more details.
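The random site selection described above can be illustrated with a short sketch. It assumes a profile that records one dynamic instruction count per dynamic kernel. The function name, the `(label, count)` input format, and the uniform weighting over all dynamic instructions are assumptions made for illustration, not NVBitFI's actual code or file format.

```python
import bisect
import itertools
import random


def pick_injection_site(kernel_inst_counts, rng=random):
    """Pick one dynamic instruction uniformly at random across all kernels.

    kernel_inst_counts: list of (kernel_label, dynamic_instruction_count)
    pairs, one per dynamic kernel (hypothetical format).
    Returns (kernel_label, instruction_index_within_that_kernel).
    """
    counts = [count for _, count in kernel_inst_counts]
    total = sum(counts)
    if total <= 0:
        raise ValueError("profile contains no dynamic instructions")
    # cumulative[i] is the number of instructions in kernels 0..i.
    cumulative = list(itertools.accumulate(counts))
    # Global index in [0, total): every dynamic instruction is equally likely.
    global_index = rng.randrange(total)
    # First kernel whose running total exceeds the global index.
    k = bisect.bisect_right(cumulative, global_index)
    start = cumulative[k] - counts[k]
    return kernel_inst_counts[k][0], global_index - start
```

Because the draw is over the concatenated instruction stream, kernels that execute more instructions are chosen proportionally more often, and kernels with a zero count are never selected.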

For a selected destination register, the following errors can be injected.

New bit-flip models can be added by modifying common/arch.h, injector/inject_func.cu, and scripts/params.py.
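As a rough illustration of what a bit-flip model does to a register value, the sketch below XORs a 32-bit value with a mask. It is written in Python for readability and is not the code in injector/inject_func.cu. The 32-bit register width and both function names are assumptions.

```python
import random

REGISTER_BITS = 32          # assumed register width for this sketch
REGISTER_MASK = (1 << REGISTER_BITS) - 1


def apply_bit_flip_mask(value, mask):
    """Flip every bit of `value` that is set in `mask` (XOR), keeping 32 bits."""
    return (value ^ mask) & REGISTER_MASK


def flip_single_bit(value, bit=None, rng=random):
    """Flip one bit of a 32-bit value; a random bit is chosen if none is given."""
    if bit is None:
        bit = rng.randrange(REGISTER_BITS)
    if not 0 <= bit < REGISTER_BITS:
        raise ValueError("bit position out of range")
    return apply_bit_flip_mask(value, 1 << bit)
```

Expressing a model as a mask generator keeps the injection step itself (the XOR) unchanged when a new model is added.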

Prerequisites

Getting started on a Linux x86_64 PC

The following commands are tested on an x86 system with Ubuntu 18.04 using CUDA-11.2 and NVBit version 1.5.5.

# NVBit-v1.5.5
wget https://github.com/NVlabs/NVBit/releases/download/1.5.5/nvbit-Linux-x86_64-1.5.5.tar.bz2
tar xvfj nvbit-Linux-x86_64-1.5.5.tar.bz2
cd nvbit_release/tools/

# NVBitFI 
git clone https://github.com/NVlabs/nvbitfi
cd nvbitfi
find . -name "*.sh" | xargs chmod +x
./test.sh

On an ARM-based device (e.g., Jetson Nano)

# NVBit-1.5.5
wget https://github.com/NVlabs/NVBit/releases/download/1.5.5/nvbit-Linux-aarch64-1.5.5.tar.bz2
tar xvfj nvbit-Linux-aarch64-1.5.5.tar.bz2
cd nvbit_release/tools/

# NVBitFI 
git clone https://github.com/NVlabs/nvbitfi
cd nvbitfi
find . -name "*.sh" | xargs chmod +x
./test.sh

If these commands complete without errors, you have just completed your first error injection campaign using NVBitFI. The printed output should say where the results are stored. A summary of the campaign is stored in a tab-separated file, results_*NVbitFI_details.tsv, which can be opened in a spreadsheet program (e.g., Excel) for visualization and analysis.
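If a spreadsheet is not convenient, the summary file can also be read from a script. The sketch below finds the first file matching the name pattern above and loads its rows. The helper name and the `results_dir` argument are hypothetical, and no column layout is assumed: rows are returned as plain lists of strings.

```python
import csv
import glob
import os


def load_campaign_summary(results_dir="."):
    """Return (path, rows) for the first matching summary file in results_dir."""
    pattern = os.path.join(results_dir, "results_*NVbitFI_details.tsv")
    paths = sorted(glob.glob(pattern))
    if not paths:
        raise FileNotFoundError("no file matches " + pattern)
    with open(paths[0], newline="") as handle:
        # Each row becomes a list of strings, split on tab characters.
        rows = list(csv.reader(handle, delimiter="\t"))
    return paths[0], rows
```

Call it with the directory that test.sh reports as the results location, then inspect the first row to see what the file actually contains.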

Detailed steps

There are three main steps to run NVBitFI (Steps 1 to 3 below), preceded by a setup step (Step 0). We provide a sample script (test.sh) that automates nearly all of these steps.

Step 0: Setup

Step 1: Profile and generate injection list

Step 2: Run the error injection campaign

Run scripts/run_injections.py to launch the error injection campaign. In standalone mode, this script performs one injection run at a time. If you plan to run multiple injection runs in parallel, take special care to ensure that the output file is not clobbered. Running multiple jobs on a multi-GPU system is currently supported; see scripts/run_one_injection.py for more details.
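One generic way to avoid clobbered output when runs execute in parallel is to give every run its own output path. The sketch below shows that idea only. It does not describe how scripts/run_one_injection.py works, and the function names and directory layout are hypothetical.

```python
import os


def assign_runs_to_gpus(num_runs, num_gpus):
    """Distribute run indices over GPUs round-robin: {gpu_id: [run indices]}."""
    if num_gpus <= 0:
        raise ValueError("need at least one GPU")
    return {gpu: list(range(gpu, num_runs, num_gpus)) for gpu in range(num_gpus)}


def job_output_path(base_dir, gpu_id, run_index):
    """Build an output path that is unique for each (GPU, run) pair."""
    return os.path.join(base_dir, "gpu%d" % gpu_id, "run%05d.txt" % run_index)
```

Since no two runs share a path, concurrent jobs never write to the same file, and the per-run files can be merged after the campaign finishes.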

Tip: Perform a few dummy injections before proceeding with the full injection campaign by setting the DUMMY flag in injector/Makefile. Setting this flag lets you exercise most of the SASSI handler code while skipping the error injection itself. This helps confirm that you are not seeing crashes or SDCs (silent data corruptions) that should not occur.

Step 3: Parse the results

Use the scripts/parse_results.py script to parse the results. This script generates three tab-separated values (tsv) files. The first file shows the fraction of executed instructions for different instruction groups and opcodes. The second file shows the outcomes of the error injections. Refer to CAT_STR in scripts/params.py for the list of error outcome categories. The third file shows the average runtime for the injection runs for different applications and selected error models. These files can be opened using a spreadsheet program (e.g., Excel) for plotting and analysis.
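For a quick sanity check outside a spreadsheet, outcome fractions can be computed from a list of per-run outcome labels. This is a minimal sketch and not part of scripts/parse_results.py. The function name is hypothetical, and the labels in the usage note are placeholders: the real category names are those listed in CAT_STR in scripts/params.py.

```python
from collections import Counter


def outcome_fractions(outcomes):
    """Map each outcome label to its share of all injection runs."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}
```

For example, `outcome_fractions(["Masked", "Masked", "SDC", "DUE"])` reports a share of 0.5 for the first placeholder label and 0.25 for each of the other two.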

NVBitFI vs. SASSIFI

NVBitFI benefits from the features offered by NVBit. It can run on newer GPUs (e.g., Turing and Volta GPUs). Unlike SASSIFI, it also works with pre-compiled libraries. NVBitFI is expected to be faster than SASSIFI because it instruments only a single chosen dynamic kernel for each injection run, whereas SASSIFI, as it was implemented, instrumented all dynamic kernels. As of now (April 14, 2020), NVBitFI implements a subset of the error injection models, and we may expand this over time (users are more than welcome to contribute).

Contributing to NVBitFI

If you are interested in contributing to NVBitFI, please open a pull request and complete the Contributor License Agreement.