NVlabs / nvbitfi

Architecture-level Fault Injection Tool for GPU Application Resilience Evaluation

Can't detect any instruction #1

Open Andre4102 opened 3 years ago

Andre4102 commented 3 years ago

I wrote a CUDA implementation of convolution in order to do fault injection on a CNN, but the injector does not detect any instructions. I use PyTorch C++ as the framework. When I open the file stdout.txt in the directory where I store the executable, I get the following output: inspecting forward_cuda_kernel(at::GenericPackedTensorAccessor<float, 4ul, at::RestrictPtrTraits, int>, at::GenericPackedTensorAccessor<float, 4ul, at::RestrictPtrTraits, int>, at::GenericPackedTensorAccessor<float, 1ul, at::RestrictPtrTraits, int>, at::GenericPackedTensorAccessor<float, 4ul, at::RestrictPtrTraits, int>, int, int, int, int, int, int) - num instrs 456 followed by a very long list of instructions (more than 1300), terminating with NVBit-igprofile; ERROR FAIL in kernel execution!! What does this error mean, and how can I solve it?

sivahari commented 3 years ago

At the time of instrumentation, the details are printed (https://github.com/NVlabs/nvbitfi/blob/master/profiler/profiler.cu#L107, https://github.com/NVlabs/nvbitfi/blob/master/profiler/profiler.cu#L122), based on the TOOL_VERBOSE flag (https://github.com/NVlabs/nvbitfi/blob/master/profiler/profiler.cu#L73).

The error message comes from here: https://github.com/NVlabs/nvbitfi/blob/master/profiler/profiler.cu#L178. It looks like the instrumented kernel crashed during execution. One possibility is that the kernel uses an instruction that nvbitfi does not support. You can list the unique opcodes printed during instrumentation and check whether any of them is missing from enum InstructionType (https://github.com/NVlabs/nvbitfi/blob/master/common/arch.h).
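The opcode comparison described above can be sketched in a few lines of C++. This is only an illustration under stated assumptions: the opcode strings have already been extracted from the verbose log (the exact log format is not shown in this thread), the set of supported names is copied by hand from enum InstructionType in arch.h, and the helper names base_opcode and unsupported_opcodes are hypothetical, not part of nvbitfi. It also assumes that modifiers after the first '.' can be ignored when matching against the enum.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical helper: strip modifiers so that, for example,
// "IMAD.MOV.U32" is compared as "IMAD". If there is no '.',
// the whole string is returned unchanged.
std::string base_opcode(const std::string& opcode) {
    return opcode.substr(0, opcode.find('.'));
}

// Hypothetical helper: return the unique base opcodes found in `seen`
// that do not appear in `supported`.
// `seen`      : opcode strings taken from the instrumentation output
// `supported` : names copied by hand from enum InstructionType
std::set<std::string> unsupported_opcodes(
        const std::vector<std::string>& seen,
        const std::set<std::string>& supported) {
    std::set<std::string> missing;
    for (const std::string& op : seen) {
        const std::string base = base_opcode(op);
        if (!base.empty() && supported.count(base) == 0) {
            missing.insert(base);  // std::set keeps entries unique and sorted
        }
    }
    return missing;
}
```

Calling unsupported_opcodes with the opcodes from stdout.txt and the enum names returns the candidates worth reporting; an empty result suggests the crash has a different cause.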

The first thing to check is whether the kernel succeeds without instrumentation, by adding a check like this one to your CUDA code right after the kernel launch: https://github.com/NVlabs/nvbitfi/blob/master/profiler/profiler.cu#L177.

Andre4102 commented 3 years ago

I checked as you suggested, and the kernel succeeds without instrumentation, so I guess the operation is not supported by NVBitFI. Is there a way to see all the kernel instructions called during execution, including the one(s) that make the program fail?

sivahari commented 3 years ago

With TOOL_VERBOSE=1, all the instructions in the kernel should be printed (i->print() here: https://github.com/NVlabs/nvbitfi/blob/master/profiler/profiler.cu#L122). You can then compare the printed list with enum InstructionType (https://github.com/NVlabs/nvbitfi/blob/master/common/arch.h). Feel free to send me more details via email (e.g., the CUDA code, if you feel comfortable sharing it, and the instructions printed using the steps above) and I'll take a look.

One more thing to try is to check that you can instrument and execute the kernel with one of the provided NVBit tools (e.g., the operation histogram).

Andre4102 commented 3 years ago

I sent you an email with the code, @sivahari.