NVlabs / nvbitfi

Architecture-level Fault Injection Tool for GPU Application Resilience Evaluation

Error whilst using test.sh #8

Open AlexGrazioli opened 3 years ago

AlexGrazioli commented 3 years ago

Hello everyone, I am a computer science student about to graduate, and I am trying to learn more about the fault injection mechanism through the nvbitfi tool. After setting up the tools (nvbit and nvbitfi), I tried to run the bundled test "test.sh" to see whether everything worked. Unfortunately it did not, at least for me. During Step 1 (1): Profile the application, the script "run.sh" in nvbitfi/test-apps/simple_add did not create the log file needed to proceed. I checked the script and edited it so that it would create the needed log .txt file. After that, during Step 1 (2): Generate injection list for instruction-level error injections, the only output is "Something is not right. Total instruction count = 0". From this point on I have not been able to work out the cause, and the problem still persists.

I ran the tools on:

DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.2 LTS"
NAME="Ubuntu"
VERSION="20.04.2 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.2 LTS"
VERSION_ID="20.04"

with kernel version Linux 5.8.0-55-generic x86_64, on an Acer laptop with an NVIDIA 920M graphics card and an Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz. I made sure to meet all the requirements for correct use of the tool.
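To check the profiling step described above in isolation, a minimal sketch like the following may help. It only reports whether a profile log exists and contains anything at all. The path in the example is hypothetical and not taken from nvbitfi, so substitute whatever file your profiling step is supposed to write. An empty or missing log is one possible reason for a zero instruction count in the next step, but that link is an assumption and is not confirmed in this thread.

```python
from pathlib import Path


def profile_log_status(path):
    """Classify a profile log as 'missing', 'empty', or 'ok'."""
    log = Path(path)
    if not log.is_file():
        return "missing"
    # A whitespace-only file is treated as empty: it holds no profile records.
    if not log.read_text(errors="replace").strip():
        return "empty"
    return "ok"


if __name__ == "__main__":
    # Hypothetical location (an assumption); use the file your setup writes.
    print(profile_log_status("test-apps/simple_add/profile_log.txt"))
```

If this reports "missing" or "empty" right after the profiling step, the problem is in profiling itself rather than in injection list generation.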

I wonder if someone could help me fix the problem, so I can continue to learn and study this tool.

Thank you in advance. Best regards

Msabih commented 2 years ago

I have exactly the same problem.

TylerMG26 commented 2 years ago

I am also testing this on 20.04 and having the same issue. I have used this tool before without running into it, so I am trying to determine what is different.

TylerMG26 commented 2 years ago

I have narrowed down the issue a bit.

When running outside the test.sh script with LD_PRELOAD=//profiler.so /simple_add, I get the following output:

NVBit (NVidia Binary Instrumentation Tool v1.5.5) Loaded
NVBit core environment variables (mostly for nvbit-devs):
NVDISASM = nvdisasm - override default nvdisasm found in PATH
NOBANNER = 0 - if set, does not print this banner

Device 0 (Xavier) is being used memory: 6.6947 GB ECC off 6 SMs x1109000

CTAs=10, nreps=10, threads/CTA=1024

ASSERT FAIL: nvbit_imp.cpp:1628:std::vector<CUfunc_st> Nvbit::get_related_functions(CUcontext, CUfunction): FAIL !(function)

I also get the same error when trying other NVBit tools such as instr_count.
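The standalone run above can be scripted so the same check is easy to repeat across NVBit tools and JetPack versions. This is a rough sketch, not part of NVBit or NVBitFI: the helper names are made up for illustration, and the library and application paths you pass in are placeholders for your own build locations. It preloads a tool library through LD_PRELOAD, runs the application, and collects any output lines containing "ASSERT FAIL".

```python
import os
import subprocess


def preload_env(tool_lib, base_env=None):
    """Return a copy of the environment with LD_PRELOAD set to tool_lib."""
    env = dict(os.environ if base_env is None else base_env)
    env["LD_PRELOAD"] = tool_lib
    return env


def assert_failures(output):
    """Return the output lines that report an assertion failure."""
    return [line.strip() for line in output.splitlines() if "ASSERT FAIL" in line]


def run_with_tool(tool_lib, app_cmd):
    """Run app_cmd with tool_lib preloaded; return (exit code, failure lines)."""
    proc = subprocess.run(
        app_cmd,
        env=preload_env(tool_lib),
        capture_output=True,
        text=True,
    )
    # The assertion may be printed on either stream, so scan both.
    return proc.returncode, assert_failures(proc.stdout + proc.stderr)
```

Calling run_with_tool once with the profiler library and once with another tool's library (both paths being your own) makes it easy to see whether each tool produces the same failure line.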

I am running JetPack 5.0 on the Jetson Xavier NX. I have verified that I don't have this issue when running NVBitFI on JetPack 4.5/4.6 with NVBit 1.5.5.

At first glance it seems to be related to NVBit on Ubuntu 20.04. I have also tried rolling back to an older version of NVBit (1.5.2) and I get the same error.