NVlabs / nvbitfi

Architecture-level Fault Injection Tool for GPU Application Resilience Evaluation

jetson-nano Kernel failure #3

Closed Frinhard closed 3 years ago

Frinhard commented 3 years ago

Hi there! I am using NVBitFI on the Jetson Nano, and when running the example I sometimes get "Outcome: Pot DUE: SDC but Kernel Error". The stdout.txt file says "ERROR FAIL in kernel execution (unspecified launch failure);", and I found it odd that the device is identified as a Tegra X1: "Device 0 (NVIDIA Tegra X1) is being used". I tried changing the architecture to sm_53, but the problem persists. Is this error caused by the injection? I am not sure. It is neither a DUE nor an SDC, so how should we classify it? Could it be related to the GPU also being used to run X on Linux? I am connecting over ssh, though.

Thanks!

sivahari commented 3 years ago

We discussed this via email. For others, the following steps helped resolve the problem: (1) Ensure that the application runs without errors on the device when no injection is performed (some applications that run on larger GPUs may not run on a smaller device because of their DRAM requirements). (2) Ensure that you see no crashes when you run with DUMMY=1 (https://github.com/NVlabs/nvbitfi/blob/master/injector/Makefile#L16). Ideally, every run should be classified as Masked, since no error is injected. In this case, reducing the problem size helped.
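As a rough aid for both checks, the captured stdout of a fault-free or DUMMY=1 run can be scanned for the kernel failure message quoted earlier in this thread. This is only a minimal sketch: `find_kernel_errors` is a hypothetical helper, not part of NVBitFI, and the marker string is simply the text reported above.

```python
# Hypothetical helper (not part of NVBitFI): scan captured stdout for the
# kernel failure message quoted in this thread.
KERNEL_ERROR_MARKER = "ERROR FAIL in kernel execution"


def find_kernel_errors(stdout_text):
    """Return the lines of a captured stdout that report a kernel failure."""
    return [
        line.strip()
        for line in stdout_text.splitlines()
        if KERNEL_ERROR_MARKER in line
    ]


# Sample text assembled from the messages quoted in this issue.
sample = (
    "Device 0 (NVIDIA Tegra X1) is being used\n"
    "ERROR FAIL in kernel execution (unspecified launch failure);\n"
)

for line in find_kernel_errors(sample):
    print("kernel error:", line)
```

If a run without injection already produces such a line, the problem is likely in the application or its problem size on this device rather than in the injected fault.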

sergicuen commented 3 years ago

Hi Paolo and Sivahari, I ran simple_add on the Jetson with the same results: the device is reported as (NVIDIA Tegra X1), and some runs give "Outcome: Pot DUE: SDC but Kernel Error". Following the suggestions above, I tried DUMMY=1 and everything is OK. 1) I guess the fault injection is correct, but the classification of the fault is a little confusing. I think there was a kernel crash, after which the execution finished with erroneous output, so for me it is a DUE. Could you tell me how to classify this fault? 2) I found the following in the NVIDIA manuals: "Note that Tegra X1 Technical Reference Manual applies to Jetson Nano as well as NVIDIA® Jetson™ TX1". So I guess the identifier is OK.
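To make the question concrete, the outcome seems to combine two separate observations: whether a kernel error was reported, and whether the final output matches the fault-free one. The sketch below only illustrates that reasoning. It is not NVBitFI's actual classification code; `classify_run` and its parameters are hypothetical, and the label for a kernel error with matching output is an assumption.

```python
# Illustrative only: this is NOT NVBitFI's classifier. The function name,
# its parameters, and the decision order are assumptions for discussion.
def classify_run(crashed_or_hung, kernel_error, output_matches_golden):
    """Map three observations about one injection run to an outcome label."""
    if crashed_or_hung:
        # The application did not finish: a detected unrecoverable error.
        return "DUE"
    if kernel_error and not output_matches_golden:
        # The case discussed in this thread: the run finished with wrong
        # output, and a kernel error was also reported in stdout.
        return "Pot DUE: SDC but Kernel Error"
    if kernel_error:
        # Assumed label: kernel error reported, but the output still matches.
        return "Pot DUE: kernel error, output matches"
    if not output_matches_golden:
        return "SDC"
    return "Masked"


# The situation reported above: finished, kernel error, wrong output.
print(classify_run(False, True, False))
```

Seen this way, the label says that the output is corrupted as in an SDC, but that the kernel error could have been detected by an application that checks for it, which is why it is reported as a potential DUE.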