NVlabs / nvbitfi

Architecture-level Fault Injection Tool for GPU Application Resilience Evaluation

Expected runtime definition #12

Closed sergicuen closed 2 years ago

sergicuen commented 2 years ago

Hi all, I am trying to estimate the Expected_runtime used to define the Timeout fault category. The Expected_runtime measured during the normal_execution of the application is usually much shorter than the runtime measured by the tool when injecting faults (I guess the nvbitfi instrumentation causes the delay).

E.g.: Inj_count=1, App=mEle_Sz256_Blk32, Mode=inst_value, Group=7, EM=0, Time=83.747101, Outcome: Masked: other reasons

So using the normal_execution time in the apps list of params.py produces a lot of Timeouts. The other option is to use the maximum Time obtained in a DUMMY campaign, but in that case the results contain no Timeouts at all. What is the right way to estimate the Expected runtime? Thank you in advance.
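
For reference, the value in question is the per-app expected runtime set in the apps dictionary of params.py. A minimal sketch of such an entry (the field layout and values here are illustrative, not the exact upstream format):

```python
import os

# Illustrative sketch of an apps entry in scripts/params.py;
# the exact field order in the upstream file may differ.
NVBITFI_HOME = os.environ.get('NVBITFI_HOME', '.')

apps = {
    'mEle_Sz256_Blk32': [
        NVBITFI_HOME + '/test-apps/mEle',  # workload directory
        'mEle_Sz256_Blk32',                # binary name
        NVBITFI_HOME + '/test-apps/mEle',  # run directory
        5,                                 # expected runtime in seconds (the value in question)
        '',                                # additional command-line arguments
    ],
}
```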

sivahari commented 2 years ago

The application runtime with the error injection tool will depend on the kernel being instrumented. The best approach is the one you suggested: profile a dummy error-injection campaign (using https://github.com/NVlabs/nvbitfi/blob/master/injector/Makefile#L16). The scripts currently assume that the runtime with instrumentation (of one kernel in the application) will be 10x the uninstrumented runtime; see https://github.com/NVlabs/nvbitfi/blob/master/scripts/params.py#L31. If you use the profiled runtime from an injection campaign instead, you may want to lower this threshold to 2x (i.e., a hang is detected if the application runs 2x longer than anticipated). A sketch of the resulting check follows.
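
In other words, the hang check boils down to comparing the elapsed time against the expected runtime multiplied by that threshold. A minimal sketch of the idea (the constant and helper below are illustrative names, not the actual nvbitfi code):

```python
import subprocess

# Illustrative multiplier: ~10x when the expected runtime is the
# uninstrumented baseline, ~2x when it was profiled from a dummy
# injection campaign (i.e., with instrumentation already included).
TIMEOUT_THRESHOLD = 10

def run_with_hang_detection(cmd, expected_runtime_s):
    """Run one injection; classify it as a Timeout if it overruns the budget."""
    budget_s = expected_runtime_s * TIMEOUT_THRESHOLD
    try:
        subprocess.run(cmd, shell=True, timeout=budget_s, check=False)
        return 'completed'
    except subprocess.TimeoutExpired:
        return 'Timeout'  # treated as a hang
```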