NVlabs / nvbitfi

Architecture-level Fault Injection Tool for GPU Application Resilience Evaluation

Error 139 and multi generation devices #9

Closed nnaron closed 3 years ago

nnaron commented 3 years ago

Hi,

Today I tried to install nvbitfi, but I have some issues.

1) Here is the output of running test.sh:

Step 0 (2): Setting environment variables

Step 0 (3): Build the nvbitfi injector and profiler tools
nvcc -ccbin=`which gcc` -D_FORCE_INLINES -I../../../core -I../common -maxrregcount=16 -Xptxas -astoolspatch --keep-device-functions -arch=sm_35 -DDUMMY=0 -Xcompiler -Wall -Xcompiler -fPIC -c inject_funcs.cu -o inject_funcs.o
nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
nvcc -ccbin=`which gcc` -D_FORCE_INLINES -dc -c -std=c++11 -I../../../core -I../common -Xptxas -cloning=no -Xcompiler -Wall -arch=sm_35 -O3 -Xcompiler -fPIC injector.cu -o injector.o
nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
nvcc -ccbin=`which gcc` -D_FORCE_INLINES -arch=sm_35 -O3 inject_funcs.o injector.o -L../../../core -lnvbit -L/usr/local/cuda-11.4/lib64 -lcuda -lcudart_static -shared -o injector.so
nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
nvcc -ccbin=`which gcc` -D_FORCE_INLINES -I../../../core -I../common -maxrregcount=16 -Xptxas -astoolspatch --keep-device-functions -arch=sm_35 -Xcompiler -Wall -Xcompiler -fPIC -c inject_funcs.cu -o inject_funcs.o
nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
nvcc -ccbin=`which gcc` -D_FORCE_INLINES -dc -c -std=c++11 -I../../../core -I../common -Xptxas -cloning=no -Xcompiler -Wall -arch=sm_35 -O3 -Xcompiler -fPIC profiler.cu -o profiler.o
nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
nvcc -ccbin=`which gcc` -D_FORCE_INLINES -arch=sm_35 -O3 inject_funcs.o profiler.o -L../../../core -lnvbit -L /usr/local/cuda-11.4/lib64 -lcuda -lcudart_static -shared -o profiler.so
nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).

Step 0 (4): Run and collect output without instrumentation
rm -f *.o *~ simple_add
`which nvcc` -o simple_add -Xptxas -v -arch=sm_35 simple_add.cu
./simple_add >golden_stdout.txt 2>golden_stderr.txt
make: *** [Makefile:16: golden] Error 139

I do not understand what happened. I have two GPUs installed and I want to use nvbitfi with the second one, but nvbitfi seems to be trying to use the first one.

The first GPU is Ampere and the second one is Volta. I think Ampere is not supported.

2) None of the -arch values I see in the build output corresponds to Volta. Should I change the arch here?

3) I want to use this library to study bit flips in data (e.g., an error in an input matrix). Is nvbitfi useful for that?

sivahari commented 3 years ago

Try setting CUDA_VISIBLE_DEVICES to 1 to point to the Volta GPU. If you are using bash, try export CUDA_VISIBLE_DEVICES=1
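
The suggestion above can be sketched as follows. One caveat worth checking: by default the CUDA runtime may number devices fastest-first, while nvidia-smi lists them in PCI bus order, so index 1 is not guaranteed to be the same GPU in both. Setting CUDA_DEVICE_ORDER=PCI_BUS_ID makes the two numberings agree. That index 1 is the Volta card on this machine is an assumption taken from the thread, not something the snippet verifies.

```shell
# Make CUDA enumerate devices in the same order as nvidia-smi (PCI bus order).
export CUDA_DEVICE_ORDER=PCI_BUS_ID

# Expose only device 1 to CUDA programs started from this shell.
# Assumption: device 1 is the Volta GPU, as described in the thread.
export CUDA_VISIBLE_DEVICES=1

# List the installed GPUs with their indices, if nvidia-smi is available.
# nvidia-smi itself ignores CUDA_VISIBLE_DEVICES, so all GPUs are shown.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi -L || true
fi
```

Programs launched afterwards from the same shell see the selected GPU as device 0.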

sivahari commented 3 years ago

nvbitfi injects an error into a dynamic instruction while the workload is running. If you want to inject only into the input, you may want to instrument your program directly. If you are interested, you can also inspect kernel entry points and inject an error at that point (https://github.com/NVlabs/nvbitfi/blob/master/injector/injector.cu#L333).
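
If the input matrix is read from a file, one way to inject only into the input, without nvbitfi, is to corrupt that file before the program reads it. The following is a minimal sketch under that assumption; flip_bit_in_file is a hypothetical helper written for this illustration and is not part of nvbitfi.

```shell
# Hypothetical helper (not part of nvbitfi): flip one bit of one byte in a file.
# Usage: flip_bit_in_file FILE BYTE_OFFSET BIT_INDEX   (BIT_INDEX is 0..7)
flip_bit_in_file() {
    file=$1
    offset=$2
    bit=$3

    if [ "$bit" -lt 0 ] || [ "$bit" -gt 7 ]; then
        echo "bit index must be between 0 and 7" >&2
        return 1
    fi

    # Read the byte at the given offset as an unsigned decimal number.
    old=$(od -An -tu1 -j "$offset" -N 1 "$file" | tr -d '[:space:]')
    if [ -z "$old" ]; then
        echo "offset $offset is past the end of $file" >&2
        return 1
    fi

    # XOR with a one-bit mask to flip exactly the requested bit.
    new=$(( old ^ (1 << bit) ))

    # Write the modified byte back in place without truncating the file.
    printf "\\$(printf '%03o' "$new")" |
        dd of="$file" bs=1 seek="$offset" count=1 conv=notrunc 2>/dev/null
}

# Example (file name is illustrative): flip bit 3 of the byte at offset 16.
# flip_bit_in_file input_matrix.bin 16 3
```

Running the application on the corrupted copy and comparing its output with a run on the original file shows how the flip propagates.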

nnaron commented 3 years ago

Thanks for the answers. After setting CUDA_VISIBLE_DEVICES to 1, cleaning, and compiling again, I still see the same error.

Step 0 (4): Run and collect output without instrumentation
rm -f *.o *~ simple_add
`which nvcc` -o simple_add -Xptxas -v -arch=sm_35 simple_add.cu
./simple_add >golden_stdout.txt 2>golden_stderr.txt
make: *** [Makefile:16: golden] Error 139

Is it installed correctly?

sivahari commented 3 years ago

In this step, the application is run without instrumentation. Can you run any application on the GPU? If yes, try compiling and running the simple_add application directly from its directory to inspect why it may be failing.
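
When running simple_add by hand, it helps to decode the exit status. make reports the failing command's exit status as "Error N", and a shell reports a process killed by a signal as 128 plus the signal number, so 139 corresponds to signal 11, which is SIGSEGV (a segmentation fault) on Linux. The helper below is a small sketch for this; describe_status is a hypothetical name, not part of nvbitfi.

```shell
# Hypothetical helper: turn a shell exit status into a readable description.
describe_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "exited normally"
    elif [ "$status" -gt 128 ]; then
        # Statuses above 128 mean the process was killed by signal (status - 128).
        sig=$(( status - 128 ))
        echo "terminated by signal $sig (SIG$(kill -l "$sig"))"
    else
        echo "exited with error code $status"
    fi
}

# Example, run from the simple_add directory after compiling it:
# ./simple_add; describe_status $?
```

Since the redirects in the Makefile send stderr to golden_stderr.txt, that file is also worth reading after a failed run.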