Syllo / nvtop

GPU & Accelerator process monitoring for AMD, Apple, Huawei, Intel, NVIDIA and Qualcomm

Segmentation fault (core dumped) with CUDA 11.7 #157

Closed claus-h-g closed 1 year ago

claus-h-g commented 1 year ago

Many thanks for providing nvtop. I just built nvtop from source on a freshly installed xUbuntu 18.04 with CUDA 11.7 on an NVIDIA GeForce RTX 2070 and encountered a segmentation fault. Yesterday's build of nvtop on a second xUbuntu 18 system with CUDA 11.6 and an NVIDIA GeForce RTX 3060 is running perfectly fine, as it does on other systems.

Is it possible that the new version of CUDA is causing the segmentation fault, as described in #107 (https://github.com/Syllo/nvtop/issues/107)?

Syllo commented 1 year ago

Hello @claus-h-g,

Could you provide the location of the segfault by using gdb or another debugger to print the stack trace?

claus-h-g commented 1 year ago

Many thanks for your feedback and request for additional information.

Sorry for the late reply. Other things needed my full attention.

I am using these tools for the first time and followed these instructions: http://www.cs.toronto.edu/~krueger/csc209h/tut/gdb_tutorial.html

On a freshly installed xUbuntu 18 system I built nvtop from source. Starting gdb produced no error output:

$ gdb nvtop
GNU gdb (Ubuntu 8.1.1-0ubuntu1) 8.1.1
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
http://www.gnu.org/software/gdb/bugs/.
Find the GDB manual and other documentation resources online at:
http://www.gnu.org/software/gdb/documentation/.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from nvtop...(no debugging symbols found)...done.

(gdb) run
Starting program: /usr/local/bin/nvtop

Program received signal SIGSEGV, Segmentation fault.
0x0000555555568238 in gpuinfo_amdgpu_get_device_handles ()
(gdb) backtrace
#0  0x0000555555568238 in gpuinfo_amdgpu_get_device_handles ()
#1  0x0000555555562fab in gpuinfo_init_info_extraction ()
#2  0x0000555555558b3a in main ()
(gdb)

I hope this provides the required information.

Syllo commented 1 year ago

Thanks a lot, that's what I was looking for. I looked at the code inside the function where the SIGSEGV signal happens, but I cannot find anything that would do so. I tried to reproduce inside an Ubuntu 18.04 container but no luck either.

If it is not too much to ask, please configure the build with cmake .. -DCMAKE_BUILD_TYPE=Debug -DNVIDIA_SUPPORT=ON -DAMDGPU_SUPPORT=ON when compiling nvtop, and then do another run with gdb. That way I will be able to see at which line of the source code the error occurs.

claus-h-g commented 1 year ago

Many thanks for your reply. English is not my mother tongue, so I am not 100 % sure that I understood your reply correctly. To avoid any misunderstanding, I will describe my steps as best I can.

I compiled with the following commands:

git clone https://github.com/Syllo/nvtop.git
mkdir -p nvtop/build && cd nvtop/build
cmake .. -DCMAKE_BUILD_TYPE=Debug -DNVIDIA_SUPPORT=ON -DAMDGPU_SUPPORT=ON.
make
sudo make install

Running nvtop without gdb:

$ nvtop
/applications/nvtop/src/extract_gpuinfo_amdgpu.c:361:15: runtime error: shift exponent 4294967295 is too large for 32-bit type 'int'
ASAN:DEADLYSIGNAL
==7327==ERROR: AddressSanitizer: SEGV on unknown address 0x60d800000450 (pc 0x5614190af4fc bp 0x7ffc90fea220 sp 0x7ffc90fea0a0 T0)
==7327==The signal is caused by a READ memory access.
    #0 0x5614190af4fb in gpuinfo_amdgpu_get_device_handles /applications/nvtop/src/extract_gpuinfo_amdgpu.c:364
    #1 0x561419094e20 in gpuinfo_init_info_extraction /applications/nvtop/src/extract_gpuinfo.c:67
    #2 0x5614190650f7 in main /applications/nvtop/src/nvtop.c:256
    #3 0x7f836415ec86 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x21c86)
    #4 0x561419064109 in _start (/usr/local/bin/nvtop+0x7e109)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV /applications/nvtop/src/extract_gpuinfo_amdgpu.c:364 in gpuinfo_amdgpu_get_device_handles
==7327==ABORTING

Using gdb:

$ gdb nvtop
GNU gdb (Ubuntu 8.1.1-0ubuntu1) 8.1.1
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
http://www.gnu.org/software/gdb/bugs/.
Find the GDB manual and other documentation resources online at:
http://www.gnu.org/software/gdb/documentation/.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from nvtop...done.
(gdb) run
Starting program: /usr/local/bin/nvtop
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
/applications/nvtop/src/extract_gpuinfo_amdgpu.c:361:15: runtime error: shift exponent 4294967295 is too large for 32-bit type 'int'

Program received signal SIGSEGV, Segmentation fault.
0x000055555561d4fc in gpuinfo_amdgpu_get_device_handles (devices=0x7fffffffdeb0, count=0x7fffffffdca0, mask=0x7fffffffdc58) at /applications/nvtop/src/extract_gpuinfo_amdgpu.c:364
364         if ((fd = open(devs[i]->nodes[j], O_RDWR)) < 0)
(gdb)

Hope you can derive the required information.

Sorry for the bold header type of formatting - I do not know how to correct this.

Syllo commented 1 year ago

That was exactly what I was looking for, thank you.

I pushed a patch that should fix the issue you are encountering.
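The patch itself is not reproduced in this thread, so what follows is only an illustration of the class of bug the sanitizer output points to, not nvtop's actual code. A shift exponent of 4294967295 is UINT_MAX for a 32-bit unsigned value, which is what an unsigned index holds after being decremented past zero. Using such an index for the array access on the following lines would also be consistent with the invalid read reported at line 364. The sketch assumes a hypothetical `NODE_MAX` constant and a hypothetical bitmask of available nodes, and shows a downward scan that cannot wrap and that reports "nothing found" explicitly.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical number of node slots per device (illustrative only;
 * not taken from nvtop or libdrm). */
#define NODE_MAX 3

/*
 * Return the highest index below NODE_MAX whose bit is set in
 * available_mask, or -1 if no such bit is set.
 *
 * The condition `i-- > 0` tests the counter before decrementing it, so the
 * loop body only ever sees values in [0, NODE_MAX - 1] and never a
 * wrapped-around UINT_MAX.
 */
static int highest_available_node(unsigned available_mask) {
  for (unsigned i = NODE_MAX; i-- > 0;) {
    if (available_mask & (1u << i))
      return (int)i;
  }
  return -1; /* no node available: the caller must skip this device */
}

/*
 * Hypothetical caller: pick a path from a per-device node table, or NULL
 * when the device exposes no usable node, instead of indexing the table
 * with an out-of-range value.
 */
static const char *pick_node_path(const char *const nodes[NODE_MAX],
                                  unsigned available_mask) {
  int idx = highest_available_node(available_mask);
  if (idx < 0)
    return NULL;
  assert(idx < NODE_MAX);
  return nodes[idx];
}
```

With this shape, a device whose mask has no usable bit yields -1 and NULL rather than an out-of-range shift and array read, so the caller can move on to the next device.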

Could you please test the branch fix_amdgpu_device_handle_segfault to check if it works?

Here are the steps:

git clone https://github.com/Syllo/nvtop.git
mkdir -p nvtop/build && cd nvtop/build
git checkout fix_amdgpu_device_handle_segfault
cmake .. -DCMAKE_BUILD_TYPE=Debug -DNVIDIA_SUPPORT=ON -DAMDGPU_SUPPORT=ON
make
./src/nvtop

claus-h-g commented 1 year ago

Thanks for your rapid reply and fix. I did follow the steps you detailed.

Make resulted in one warning:

Scanning dependencies of target nvtop
[  7%] Building C object src/CMakeFiles/nvtop.dir/nvtop.c.o
[ 14%] Building C object src/CMakeFiles/nvtop.dir/interface.c.o
In file included from /applications/nvtop/src/interface.c:42:0:
/applications/nvtop/src/interface.c: In function 'draw_percentage_meter':
/applications/nvtop/src/interface.c:496:23: warning: implicit conversion from 'float' to 'double' when passing argument to function [-Wdouble-promotion]
   float usage = round((float)between_sbraces * new_percentage / 100.f);
                       ^
[ 21%] Building C object src/CMakeFiles/nvtop.dir/interface_layout_selection.c.o
[ 28%] Building C object src/CMakeFiles/nvtop.dir/interface_options.c.o
[ 35%] Building C object src/CMakeFiles/nvtop.dir/interface_setup_win.c.o
[ 42%] Building C object src/CMakeFiles/nvtop.dir/interface_ring_buffer.c.o
[ 50%] Building C object src/CMakeFiles/nvtop.dir/get_process_info_linux.c.o
[ 57%] Building C object src/CMakeFiles/nvtop.dir/extract_gpuinfo.c.o
[ 64%] Building C object src/CMakeFiles/nvtop.dir/time.c.o
[ 71%] Building C object src/CMakeFiles/nvtop.dir/plot.c.o
[ 78%] Building C object src/CMakeFiles/nvtop.dir/ini.c.o
[ 85%] Building C object src/CMakeFiles/nvtop.dir/extract_gpuinfo_nvidia.c.o
[ 92%] Building C object src/CMakeFiles/nvtop.dir/extract_gpuinfo_amdgpu.c.o
[100%] Linking C executable nvtop
[100%] Built target nvtop

Running ./src/nvtop showed the expected UI (screenshot attached).

Many thanks for your rapid help.

Syllo commented 1 year ago

Glad it worked. I'll merge the branch into master.