claus-h-g closed this issue 1 year ago.
Hello @claus-h-g,
Could you provide the location of the segfault by using gdb or another debugger to print the stack trace?
Many thanks for your feedback and request for additional information.
Sorry for the late reply. Other things needed my full-time attention.
I am using these tools for the first time and followed these instructions: http://www.cs.toronto.edu/~krueger/csc209h/tut/gdb_tutorial.html
On a freshly installed xUbuntu 18 system, I built nvtop from source. No errors were reported. gdb output:
$ gdb nvtop
GNU gdb (Ubuntu 8.1.1-0ubuntu1) 8.1.1
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see: http://www.gnu.org/software/gdb/bugs/.
Find the GDB manual and other documentation resources online at: http://www.gnu.org/software/gdb/documentation/.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from nvtop...(no debugging symbols found)...done.
(gdb) run
Starting program: /usr/local/bin/nvtop
Program received signal SIGSEGV, Segmentation fault.
0x0000555555568238 in gpuinfo_amdgpu_get_device_handles ()
(gdb) backtrace
(gdb)
I hope this provides the required information.
Thanks a lot, that's what I was looking for. I looked at the code inside the function where the SIGSEGV occurs, but I cannot find anything that would cause it. I also tried to reproduce it inside an Ubuntu 18.04 container, with no luck.
If it is not too much to ask, please compile nvtop with
cmake .. -DCMAKE_BUILD_TYPE=Debug -DNVIDIA_SUPPORT=ON -DAMDGPU_SUPPORT=ON
and then do another run with gdb. That way I will be able to see on which line of the source code the error occurs.
Many thanks for your reply. English is not my mother tongue, so I am not 100% sure I understood your reply correctly. To avoid any misunderstanding, I will describe my steps as best I can.
I compiled nvtop with the following commands:
git clone https://github.com/Syllo/nvtop.git
mkdir -p nvtop/build && cd nvtop/build
cmake .. -DCMAKE_BUILD_TYPE=Debug -DNVIDIA_SUPPORT=ON -DAMDGPU_SUPPORT=ON
make
sudo make install
Running nvtop without gdb:
==7327==ERROR: AddressSanitizer: SEGV on unknown address 0x60d800000450 (pc 0x5614190af4fc bp 0x7ffc90fea220 sp 0x7ffc90fea0a0 T0)
==7327==The signal is caused by a READ memory access.
#1 0x561419094e20 in gpuinfo_init_info_extraction /applications/nvtop/src/extract_gpuinfo.c:67
#2 0x5614190650f7 in main /applications/nvtop/src/nvtop.c:256
#3 0x7f836415ec86 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x21c86)
#4 0x561419064109 in _start (/usr/local/bin/nvtop+0x7e109)
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV /applications/nvtop/src/extract_gpuinfo_amdgpu.c:364 in gpuinfo_amdgpu_get_device_handles
==7327==ABORTING
Using gdb:
$ gdb nvtop
Reading symbols from nvtop...done.
(gdb) run
Starting program: /usr/local/bin/nvtop
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
/applications/nvtop/src/extract_gpuinfo_amdgpu.c:361:15: runtime error: shift exponent 4294967295 is too large for 32-bit type 'int'
Program received signal SIGSEGV, Segmentation fault.
0x000055555561d4fc in gpuinfo_amdgpu_get_device_handles (devices=0x7fffffffdeb0, count=0x7fffffffdca0, mask=0x7fffffffdc58) at /applications/nvtop/src/extract_gpuinfo_amdgpu.c:364
364 if ((fd = open(devs[i]->nodes[j], O_RDWR)) < 0)
(gdb)
I hope you can derive the required information from this.
Sorry for the bold header formatting; I do not know how to correct it.
That was exactly what I was looking for, thank you.
I pushed a patch that should fix the issue you are encountering.
Could you please test the branch fix_amdgpu_device_handle_segfault to check whether it works?
Here are the steps:
git clone https://github.com/Syllo/nvtop.git
mkdir -p nvtop/build && cd nvtop/build
git checkout fix_amdgpu_device_handle_segfault
cmake .. -DCMAKE_BUILD_TYPE=Debug -DNVIDIA_SUPPORT=ON -DAMDGPU_SUPPORT=ON
make
./src/nvtop
Thanks for your rapid reply and fix. I followed the steps you described.
make produced one warning:
Scanning dependencies of target nvtop
[ 7%] Building C object src/CMakeFiles/nvtop.dir/nvtop.c.o
[ 14%] Building C object src/CMakeFiles/nvtop.dir/interface.c.o
In file included from /applications/nvtop/src/interface.c:42:0:
/applications/nvtop/src/interface.c: In function 'draw_percentage_meter':
/applications/nvtop/src/interface.c:496:23: warning: implicit conversion from 'float' to 'double' when passing argument to function [-Wdouble-promotion]
  float usage = round((float)between_sbraces * new_percentage / 100.f);
                      ^
[ 21%] Building C object src/CMakeFiles/nvtop.dir/interface_layout_selection.c.o
[ 28%] Building C object src/CMakeFiles/nvtop.dir/interface_options.c.o
[ 35%] Building C object src/CMakeFiles/nvtop.dir/interface_setup_win.c.o
[ 42%] Building C object src/CMakeFiles/nvtop.dir/interface_ring_buffer.c.o
[ 50%] Building C object src/CMakeFiles/nvtop.dir/get_process_info_linux.c.o
[ 57%] Building C object src/CMakeFiles/nvtop.dir/extract_gpuinfo.c.o
[ 64%] Building C object src/CMakeFiles/nvtop.dir/time.c.o
[ 71%] Building C object src/CMakeFiles/nvtop.dir/plot.c.o
[ 78%] Building C object src/CMakeFiles/nvtop.dir/ini.c.o
[ 85%] Building C object src/CMakeFiles/nvtop.dir/extract_gpuinfo_nvidia.c.o
[ 92%] Building C object src/CMakeFiles/nvtop.dir/extract_gpuinfo_amdgpu.c.o
[100%] Linking C executable nvtop
[100%] Built target nvtop
When I ran ./src/nvtop, the expected UI was shown.
Many thanks for your rapid help
Glad it worked. I'll merge the branch into master.
Many thanks for providing nvtop. I just built nvtop from source on a freshly installed xUbuntu 18.04 with CUDA 11.7 on an NVIDIA GeForce RTX 2070 and encountered a segmentation fault. The build I made yesterday on a second xUbuntu 18 system with CUDA 11.6 and an NVIDIA GeForce RTX 3060 runs perfectly fine, as it does on other systems.
Is it possible that the new version of CUDA is causing the segmentation fault, as described in #107 (https://github.com/Syllo/nvtop/issues/107)?