lamhoangtung closed this issue 3 years ago.
Hey,
Could you please compile this way:
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Debug
make
./src/nvtop 2> error.txt
and then post the content of the file error.txt
I have a similar problem with the segfault on the latest master commit, where stderr prints nothing. However, I do have a short report from systemd-coredump:
[tjm@ArchPad tmp]$ coredumpctl info 235492
PID: 235492 (nvtop)
UID: 1000 (tjm)
GID: 100 (users)
Signal: 11 (SEGV)
Timestamp: Mon 2021-05-24 05:38:30 PDT (42s ago)
Command Line: nvtop
Executable: /usr/bin/nvtop
Control Group: /user.slice/user-1000.slice/session-3.scope
Unit: session-3.scope
Slice: user-1000.slice
Session: 3
Owner UID: 1000 (tjm)
Boot ID: 3da7ccc2c46b4b619eeb7cf45882b3b8
Machine ID: ffd680d0906946c29ee244fc8114ae2c
Hostname: ArchPad
Storage: /var/lib/systemd/coredump/core.nvtop.1000.3da7ccc2c46b4b619eeb7cf45882b3b8.235492.1621859910000000.zst (present)
Disk Size: 79.1K
Message: Process 235492 (nvtop) of user 1000 dumped core.
Stack trace of thread 235492:
#0 0x0000557be2f8d882 draw_processes (nvtop + 0x7882)
#1 0x0000557be2f8a5cf main (nvtop + 0x45cf)
#2 0x00007efddff30b25 __libc_start_main (libc.so.6 + 0x27b25)
#3 0x0000557be2f8a8ee _start (nvtop + 0x48ee)
[tjm@ArchPad tmp]$
From the coredump file, it looks like there is a null-pointer dereference at interface.c:1251:
[tjm@ArchPad tmp]$ gdb -q /usr/bin/nvtop core.nvtop.235492
Reading symbols from /usr/bin/nvtop...
[New LWP 235492]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
Core was generated by `nvtop'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x0000557be2f8d882 in draw_processes (interface=0x557be33e1080, devices=<optimized out>, devices_count=1)
at /home/tjm/.cache/pikaur/build/nvtop-git/src/nvtop-git/src/interface.c:1251
1251 all_procs.processes[interface->process.selected_row].process->pid;
(gdb)
Same issue for me. On my machine, this issue arises when a single process is using multiple GPUs.
Here is the output of nvtop 2> debug.txt:
ASAN:DEADLYSIGNAL
=================================================================
==75378==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0x7fa5a3bf77c6 bp 0x7ffdd51dfc80 sp 0x7ffdd51df3f8 T0)
#0 0x7fa5a3bf77c5 in strlen (/lib/x86_64-linux-gnu/libc.so.6+0x8b7c5)
#1 0x7fa5a447febb (/usr/lib/x86_64-linux-gnu/libasan.so.3+0x3cebb)
#2 0x40f09c in draw_processes /home/panxuehai/Projects/nvtop/src/interface.c:1257
#3 0x411dea in draw_gpu_info_ncurses /home/panxuehai/Projects/nvtop/src/interface.c:1685
#4 0x404e3d in main /home/panxuehai/Projects/nvtop/src/nvtop.c:341
#5 0x7fa5a3b8c83f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2083f)
#6 0x403c18 in _start (/home/panxuehai/Projects/nvtop/build/src/nvtop+0x403c18)
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV (/lib/x86_64-linux-gnu/libc.so.6+0x8b7c5) in strlen
==75378==ABORTING
Steps to reproduce:
1. Run nvtop first.
2. Run the following Python code:
$ ipython3
In [1]: import cupy as cp
In [2]: with cp.cuda.Device(0):
...: x = cp.zeros((10000, 1000))
...:
In [3]: with cp.cuda.Device(1):
...: y = cp.zeros((10000, 1000))
...:
If I reverse the order of step 1 and step 2, nvtop will run as expected.
I think that the patch on the branch fix_segfault should do the trick. When writing this bit of code, I assumed for some reason that there would always be at least one process running on the GPUs, which is not the case on a server.
Could any of you tell me if the patch solves this issue?
To test, you must check out the correct branch:
git pull
git checkout fix_segfault
# Build as usual
The issue still exists.
Could any of you tell me if the patch solves this issue?
Can you please provide the error.txt output for this branch?
../src/interface.c:1266:25: runtime error: null pointer passed as argument 1, which is declared to never be null
AddressSanitizer:DEADLYSIGNAL
=================================================================
==29805==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0x7f164627b7c6 bp 0x7ffe12c16ca0 sp 0x7ffe12c16448 T0)
==29805==The signal is caused by a READ memory access.
==29805==Hint: address points to the zero page.
#0 0x7f164627b7c6 in strlen (/lib/x86_64-linux-gnu/libc.so.6+0x8b7c6)
#1 0x7f1647475cdc (/home/panxuehai/.linuxbrew/lib/gcc/11/libasan.so.6+0x3fcdc)
#2 0x41c740 in draw_processes ../src/interface.c:1266
#3 0x42257c in draw_gpu_info_ncurses ../src/interface.c:1694
#4 0x407693 in main ../src/nvtop.c:341
#5 0x7f164621083f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2083f)
#6 0x405848 in _start (/home/panxuehai/Projects/nvtop/cmake-build-debug/src/nvtop+0x405848)
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV (/lib/x86_64-linux-gnu/libc.so.6+0x8b7c6) in strlen
==29805==ABORTING
strlen gets a NULL pointer for all_procs.processes[i].process->user_name here.
I think this may be caused by the PID info cache for processes that are using multiple GPUs.
Thank you.
Another patch has been pushed to fix this bug on the branch fix_segfault.
=================================================================
==88252==ERROR: AddressSanitizer: dynamic-stack-buffer-overflow on address 0x7ffeb142b2af at pc 0x7f5ef83ca1e6 bp 0x7ffeb142b060 sp 0x7ffeb142a810
WRITE of size 7 at 0x7ffeb142b2af thread T0
#0 0x7f5ef83ca1e5 in __interceptor_vsnprintf (/home/panxuehai/.linuxbrew/lib/gcc/11/libasan.so.6+0x601e5)
#1 0x7f5ef83ca3ee in __interceptor_snprintf (/home/panxuehai/.linuxbrew/lib/gcc/11/libasan.so.6+0x603ee)
#2 0x41aa61 in print_processes_on_screen ../src/interface.c:1168
#3 0x41c900 in draw_processes ../src/interface.c:1273
#4 0x42257c in draw_gpu_info_ncurses ../src/interface.c:1694
#5 0x407693 in main ../src/nvtop.c:341
#6 0x7f5ef714483f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2083f)
#7 0x405848 in _start (/home/panxuehai/Projects/nvtop/cmake-build-debug/src/nvtop+0x405848)
Address 0x7ffeb142b2af is located in stack of thread T0
SUMMARY: AddressSanitizer: dynamic-stack-buffer-overflow (/home/panxuehai/.linuxbrew/lib/gcc/11/libasan.so.6+0x601e5) in __interceptor_vsnprintf
Shadow bytes around the buggy address:
0x10005627d600: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x10005627d610: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x10005627d620: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x10005627d630: ca ca ca ca 00 02 cb cb cb cb cb cb 00 00 00 00
0x10005627d640: ca ca ca ca 07 cb cb cb cb cb cb cb 00 00 00 00
=>0x10005627d650: ca ca ca ca 00[07]cb cb cb cb cb cb 00 00 00 00
0x10005627d660: ca ca ca ca 04 cb cb cb cb cb cb cb 00 00 00 00
0x10005627d670: ca ca ca ca 00 cb cb cb cb cb cb cb 00 00 00 00
0x10005627d680: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x10005627d690: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x10005627d6a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Shadow byte legend (one shadow byte represents 8 application bytes):
Addressable: 00
Partially addressable: 01 02 03 04 05 06 07
Heap left redzone: fa
Freed heap region: fd
Stack left redzone: f1
Stack mid redzone: f2
Stack right redzone: f3
Stack after return: f5
Stack use after scope: f8
Global redzone: f9
Global init order: f6
Poisoned by user: f7
Container overflow: fc
Array cookie: ac
Intra object redzone: bb
ASan internal: fe
Left alloca redzone: ca
Right alloca redzone: cb
Shadow gap: cc
==88252==ABORTING
With the newest patch, I get a dynamic-stack-buffer-overflow error even when user_name is not NULL. I get the same overflow error when I simply add a user_name != NULL check to the previous patch:
if (IS_VALID(gpuinfo_process_user_name_valid,
all_procs.processes[i].process->valid) &&
all_procs.processes[i].process->user_name != NULL)
{
unsigned length = strlen(all_procs.processes[i].process->user_name);
if (length > largest_username)
largest_username = length;
}
That one seems unrelated; it was in another part of the program, where an sprintf-style call could print outside of its buffer. Yet another patch is available.
Error from the third patch:
nvtop: ../src/extract_gpuinfo.c:200: gpuinfo_populate_process_infos: Assertion `devices[i].processes[j].gpu_memory_percentage <= 100' failed.
The output after adding an fprintf before the assertion:
if (IS_VALID(gpuinfo_total_memory_valid, devices[i].dynamic_info.valid) &&
IS_VALID(gpuinfo_process_gpu_memory_usage_valid,
devices[i].processes[j].valid)) {
float percentage =
roundf(100.f * (float)devices[i].processes[j].gpu_memory_usage /
(float)devices[i].dynamic_info.total_memory);
devices[i].processes[j].gpu_memory_percentage = (unsigned)percentage;
fprintf(stderr,
"gpu_memory_usage=%llu total_memory=%llu percentage=%f gpu_memory_percentage=%llu\n",
devices[i].processes[j].gpu_memory_usage, devices[i].dynamic_info.total_memory,
percentage, devices[i].processes[j].gpu_memory_percentage);
assert(devices[i].processes[j].gpu_memory_percentage <= 100);
SET_VALID(gpuinfo_process_gpu_memory_percentage_valid,
devices[i].processes[j].valid);
}
gpu_memory_usage=10585374720 total_memory=11554717696 percentage=92.000000 gpu_memory_percentage=92
gpu_memory_usage=10585374720 total_memory=11554717696 percentage=92.000000 gpu_memory_percentage=92
gpu_memory_usage=10585374720 total_memory=11554717696 percentage=92.000000 gpu_memory_percentage=92
gpu_memory_usage=10585374720 total_memory=11554717696 percentage=92.000000 gpu_memory_percentage=92
gpu_memory_usage=7961837568 total_memory=11554717696 percentage=69.000000 gpu_memory_percentage=69
gpu_memory_usage=6277824512 total_memory=11554717696 percentage=54.000000 gpu_memory_percentage=54
gpu_memory_usage=1081081856 total_memory=11554717696 percentage=9.000000 gpu_memory_percentage=9
gpu_memory_usage=13744632839234567870 total_memory=11554717696 percentage=118952566784.000000 gpu_memory_percentage=2988449792
nvtop: ../src/extract_gpuinfo.c:204: gpuinfo_populate_process_infos: Assertion `devices[i].processes[j].gpu_memory_percentage <= 100' failed.
Thanks for finding this one. I copied the struct definition from the header, and there was only one version, so I assumed backward compatibility. I double-checked the other functions, and the types remained the same.
Is that all there was?
It works fine on my machine with driver version 430.64 / CUDA 10.1 on Ubuntu 16.04 LTS.
All right, I merged the patches into master.
Thanks a lot @XuehaiPan for your help fixing these and @TommyJerryMairo for providing the process dump.
Take care
Thanks, guys, for the help.
Hi, I'm installing nvtop on Ubuntu 18.04.5 LTS following the build instructions in this repo. The build went smoothly, with no warnings or errors.
But when trying to launch nvtop, I get the error:
Segmentation fault (core dumped)
Here is my nvidia-smi output:
Also, I did some testing and found that the last working commit for me was 0ef51c99ee1598cfc7fd806f122fbaf315f81a26. From that point on, I always get this error.
Is there anything I can do to help resolve this? Thanks ;)