Syllo / nvtop

GPU & Accelerator process monitoring for AMD, Apple, Huawei, Intel, NVIDIA and Qualcomm

Segmentation fault (core dumped) #107

Closed lamhoangtung closed 3 years ago

lamhoangtung commented 3 years ago

Hi, I'm installing nvtop on Ubuntu 18.04.5 LTS following the build instructions in this repo. The build went smoothly, with no warnings or errors.

But when I try to launch nvtop, I get the error: Segmentation fault (core dumped)

Here is my nvidia-smi output:

Mon May 24 04:28:38 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   40C    P0    39W / 250W |   3025MiB / 16280MiB |     19%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

I also did some testing and found that the last working commit for me was 0ef51c99ee1598cfc7fd806f122fbaf315f81a26. From that point on, I always get this error.

Is there anything I can do to help resolve this? Thanks ;)

Syllo commented 3 years ago

Hey,

Could you please compile it this way:

mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Debug
make
./src/nvtop 2> error.txt

and then post the contents of the file error.txt?

TommyJerryMairo commented 3 years ago

I have a similar segfault on the latest master commit, but nothing was printed to stderr. However, I do have a short report from systemd-coredump:

[tjm@ArchPad tmp]$ coredumpctl info 235492
           PID: 235492 (nvtop)
           UID: 1000 (tjm)
           GID: 100 (users)
        Signal: 11 (SEGV)
     Timestamp: Mon 2021-05-24 05:38:30 PDT (42s ago)
  Command Line: nvtop
    Executable: /usr/bin/nvtop
 Control Group: /user.slice/user-1000.slice/session-3.scope
          Unit: session-3.scope
         Slice: user-1000.slice
       Session: 3
     Owner UID: 1000 (tjm)
       Boot ID: 3da7ccc2c46b4b619eeb7cf45882b3b8
    Machine ID: ffd680d0906946c29ee244fc8114ae2c
      Hostname: ArchPad
       Storage: /var/lib/systemd/coredump/core.nvtop.1000.3da7ccc2c46b4b619eeb7cf45882b3b8.235492.1621859910000000.zst (present)
     Disk Size: 79.1K
       Message: Process 235492 (nvtop) of user 1000 dumped core.

                Stack trace of thread 235492:
                #0  0x0000557be2f8d882 draw_processes (nvtop + 0x7882)
                #1  0x0000557be2f8a5cf main (nvtop + 0x45cf)
                #2  0x00007efddff30b25 __libc_start_main (libc.so.6 + 0x27b25)
                #3  0x0000557be2f8a8ee _start (nvtop + 0x48ee)
[tjm@ArchPad tmp]$ 

From the core dump, it looks like there may be a null pointer dereference at interface.c:1251:

[tjm@ArchPad tmp]$ gdb -q /usr/bin/nvtop core.nvtop.235492 
Reading symbols from /usr/bin/nvtop...
[New LWP 235492]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
Core was generated by `nvtop'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000557be2f8d882 in draw_processes (interface=0x557be33e1080, devices=<optimized out>, devices_count=1)
    at /home/tjm/.cache/pikaur/build/nvtop-git/src/nvtop-git/src/interface.c:1251
1251          all_procs.processes[interface->process.selected_row].process->pid;
(gdb)

XuehaiPan commented 3 years ago

Same issue for me. On my machine, this issue arises when a single process is using multiple GPUs.

Here is the output of nvtop 2>debug.txt

ASAN:DEADLYSIGNAL
=================================================================
==75378==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0x7fa5a3bf77c6 bp 0x7ffdd51dfc80 sp 0x7ffdd51df3f8 T0)
    #0 0x7fa5a3bf77c5 in strlen (/lib/x86_64-linux-gnu/libc.so.6+0x8b7c5)
    #1 0x7fa5a447febb  (/usr/lib/x86_64-linux-gnu/libasan.so.3+0x3cebb)
    #2 0x40f09c in draw_processes /home/panxuehai/Projects/nvtop/src/interface.c:1257
    #3 0x411dea in draw_gpu_info_ncurses /home/panxuehai/Projects/nvtop/src/interface.c:1685
    #4 0x404e3d in main /home/panxuehai/Projects/nvtop/src/nvtop.c:341
    #5 0x7fa5a3b8c83f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2083f)
    #6 0x403c18 in _start (/home/panxuehai/Projects/nvtop/build/src/nvtop+0x403c18)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV (/lib/x86_64-linux-gnu/libc.so.6+0x8b7c5) in strlen
==75378==ABORTING

Steps to reproduce:

  1. run nvtop first.

  2. run the following Python code:

$ ipython3
In [1]: import cupy as cp

In [2]: with cp.cuda.Device(0):
   ...:     x = cp.zeros((10000, 1000))
   ...:     

In [3]: with cp.cuda.Device(1):
   ...:     y = cp.zeros((10000, 1000))
   ...:     

If I reverse the order of step 1 and step 2, nvtop will run as expected.

Syllo commented 3 years ago

I think that the patch in the branch fix_segfault should do the trick.

When writing this bit of code, I assumed for some reason that there would always be at least one process running on the GPUs, which is not the case on a server.
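
To illustrate the failure mode described above, here is a minimal sketch of a guarded lookup of the selected row. The struct and function names are hypothetical, simplified stand-ins and are not nvtop's actual types or the actual patch; the point is only that an empty process list has to be handled before indexing into it.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical, simplified stand-ins for the real data structures. */
struct proc_info {
  pid_t pid;
};

struct proc_entry {
  struct proc_info *process; /* may be NULL if the lookup failed */
};

struct proc_list {
  struct proc_entry *processes;
  size_t count; /* 0 when no process is using any GPU */
};

/* Returns the PID of the selected row, or -1 when there is nothing to select. */
pid_t selected_pid_or_none(const struct proc_list *list, size_t selected_row) {
  if (list == NULL || list->processes == NULL || list->count == 0)
    return -1;
  if (selected_row >= list->count)
    selected_row = list->count - 1; /* clamp a stale selection */
  assert(selected_row < list->count);
  if (list->processes[selected_row].process == NULL)
    return -1;
  return list->processes[selected_row].process->pid;
}
```

A caller would treat -1 as "no selection" and skip any per-process action instead of dereferencing the entry.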

Could any of you tell me if the patch solves this issue?

To test, you must check out the correct branch:

git pull
git checkout fix_segfault
# Build as usual

XuehaiPan commented 3 years ago

The issue still exists.

Syllo commented 3 years ago

Can you please provide the error.txt output for this branch?

XuehaiPan commented 3 years ago

../src/interface.c:1266:25: runtime error: null pointer passed as argument 1, which is declared to never be null
AddressSanitizer:DEADLYSIGNAL
=================================================================
==29805==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0x7f164627b7c6 bp 0x7ffe12c16ca0 sp 0x7ffe12c16448 T0)
==29805==The signal is caused by a READ memory access.
==29805==Hint: address points to the zero page.
    #0 0x7f164627b7c6 in strlen (/lib/x86_64-linux-gnu/libc.so.6+0x8b7c6)
    #1 0x7f1647475cdc  (/home/panxuehai/.linuxbrew/lib/gcc/11/libasan.so.6+0x3fcdc)
    #2 0x41c740 in draw_processes ../src/interface.c:1266
    #3 0x42257c in draw_gpu_info_ncurses ../src/interface.c:1694
    #4 0x407693 in main ../src/nvtop.c:341
    #5 0x7f164621083f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2083f)
    #6 0x405848 in _start (/home/panxuehai/Projects/nvtop/cmake-build-debug/src/nvtop+0x405848)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV (/lib/x86_64-linux-gnu/libc.so.6+0x8b7c6) in strlen
==29805==ABORTING

XuehaiPan commented 3 years ago

strlen gets a null pointer for all_procs.processes[i].process->user_name here:

https://github.com/Syllo/nvtop/blob/7b0d8e583467cd21a0cb75dcd549a61c209ed902/src/interface.c#L1264-L1269

I think this may be caused by the PID info cache for processes that are using multiple GPUs.

https://github.com/Syllo/nvtop/blob/7b0d8e583467cd21a0cb75dcd549a61c209ed902/src/extract_gpuinfo.c#L132-L148

Syllo commented 3 years ago

Thank you.

Another patch has been pushed to fix this bug on the branch fix_segfault.

XuehaiPan commented 3 years ago

=================================================================
==88252==ERROR: AddressSanitizer: dynamic-stack-buffer-overflow on address 0x7ffeb142b2af at pc 0x7f5ef83ca1e6 bp 0x7ffeb142b060 sp 0x7ffeb142a810
WRITE of size 7 at 0x7ffeb142b2af thread T0
    #0 0x7f5ef83ca1e5 in __interceptor_vsnprintf (/home/panxuehai/.linuxbrew/lib/gcc/11/libasan.so.6+0x601e5)
    #1 0x7f5ef83ca3ee in __interceptor_snprintf (/home/panxuehai/.linuxbrew/lib/gcc/11/libasan.so.6+0x603ee)
    #2 0x41aa61 in print_processes_on_screen ../src/interface.c:1168
    #3 0x41c900 in draw_processes ../src/interface.c:1273
    #4 0x42257c in draw_gpu_info_ncurses ../src/interface.c:1694
    #5 0x407693 in main ../src/nvtop.c:341
    #6 0x7f5ef714483f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2083f)
    #7 0x405848 in _start (/home/panxuehai/Projects/nvtop/cmake-build-debug/src/nvtop+0x405848)

Address 0x7ffeb142b2af is located in stack of thread T0
SUMMARY: AddressSanitizer: dynamic-stack-buffer-overflow (/home/panxuehai/.linuxbrew/lib/gcc/11/libasan.so.6+0x601e5) in __interceptor_vsnprintf
Shadow bytes around the buggy address:
  0x10005627d600: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x10005627d610: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x10005627d620: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x10005627d630: ca ca ca ca 00 02 cb cb cb cb cb cb 00 00 00 00
  0x10005627d640: ca ca ca ca 07 cb cb cb cb cb cb cb 00 00 00 00
=>0x10005627d650: ca ca ca ca 00[07]cb cb cb cb cb cb 00 00 00 00
  0x10005627d660: ca ca ca ca 04 cb cb cb cb cb cb cb 00 00 00 00
  0x10005627d670: ca ca ca ca 00 cb cb cb cb cb cb cb 00 00 00 00
  0x10005627d680: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x10005627d690: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x10005627d6a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
  Shadow gap:              cc
==88252==ABORTING

With the newest patch, I get a dynamic-stack-buffer-overflow error when user_name is not NULL.

I get the same stack buffer overflow when I simply add a user_name != NULL check to the previous patch:

if (IS_VALID(gpuinfo_process_user_name_valid,
              all_procs.processes[i].process->valid) &&
    all_procs.processes[i].process->user_name != NULL)
{
  unsigned length = strlen(all_procs.processes[i].process->user_name);
  if (length > largest_username)
    largest_username = length;
}

Syllo commented 3 years ago

That one seems unrelated: it was in another part of the program, where the sprintf function could write outside of the buffer. Yet another patch is available.
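
As an illustration of the kind of fix involved, here is a minimal sketch of bounded formatting that also detects truncation. The helper name and the "MiB" format are assumptions made for this example and are not taken from the actual patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: formats "<value>MiB" into buf without ever writing
 * more than buf_size bytes. Returns true if the whole string fit. */
bool format_mib(char *buf, size_t buf_size, unsigned long long mib) {
  assert(buf != NULL && buf_size > 0);
  int needed = snprintf(buf, buf_size, "%lluMiB", mib);
  /* snprintf returns the length it wanted to write (excluding the NUL).
   * A negative value is an output error; >= buf_size means truncation. */
  return needed >= 0 && (size_t)needed < buf_size;
}
```

The size passed to snprintf must be the real capacity of the destination. If the buffer is a variable-length or alloca-style allocation, the same size expression should be used for both the allocation and the call.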

XuehaiPan commented 3 years ago

Error with the third patch:

nvtop: ../src/extract_gpuinfo.c:200: gpuinfo_populate_process_infos: Assertion `devices[i].processes[j].gpu_memory_percentage <= 100' failed.

Here is the output after adding an fprintf before the assertion:

if (IS_VALID(gpuinfo_total_memory_valid, devices[i].dynamic_info.valid) &&
    IS_VALID(gpuinfo_process_gpu_memory_usage_valid,
              devices[i].processes[j].valid)) {
  float percentage =
      roundf(100.f * (float)devices[i].processes[j].gpu_memory_usage /
              (float)devices[i].dynamic_info.total_memory);
  devices[i].processes[j].gpu_memory_percentage = (unsigned)percentage;
  fprintf(stderr,
          "gpu_memory_usage=%llu  total_memory=%llu  percentage=%f  gpu_memory_percentage=%llu\n",
          devices[i].processes[j].gpu_memory_usage, devices[i].dynamic_info.total_memory,
          percentage, devices[i].processes[j].gpu_memory_percentage);
  assert(devices[i].processes[j].gpu_memory_percentage <= 100);
  SET_VALID(gpuinfo_process_gpu_memory_percentage_valid,
            devices[i].processes[j].valid);
}

gpu_memory_usage=10585374720  total_memory=11554717696  percentage=92.000000  gpu_memory_percentage=92
gpu_memory_usage=10585374720  total_memory=11554717696  percentage=92.000000  gpu_memory_percentage=92
gpu_memory_usage=10585374720  total_memory=11554717696  percentage=92.000000  gpu_memory_percentage=92
gpu_memory_usage=10585374720  total_memory=11554717696  percentage=92.000000  gpu_memory_percentage=92
gpu_memory_usage=7961837568  total_memory=11554717696  percentage=69.000000  gpu_memory_percentage=69
gpu_memory_usage=6277824512  total_memory=11554717696  percentage=54.000000  gpu_memory_percentage=54
gpu_memory_usage=1081081856  total_memory=11554717696  percentage=9.000000  gpu_memory_percentage=9
gpu_memory_usage=13744632839234567870  total_memory=11554717696  percentage=118952566784.000000  gpu_memory_percentage=2988449792
nvtop: ../src/extract_gpuinfo.c:204: gpuinfo_populate_process_infos: Assertion `devices[i].processes[j].gpu_memory_percentage <= 100' failed.

Syllo commented 3 years ago

Thanks for finding this one. I copied the struct definition from the header, and there was only one version, so I assumed backward compatibility. I double-checked the other functions, and their types remained the same.
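
To show why a mismatched struct definition can produce garbage values like the one above, here is a small sketch. Both layouts are hypothetical and are not copied from any real header; the assumption is simply that a newer definition appended fields to an older one, while the library still filled the array using the older, smaller element size.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical layouts for illustration; not copied from any real header. */
struct info_old {
  unsigned pid;
  unsigned long long used_memory;
};

struct info_new {
  unsigned pid;
  unsigned long long used_memory;
  unsigned extra_a; /* fields assumed to be added by a later version */
  unsigned extra_b;
};

/* Decodes element `index` of `raw` using the NEW layout's stride. */
struct info_new read_as_new(const unsigned char *raw, size_t raw_size,
                            size_t index) {
  struct info_new out;
  assert((index + 1) * sizeof out <= raw_size); /* stay inside the buffer */
  memcpy(&out, raw + index * sizeof out, sizeof out);
  return out;
}
```

On a typical 64-bit platform, the first element still decodes correctly because both layouts share a common prefix, while later elements are read from the wrong offsets. That would be consistent with a bug that only appears once more than one process is listed.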

Is that all there was?

XuehaiPan commented 3 years ago

It works fine on my machine with driver version 430.64 / CUDA 10.1 on Ubuntu 16.04 LTS.

Syllo commented 3 years ago

All right, I merged the patches into master.

Thanks a lot @XuehaiPan for your help fixing these and @TommyJerryMairo for providing the process dump.

Take care

lamhoangtung commented 3 years ago

Thanks, guys, for the help.