XuehaiPan / nvitop

An interactive NVIDIA-GPU process viewer and beyond, the one-stop solution for GPU process management.
https://nvitop.readthedocs.io
Apache License 2.0

[BUG] Memory leaking for Nvitop instances inside docker container #128

Open kenvix opened 2 months ago

kenvix commented 2 months ago

Required prerequisites

What version of nvitop are you using?

1.3.2

Operating system and version

Ubuntu 22.04.4 LTS

NVIDIA driver version

535.104.12

NVIDIA-SMI

Wed Jun 26 20:18:32 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12             Driver Version: 535.104.12   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A40                     Off | 00000000:0E:00.0 Off |                    0 |
|  0%   27C    P8              23W / 300W |      4MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A40                     Off | 00000000:0F:00.0 Off |                    0 |
|  0%   40C    P0              75W / 300W |    413MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A40                     Off | 00000000:12:00.0 Off |                    0 |
|  0%   43C    P0              79W / 300W |  42629MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A40                     Off | 00000000:27:00.0 Off |                    0 |
|  0%   42C    P0              78W / 300W |  42567MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

Python environment

3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
linux
nvidia-ml-py==12.535.133
nvitop==1.3.2

Problem description

nvitop has a memory leak (RAM, not VRAM) when running inside a Docker container. The leaked memory grows so large that the operating system even takes about ten seconds to reclaim it after the process is killed with SIGKILL.

[screenshots attached]

Steps to Reproduce

Just keep nvitop running for a few months. You will see that nvitop consumes a large amount of system memory, about 300 GB over 77 days in my case.

Is this caused by nvitop recording too much VRAM and GPU-utilization history without ever releasing it?
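
To quantify the growth, something like the sketch below can periodically log the resident memory of all running nvitop instances (the command-line match and the one-minute interval are just placeholders; adjust as needed):

import time

import psutil


def nvitop_rss_bytes():
    # Sum the resident set size (bytes) of every process whose command line mentions nvitop.
    total = 0
    for proc in psutil.process_iter(['cmdline', 'memory_info']):
        cmdline = ' '.join(proc.info['cmdline'] or [])
        meminfo = proc.info['memory_info']
        if 'nvitop' in cmdline and meminfo is not None:
            total += meminfo.rss
    return total


while True:
    rss_mib = nvitop_rss_bytes() / 1024**2
    print(f'{time.strftime("%Y-%m-%d %H:%M:%S")}  nvitop RSS: {rss_mib:.1f} MiB', flush=True)
    time.sleep(60)  # sample once per minute; adjust as needed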

Traceback

No response

Logs

No response

Expected behavior

No response

Additional context

No response

XuehaiPan commented 1 month ago

Hi @kenvix, thanks for raising this.

I tested this locally but cannot reproduce it. I used the following script to create and terminate 10k processes on the GPU:

import time

import ray
import torch

@ray.remote(num_cpus=1, num_gpus=0.1)  # allows up to 10 concurrent tasks per GPU
def request_gpu():
    # Allocate a small CUDA tensor so the worker appears as a GPU process,
    # then hold it briefly before the worker process exits.
    torch.zeros(1000, device='cuda')
    time.sleep(10)

ray.init()
# Launch 10,000 short-lived GPU tasks and wait for all of them to finish.
_ = ray.get([request_gpu.remote() for _ in range(10000)])

The memory consumption of nvitop stays stable at around 260 MB.

[screenshot]
kenvix commented 1 month ago

Hi, @XuehaiPan

The test code you provided does not seem relevant to this issue. In my case, if you keep nvitop running under tmux or screen, you will see that the memory usage (RAM, not GPU VRAM) of nvitop itself slowly but continuously increases over time.

For the example below, I ran nvitop for 12 hours:

[screenshot]

It used about 4.5 GB of RAM.

XuehaiPan commented 1 month ago

@kenvix could you test nvitop using pipx? Maybe it is caused by a dependency (e.g., the unofficial pynvml package).
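
One quick way to check which NVML binding distributions are present in an environment is a snippet like the sketch below (the names are the PyPI distribution names):

from importlib import metadata

# Report which NVML binding distributions are installed. nvitop depends on the
# official nvidia-ml-py; an unofficial 'pynvml' distribution installed alongside
# it may conflict with it, since both provide a module named 'pynvml'.
for dist_name in ('nvidia-ml-py', 'pynvml'):
    try:
        print(f'{dist_name}=={metadata.version(dist_name)}')
    except metadata.PackageNotFoundError:
        print(f'{dist_name}: not installed')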

After running for 2 days:

[screenshot]