XuehaiPan / nvitop

An interactive NVIDIA-GPU process viewer and beyond, the one-stop solution for GPU process management.
https://nvitop.readthedocs.io
Apache License 2.0

[BUG] PIDs are scrambled and `No Such Process` is printed since update to NVIDIA drivers #75

Closed: marcreichman-pfi closed this issue 1 year ago

marcreichman-pfi commented 1 year ago

Required prerequisites

What version of nvitop are you using?

git hash 4093334972a334e9057f5acf7661a2c1a96bd021

Operating system and version

Docker image (under a CentOS 7 host)

NVIDIA driver version

535.54.03

NVIDIA-SMI

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1080 Ti     On  | 00000000:02:00.0 Off |                  N/A |
| 23%   45C    P2              57W / 250W |   2658MiB / 11264MiB |     32%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce GTX 1080 Ti     On  | 00000000:82:00.0 Off |                  N/A |
| 24%   45C    P2              55W / 250W |   3430MiB / 11264MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2863      C   /opt/deepdetect/build/main/dede            1656MiB |
|    0   N/A  N/A      4520      C   /opt/deepdetect/build/main/dede             368MiB |
|    0   N/A  N/A      5001      C   /opt/deepdetect/build/main/dede             630MiB |
|    1   N/A  N/A      3267      C   /opt/deepdetect/build/main/dede             438MiB |
|    1   N/A  N/A      3675      C   /opt/deepdetect/build/main/dede             308MiB |
|    1   N/A  N/A      4072      C   /opt/deepdetect/build/main/dede            2314MiB |
|    1   N/A  N/A      5565      C   /opt/deepdetect/build/main/dede             366MiB |
+---------------------------------------------------------------------------------------+

Python environment

This is the Docker image built from the latest git head (6/20/2023).

$ sudo docker run -it --rm --runtime=nvidia --gpus=all --pid=host --entrypoint /bin/bash nvitop:4093334972a334e9057f5acf7661a2c1a96bd021
(venv) root@ad4380048e10:/nvitop# python3 -m pip freeze | python3 -c 'import sys; print(sys.version, sys.platform); print("".join(filter(lambda s: any(word in s.lower() for word in ("nvi", "cuda", "nvml", "gpu")), sys.stdin)))'
3.8.10 (default, May 26 2023, 14:05:08)
[GCC 9.4.0] linux
nvidia-ml-py==11.525.112
nvitop @ file:///nvitop

(venv) root@ad4380048e10:/nvitop#

Problem description

The output shows scrambled PIDs for every process after the first one in each card's list, and then prints No Such Process for those wrong PIDs. This only started after the driver update, so I assume something changed in the NVIDIA drivers.

Steps to Reproduce

The Python snippets (if any):

Command lines:

$ sudo docker run -it --rm --runtime=nvidia --gpus=all --pid=host nvitop:4093334972a334e9057f5acf7661a2c1a96bd021 --once
Tue Jun 20 18:35:07 2023
╒═════════════════════════════════════════════════════════════════════════════╕
│ NVITOP 1.1.2       Driver Version: 535.54.03      CUDA Driver Version: 12.2 │
├───────────────────────────────┬──────────────────────┬──────────────────────┤
│ GPU  Name        Persistence-M│ Bus-Id        Disp.A │ Volatile Uncorr. ECC │
│ Fan  Temp  Perf  Pwr:Usage/Cap│         Memory-Usage │ GPU-Util  Compute M. │
╞═══════════════════════════════╪══════════════════════╪══════════════════════╪════════════════════════════════════════════════════════════════════╕
│   0  ..orce GTX 1080 Ti  On   │ 00000000:02:00.0 Off │                  N/A │ MEM: █████████████▍ 23.5%                                          │
│ 28%   42C    P8     9W / 250W │   2650MiB / 11264MiB │      0%      Default │ UTL: ▏ 0%                                                          │
├───────────────────────────────┼──────────────────────┼──────────────────────┼────────────────────────────────────────────────────────────────────┤
│   1  ..orce GTX 1080 Ti  On   │ 00000000:82:00.0 Off │                  N/A │ MEM: █████████████████▍ 30.5%                                      │
│ 29%   44C    P8    10W / 250W │   3430MiB / 11264MiB │      0%      Default │ UTL: ▏ 0%                                                          │
╘═══════════════════════════════╧══════════════════════╧══════════════════════╧════════════════════════════════════════════════════════════════════╛
[ CPU: ██████████████████████████████████████████████████████████████████████████████████████████████████ MAX ]  ( Load Average: 71.03 39.83 35.33 )
[ MEM: ███████████████▊ 16.1%                                                                   USED: 9.49GiB ]  [ SWP: ▏ 0.0%                     ]

╒══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╕
│ Processes:                                                                                                                     root@2f027c15efb1 │
│ GPU     PID      USER  GPU-MEM %SM  %CPU  %MEM     TIME  COMMAND                                                                                 │
╞══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
│   0    2863 C    1000  1648MiB   0   0.0   1.8  2:15:55  /opt/deepdetect/build/main/dede -host 0.0.0.0 -port 8080 a652c745cc9b placeshybrid      │
│   0       0 C     N/A     4KiB   0   N/A   N/A      N/A  No Such Process                                                                         │
│   0 429496. C     N/A       0B   0   N/A   N/A      N/A  No Such Process                                                                         │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│   1    3267 C    1000   438MiB N/A   0.0   1.3  2:15:18  /opt/deepdetect/build/main/dede -host 0.0.0.0 -port 8080 bf55e7b22839 inceptionresnetv2 │
│   1       0 C     N/A     4KiB N/A   N/A   N/A      N/A  No Such Process                                                                         │
│   1 242640. C     N/A      N/A N/A   N/A   N/A      N/A  No Such Process                                                                         │
│   1 429496. C     N/A       0B N/A   N/A   N/A      N/A  No Such Process                                                                         │
╘══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╛

Traceback

No response

Logs

$ sudo docker run -it --rm --runtime=nvidia --gpus=all --pid=host -e LOGLEVEL=debug nvitop:4093334972a334e9057f5acf7661a2c1a96bd021 --once
[DEBUG] 2023-06-20 18:35:57,178 nvitop.api.libnvml::nvmlDeviceGetMemoryInfo: NVML memory info version 2 is available.
Tue Jun 20 18:35:57 2023
╒═════════════════════════════════════════════════════════════════════════════╕
│ NVITOP 1.1.2       Driver Version: 535.54.03      CUDA Driver Version: 12.2 │
├───────────────────────────────┬──────────────────────┬──────────────────────┤
│ GPU  Name        Persistence-M│ Bus-Id        Disp.A │ Volatile Uncorr. ECC │
│ Fan  Temp  Perf  Pwr:Usage/Cap│         Memory-Usage │ GPU-Util  Compute M. │
╞═══════════════════════════════╪══════════════════════╪══════════════════════╪════════════════════════════════════════════════════════════════════╕
│   0  ..orce GTX 1080 Ti  On   │ 00000000:02:00.0 Off │                  N/A │ MEM: █████████████▍ 23.5%                                          │
│ 24%   35C    P8     8W / 250W │   2650MiB / 11264MiB │      0%      Default │ UTL: ▏ 0%                                                          │
├───────────────────────────────┼──────────────────────┼──────────────────────┼────────────────────────────────────────────────────────────────────┤
│   1  ..orce GTX 1080 Ti  On   │ 00000000:82:00.0 Off │                  N/A │ MEM: █████████████████▍ 30.5%                                      │
│ 25%   36C    P8     9W / 250W │   3430MiB / 11264MiB │      0%      Default │ UTL: ▏ 0%                                                          │
╘═══════════════════════════════╧══════════════════════╧══════════════════════╧════════════════════════════════════════════════════════════════════╛
[ CPU: █████████████████████████████████████████████████████████████████████████████████████████████████▊ MAX ]  ( Load Average: 84.50 48.19 38.40 )
[ MEM: ███████████████▋ 15.9%                                                                   USED: 9.36GiB ]  [ SWP: ▏ 0.0%                     ]

╒══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╕
│ Processes:                                                                                                                     root@333a2a93dbb1 │
│ GPU     PID      USER  GPU-MEM %SM  %CPU  %MEM     TIME  COMMAND                                                                                 │
╞══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
│   0    2863 C    1000  1648MiB   0   0.0   1.8  2:16:45  /opt/deepdetect/build/main/dede -host 0.0.0.0 -port 8080 a652c745cc9b placeshybrid      │
│   0       0 C     N/A     4KiB   0   N/A   N/A      N/A  No Such Process                                                                         │
│   0 429496. C     N/A       0B   0   N/A   N/A      N/A  No Such Process                                                                         │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│   1    3267 C    1000   438MiB N/A   0.0   1.3  2:16:08  /opt/deepdetect/build/main/dede -host 0.0.0.0 -port 8080 bf55e7b22839 inceptionresnetv2 │
│   1       0 C     N/A     4KiB N/A   N/A   N/A      N/A  No Such Process                                                                         │
│   1 242640. C     N/A      N/A N/A   N/A   N/A      N/A  No Such Process                                                                         │
│   1 429496. C     N/A       0B N/A   N/A   N/A      N/A  No Such Process                                                                         │
╘══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╛

Expected behavior

Prior to the driver update, the information was shown for the same PIDs that nvidia-smi lists, along with full command lines and per-process resource statistics (GPU, PID, USER, GPU-MEM, %SM, %CPU, %MEM, TIME). Now nvitop seems to have trouble parsing the proper PIDs from the NVIDIA libraries, and everything downstream of that fails.

Additional context

I'm not much of a Python programmer, unfortunately, so I'm not sure where to dig in, but I'd assume the issue is somewhere in the code that receives the process list for each card and decodes the PIDs. My guess is that something changed in the driver, or in some structure or class it returns, and the parsing code broke as a result.

XuehaiPan commented 1 year ago

Hi @marcreichman-pfi, thanks for raising this. I have encountered the same issue before. I believe this is an upstream bug in nvidia-ml-py triggered by an incompatibility with the new NVIDIA driver: nvidia-ml-py returns invalid PIDs.

In [1]: import pynvml

In [2]: pynvml.nvmlInit()

In [3]: handle = pynvml.nvmlDeviceGetHandleByIndex(0)

In [4]: [p.pid for p in pynvml.nvmlDeviceGetComputeRunningProcesses(handle)]
Out[4]:
[1184,
 0,
 4294967295,
 4294967295,
 16040,
 0,
 4294967295,
 4294967295,
 19984,
 0,
 4294967295,
 4294967295,
 20884,
 0,
 4294967295,
 4294967295,
 26308,
 0,
 4294967295,
 4294967295,
 16336,
 0,
 4294967295,
 4294967295,
 5368,
 0,
 4294967295,
 4294967295,
 19828,
 0,
 4294967295]

I haven't found a solution for this yet. This may be due to an internal API change in the NVML library. We may need to wait for the next nvidia-ml-py release.
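
For reference, the pattern of values above is consistent with a mismatch between the per-process record layout the new driver returns and the one nvidia-ml-py 11.525 expects: every fourth entry is a plausible PID, and the remaining entries are zeros or 0xFFFFFFFF (4294967295), which look like other fields of a larger struct being misread as PIDs. A small illustration (not nvitop code, just regrouping the output above):

# Regroup the raw entries from the snippet above in fours: the first slot of
# each group is a real PID, the rest look like padding (0) and "N/A"
# sentinels (0xFFFFFFFF) from a larger per-process record.
values = [1184, 0, 4294967295, 4294967295,
          16040, 0, 4294967295, 4294967295,
          19984, 0, 4294967295, 4294967295]

for i in range(0, len(values), 4):
    pid, *rest = values[i:i + 4]
    print(f"pid={pid:<6d} trailing fields: {[hex(v) for v in rest]}")
# pid=1184   trailing fields: ['0x0', '0xffffffff', '0xffffffff']
# pid=16040  trailing fields: ['0x0', '0xffffffff', '0xffffffff']
# pid=19984  trailing fields: ['0x0', '0xffffffff', '0xffffffff']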

As a temporary workaround, you could downgrade your NVIDIA driver version.


marcreichman-pfi commented 1 year ago

Hi @XuehaiPan and thanks for your response and excellent tool!

We cannot downgrade because we need newer CUDA version support, so for now we'll just have to wait for an updated version with the NVML library fix.

XuehaiPan commented 1 year ago

Hi @marcreichman-pfi, a new release of nvidia-ml-py, version 12.535.77, came out several hours ago. You can upgrade your nvidia-ml-py package with:

python3 -m pip install --upgrade nvidia-ml-py

This should resolve the unrecognized PIDs with CUDA 12 drivers.

I will also make a new release of nvitop to address CUDA 12 driver support.
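
If you want to verify the upgrade took effect, re-running the same query as in my earlier comment should now return only real PIDs that match nvidia-smi. A minimal check (same pynvml calls as above):

import pynvml

pynvml.nvmlInit()
try:
    # One line per GPU: the compute-process PID list should match nvidia-smi.
    for index in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(index)
        pids = [p.pid for p in pynvml.nvmlDeviceGetComputeRunningProcesses(handle)]
        print(f"GPU {index}: {pids}")
finally:
    pynvml.nvmlShutdown()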

marcreichman-pfi commented 1 year ago

Thanks @XuehaiPan - is there a way to do this in the docker version?

XuehaiPan commented 1 year ago

> Thanks @XuehaiPan - is there a way to do this in the docker version?

@marcreichman-pfi You could upgrade nvidia-ml-py in your docker container.

marcreichman-pfi commented 1 year ago

Thanks, this did the trick! Here's what I changed in your Dockerfile:

$ git diff Dockerfile
diff --git a/Dockerfile b/Dockerfile
index c3194cf..96874da 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -32,6 +32,7 @@ COPY . /nvitop
 WORKDIR /nvitop
 RUN . /venv/bin/activate && \
   python3 -m pip install . && \
+  python3 -m pip install --upgrade nvidia-ml-py && \
   rm -rf /root/.cache

 # Entrypoint
ukejeb commented 2 months ago

nvitop 1.3.2 with nvidia-ml-py 12.535.161, CUDA 12.2, and driver version 535.129.03 also shows No Such Process.