XuehaiPan / nvitop

An interactive NVIDIA-GPU process viewer and beyond, the one-stop solution for GPU process management.
https://nvitop.readthedocs.io
Apache License 2.0

[BUG] PIDs are scrambled and `No Such Process` is printed since update to NVIDIA drivers #75

Closed: marcreichman-pfi closed this issue 1 year ago

marcreichman-pfi commented 1 year ago

Required prerequisites

What version of nvitop are you using?

git hash 4093334972a334e9057f5acf7661a2c1a96bd021

Operating system and version

Docker image (under a CentOS 7 host)

NVIDIA driver version

535.54.03

NVIDIA-SMI

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1080 Ti     On  | 00000000:02:00.0 Off |                  N/A |
| 23%   45C    P2              57W / 250W |   2658MiB / 11264MiB |     32%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce GTX 1080 Ti     On  | 00000000:82:00.0 Off |                  N/A |
| 24%   45C    P2              55W / 250W |   3430MiB / 11264MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2863      C   /opt/deepdetect/build/main/dede            1656MiB |
|    0   N/A  N/A      4520      C   /opt/deepdetect/build/main/dede             368MiB |
|    0   N/A  N/A      5001      C   /opt/deepdetect/build/main/dede             630MiB |
|    1   N/A  N/A      3267      C   /opt/deepdetect/build/main/dede             438MiB |
|    1   N/A  N/A      3675      C   /opt/deepdetect/build/main/dede             308MiB |
|    1   N/A  N/A      4072      C   /opt/deepdetect/build/main/dede            2314MiB |
|    1   N/A  N/A      5565      C   /opt/deepdetect/build/main/dede             366MiB |
+---------------------------------------------------------------------------------------+

Python environment

This is the Docker image built from the latest git head (6/20/2023).

$ sudo docker run -it --rm --runtime=nvidia --gpus=all --pid=host --entrypoint /bin/bash nvitop:4093334972a334e9057f5acf7661a2c1a96bd021
(venv) root@ad4380048e10:/nvitop# python3 -m pip freeze | python3 -c 'import sys; print(sys.version, sys.platform); print("".join(filter(lambda s: any(word in s.lower() for word in ("nvi", "cuda", "nvml", "gpu")), sys.stdin)))'
3.8.10 (default, May 26 2023, 14:05:08)
[GCC 9.4.0] linux
nvidia-ml-py==11.525.112
nvitop @ file:///nvitop

(venv) root@ad4380048e10:/nvitop#

Problem description

The output shows scrambled PIDs for every process after the first one in each card's list, and then prints No Such Process for those wrong PIDs. This only started after the driver update, so I assume something changed in the NVIDIA drivers.

Steps to Reproduce

The Python snippets (if any):

Command lines:

$ sudo docker run -it --rm --runtime=nvidia --gpus=all --pid=host nvitop:4093334972a334e9057f5acf7661a2c1a96bd021 --once
Tue Jun 20 18:35:07 2023
╒═════════════════════════════════════════════════════════════════════════════╕
│ NVITOP 1.1.2       Driver Version: 535.54.03      CUDA Driver Version: 12.2 │
├───────────────────────────────┬──────────────────────┬──────────────────────┤
│ GPU  Name        Persistence-M│ Bus-Id        Disp.A │ Volatile Uncorr. ECC │
│ Fan  Temp  Perf  Pwr:Usage/Cap│         Memory-Usage │ GPU-Util  Compute M. │
╞═══════════════════════════════╪══════════════════════╪══════════════════════╪════════════════════════════════════════════════════════════════════╕
│   0  ..orce GTX 1080 Ti  On   │ 00000000:02:00.0 Off │                  N/A │ MEM: █████████████▍ 23.5%                                          │
│ 28%   42C    P8     9W / 250W │   2650MiB / 11264MiB │      0%      Default │ UTL: ▏ 0%                                                          │
├───────────────────────────────┼──────────────────────┼──────────────────────┼────────────────────────────────────────────────────────────────────┤
│   1  ..orce GTX 1080 Ti  On   │ 00000000:82:00.0 Off │                  N/A │ MEM: █████████████████▍ 30.5%                                      │
│ 29%   44C    P8    10W / 250W │   3430MiB / 11264MiB │      0%      Default │ UTL: ▏ 0%                                                          │
╘═══════════════════════════════╧══════════════════════╧══════════════════════╧════════════════════════════════════════════════════════════════════╛
[ CPU: ██████████████████████████████████████████████████████████████████████████████████████████████████ MAX ]  ( Load Average: 71.03 39.83 35.33 )
[ MEM: ███████████████▊ 16.1%                                                                   USED: 9.49GiB ]  [ SWP: ▏ 0.0%                     ]

╒══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╕
│ Processes:                                                                                                                     root@2f027c15efb1 │
│ GPU     PID      USER  GPU-MEM %SM  %CPU  %MEM     TIME  COMMAND                                                                                 │
╞══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
│   0    2863 C    1000  1648MiB   0   0.0   1.8  2:15:55  /opt/deepdetect/build/main/dede -host 0.0.0.0 -port 8080 a652c745cc9b placeshybrid      │
│   0       0 C     N/A     4KiB   0   N/A   N/A      N/A  No Such Process                                                                         │
│   0 429496. C     N/A       0B   0   N/A   N/A      N/A  No Such Process                                                                         │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│   1    3267 C    1000   438MiB N/A   0.0   1.3  2:15:18  /opt/deepdetect/build/main/dede -host 0.0.0.0 -port 8080 bf55e7b22839 inceptionresnetv2 │
│   1       0 C     N/A     4KiB N/A   N/A   N/A      N/A  No Such Process                                                                         │
│   1 242640. C     N/A      N/A N/A   N/A   N/A      N/A  No Such Process                                                                         │
│   1 429496. C     N/A       0B N/A   N/A   N/A      N/A  No Such Process                                                                         │
╘══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╛

Traceback

No response

Logs

$ sudo docker run -it --rm --runtime=nvidia --gpus=all --pid=host -e LOGLEVEL=debug nvitop:4093334972a334e9057f5acf7661a2c1a96bd021 --once
[DEBUG] 2023-06-20 18:35:57,178 nvitop.api.libnvml::nvmlDeviceGetMemoryInfo: NVML memory info version 2 is available.
Tue Jun 20 18:35:57 2023
╒═════════════════════════════════════════════════════════════════════════════╕
│ NVITOP 1.1.2       Driver Version: 535.54.03      CUDA Driver Version: 12.2 │
├───────────────────────────────┬──────────────────────┬──────────────────────┤
│ GPU  Name        Persistence-M│ Bus-Id        Disp.A │ Volatile Uncorr. ECC │
│ Fan  Temp  Perf  Pwr:Usage/Cap│         Memory-Usage │ GPU-Util  Compute M. │
╞═══════════════════════════════╪══════════════════════╪══════════════════════╪════════════════════════════════════════════════════════════════════╕
│   0  ..orce GTX 1080 Ti  On   │ 00000000:02:00.0 Off │                  N/A │ MEM: █████████████▍ 23.5%                                          │
│ 24%   35C    P8     8W / 250W │   2650MiB / 11264MiB │      0%      Default │ UTL: ▏ 0%                                                          │
├───────────────────────────────┼──────────────────────┼──────────────────────┼────────────────────────────────────────────────────────────────────┤
│   1  ..orce GTX 1080 Ti  On   │ 00000000:82:00.0 Off │                  N/A │ MEM: █████████████████▍ 30.5%                                      │
│ 25%   36C    P8     9W / 250W │   3430MiB / 11264MiB │      0%      Default │ UTL: ▏ 0%                                                          │
╘═══════════════════════════════╧══════════════════════╧══════════════════════╧════════════════════════════════════════════════════════════════════╛
[ CPU: █████████████████████████████████████████████████████████████████████████████████████████████████▊ MAX ]  ( Load Average: 84.50 48.19 38.40 )
[ MEM: ███████████████▋ 15.9%                                                                   USED: 9.36GiB ]  [ SWP: ▏ 0.0%                     ]

╒══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╕
│ Processes:                                                                                                                     root@333a2a93dbb1 │
│ GPU     PID      USER  GPU-MEM %SM  %CPU  %MEM     TIME  COMMAND                                                                                 │
╞══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
│   0    2863 C    1000  1648MiB   0   0.0   1.8  2:16:45  /opt/deepdetect/build/main/dede -host 0.0.0.0 -port 8080 a652c745cc9b placeshybrid      │
│   0       0 C     N/A     4KiB   0   N/A   N/A      N/A  No Such Process                                                                         │
│   0 429496. C     N/A       0B   0   N/A   N/A      N/A  No Such Process                                                                         │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│   1    3267 C    1000   438MiB N/A   0.0   1.3  2:16:08  /opt/deepdetect/build/main/dede -host 0.0.0.0 -port 8080 bf55e7b22839 inceptionresnetv2 │
│   1       0 C     N/A     4KiB N/A   N/A   N/A      N/A  No Such Process                                                                         │
│   1 242640. C     N/A      N/A N/A   N/A   N/A      N/A  No Such Process                                                                         │
│   1 429496. C     N/A       0B N/A   N/A   N/A      N/A  No Such Process                                                                         │
╘══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╛

Expected behavior

Prior to the driver update, the information was shown for the same PIDs that nvidia-smi lists, along with full command lines and per-process resource statistics (GPU, PID, USER, GPU-MEM, %SM, %CPU, %MEM, TIME). Now nvitop seems to have trouble parsing the proper PIDs from the NVIDIA libraries, and everything downstream of that fails.

Additional context

I'm not much of a Python programmer, unfortunately, so I'm not sure where to dig in, but I'd assume the issue is somewhere in the code that receives the process list for each card and decodes the PIDs. My guess is that something changed in the driver, or in some structure or class it returns, and the parsing code broke as a result.

XuehaiPan commented 1 year ago

Hi @marcreichman-pfi, thanks for raising this. I have encountered the same issue before. I believe this is an upstream bug in nvidia-ml-py triggered by an incompatibility with the new NVIDIA driver: nvidia-ml-py returns invalid PIDs.

In [1]: import pynvml

In [2]: pynvml.nvmlInit()

In [3]: handle = pynvml.nvmlDeviceGetHandleByIndex(0)

In [4]: [p.pid for p in pynvml.nvmlDeviceGetComputeRunningProcesses(handle)]
Out[4]:
[1184,
 0,
 4294967295,
 4294967295,
 16040,
 0,
 4294967295,
 4294967295,
 19984,
 0,
 4294967295,
 4294967295,
 20884,
 0,
 4294967295,
 4294967295,
 26308,
 0,
 4294967295,
 4294967295,
 16336,
 0,
 4294967295,
 4294967295,
 5368,
 0,
 4294967295,
 4294967295,
 19828,
 0,
 4294967295]

I haven't found a solution for this yet. This may be due to an internal API change in the NVML library. We may need to wait for the next nvidia-ml-py release.
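
For reference, the pattern of values above is consistent with a mismatch between the per-process record layout the new driver returns and the one nvidia-ml-py 11.525 expects: every fourth entry is a plausible PID, and the remaining entries are zeros or 0xFFFFFFFF (4294967295), which look like other fields of a larger struct being misread as PIDs. A small illustration (not nvitop code, just regrouping the output above):

# Regroup the raw entries from the snippet above in fours: the first slot of
# each group is a real PID, the rest look like padding (0) and "N/A"
# sentinels (0xFFFFFFFF) from a larger per-process record.
values = [1184, 0, 4294967295, 4294967295,
          16040, 0, 4294967295, 4294967295,
          19984, 0, 4294967295, 4294967295]

for i in range(0, len(values), 4):
    pid, *rest = values[i:i + 4]
    print(f"pid={pid:<6d} trailing fields: {[hex(v) for v in rest]}")
# pid=1184   trailing fields: ['0x0', '0xffffffff', '0xffffffff']
# pid=16040  trailing fields: ['0x0', '0xffffffff', '0xffffffff']
# pid=19984  trailing fields: ['0x0', '0xffffffff', '0xffffffff']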

As a temporary workaround, you could downgrade your NVIDIA driver version.


marcreichman-pfi commented 1 year ago

Hi @XuehaiPan and thanks for your response and excellent tool!

We cannot downgrade because we need newer CUDA version support, so for now we'll just have to wait for an updated version with the NVML library fix.

XuehaiPan commented 1 year ago

Hi @marcreichman-pfi, a new release of nvidia-ml-py, version 12.535.77, came out several hours ago. You can upgrade your nvidia-ml-py package with:

python3 -m pip install --upgrade nvidia-ml-py

This should resolve the unrecognized PIDs with CUDA 12 drivers.

I will also make a new release of nvitop to address CUDA 12 driver support.
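
If you want to verify the upgrade took effect, re-running the same query as in my earlier comment should now return only real PIDs that match nvidia-smi. A minimal check (same pynvml calls as above):

import pynvml

pynvml.nvmlInit()
try:
    # One line per GPU: the compute-process PID list should match nvidia-smi.
    for index in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(index)
        pids = [p.pid for p in pynvml.nvmlDeviceGetComputeRunningProcesses(handle)]
        print(f"GPU {index}: {pids}")
finally:
    pynvml.nvmlShutdown()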

marcreichman-pfi commented 1 year ago

Thanks @XuehaiPan - is there a way to do this in the docker version?

XuehaiPan commented 1 year ago

> Thanks @XuehaiPan - is there a way to do this in the docker version?

@marcreichman-pfi You could upgrade nvidia-ml-py in your docker container.

marcreichman-pfi commented 1 year ago

Thanks, this did the trick! Here's what I changed in your Dockerfile:

$ git diff Dockerfile
diff --git a/Dockerfile b/Dockerfile
index c3194cf..96874da 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -32,6 +32,7 @@ COPY . /nvitop
 WORKDIR /nvitop
 RUN . /venv/bin/activate && \
   python3 -m pip install . && \
+  python3 -m pip install --upgrade nvidia-ml-py && \
   rm -rf /root/.cache

 # Entrypoint
ukejeb commented 2 months ago

nvitop 1.3.2 with nvidia-ml-py 12.535.161, CUDA 12.2, and driver version 535.129.03 also shows No Such Process.