Hi @marcreichman-pfi, thanks for raising this. I have encountered the same issue before. I think this is a bug upstream in nvidia-ml-py triggered by an incompatible NVIDIA driver: nvidia-ml-py returns invalid PIDs.
In [1]: import pynvml
In [2]: pynvml.nvmlInit()
In [3]: handle = pynvml.nvmlDeviceGetHandleByIndex(0)
In [4]: [p.pid for p in pynvml.nvmlDeviceGetComputeRunningProcesses(handle)]
Out[4]:
[1184,
0,
4294967295,
4294967295,
16040,
0,
4294967295,
4294967295,
19984,
0,
4294967295,
4294967295,
20884,
0,
4294967295,
4294967295,
26308,
0,
4294967295,
4294967295,
16336,
0,
4294967295,
4294967295,
5368,
0,
4294967295,
4294967295,
19828,
0,
4294967295]
I haven't found a solution for this yet. This may be due to an internal API change in the NVML library. We may need to wait for the next nvidia-ml-py release.
As a temporary workaround, you could downgrade your NVIDIA driver version.
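If downgrading is not an option either, a rough stopgap is to drop the obviously bogus entries on the caller's side (0 and 4294967295, i.e. (uint32)-1). A minimal sketch with plain pynvml, not part of nvitop itself:

import pynvml

INVALID_PIDS = {0, 0xFFFFFFFF}  # 4294967295 == (uint32)-1, an error/placeholder marker

pynvml.nvmlInit()
try:
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
    # Keep only entries whose PID is not one of the placeholder values.
    print([p.pid for p in procs if p.pid not in INVALID_PIDS])
finally:
    pynvml.nvmlShutdown()

This only filters out the placeholder rows; it does not recover the processes that NVML fails to report correctly.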
Hi @XuehaiPan and thanks for your response and excellent tool!
We cannot downgrade because we need newer CUDA version support, so for now we'll just have to wait for an updated version with the NVML library fix.
Hi @marcreichman-pfi, a new release of nvidia-ml-py (version 12.535.77) came out several hours ago. You can upgrade your nvidia-ml-py package with the command:
python3 -m pip install --upgrade nvidia-ml-py
This should resolve the unrecognized PIDs with CUDA 12 drivers. I will also make a new release of nvitop to add CUDA 12 driver support.
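After upgrading, a quick check along these lines can confirm that the new bindings are picked up and that the PID list looks sane again (a sketch; importlib.metadata requires Python 3.8+):

from importlib.metadata import version

import pynvml

print('nvidia-ml-py:', version('nvidia-ml-py'))  # expect >= 12.535.77

pynvml.nvmlInit()
try:
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    pids = [p.pid for p in pynvml.nvmlDeviceGetComputeRunningProcesses(handle)]
    print(pids)  # should no longer contain the 0 / 4294967295 placeholders
finally:
    pynvml.nvmlShutdown()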
Thanks @XuehaiPan - is there a way to do this in the docker version?
@marcreichman-pfi You could upgrade nvidia-ml-py in your Docker container.
Thanks, this did the trick! Here is what I did, based on your Dockerfile:
$ git diff Dockerfile
diff --git a/Dockerfile b/Dockerfile
index c3194cf..96874da 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -32,6 +32,7 @@ COPY . /nvitop
WORKDIR /nvitop
RUN . /venv/bin/activate && \
python3 -m pip install . && \
+ python3 -m pip install --upgrade nvidia-ml-py && \
rm -rf /root/.cache
# Entrypoint
nvitop-1.3.2 with nvidia-ml-py-12.535.161, CUDA 12.2 and Driver Version 535.129.03 also shows No Such Process.
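For follow-up reports like this, it helps to capture the exact versions from inside the environment that actually runs nvitop. A small diagnostic sketch (the package names are the PyPI distribution names):

from importlib.metadata import version

import pynvml

pynvml.nvmlInit()
try:
    print('nvitop      :', version('nvitop'))
    print('nvidia-ml-py:', version('nvidia-ml-py'))
    print('NVML driver :', pynvml.nvmlSystemGetDriverVersion())
finally:
    pynvml.nvmlShutdown()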
Required prerequisites
What version of nvitop are you using?
git hash 4093334972a334e9057f5acf7661a2c1a96bd021
Operating system and version
Docker image (under CentOS 7 host)
NVIDIA driver version
535.54.03
NVIDIA-SMI
Python environment
This is the docker version from the latest git head (6/20/2023)
Problem description
The output shows scrambled PIDs for processes after the initial process in the list for each card, and then shows No Such Process for the wrong PIDs. This only started after the driver update, so I assume something changed in the NVIDIA drivers.
Steps to Reproduce
The Python snippets (if any):
Command lines:
Traceback
No response
Logs
Expected behavior
Prior to the driver update, the information was present for the same PIDs included in nvidia-smi, but with the full command lines and the per-process resource statistics (e.g. GPU PID USER GPU-MEM %SM %CPU %MEM TIME). Now it seems to be having an issue parsing proper PIDs from the NVIDIA libraries, and then failing downstream from there.
Additional context
I'm not much of a Python programmer, unfortunately, so I'm not clear where to dig in, but I'd assume the issue is somewhere in the area of receiving the process list for the cards and deciphering the PIDs. My assumption is that something changed in the driver, or in some structure or class, such that the parsing code broke somewhere.