XuehaiPan / nvitop

An interactive NVIDIA-GPU process viewer and beyond, the one-stop solution for GPU process management.
https://nvitop.readthedocs.io
Apache License 2.0

[Bug] Processes information cannot be obtained normally on 535.98 driver #88

Closed GeekRaw closed 1 year ago

GeekRaw commented 1 year ago

Required prerequisites

Questions

Hello, when I use nvitop on the server, I cannot get the process information normally. Thank you for your answer.

[screenshot of nvitop output]

XuehaiPan commented 1 year ago

Hi @GeekRaw, could you provide some relevant information, such as the nvidia-smi output and the package version list of your Python environment? It would also be helpful to know whether you are running nvitop natively or in a container-like environment. Then we can investigate this issue in more depth.
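
For reference, a minimal sketch (not an official nvitop command) to collect those details from Python:

from importlib.metadata import version

import pynvml

# Query the driver and NVML versions directly from libnvidia-ml.
pynvml.nvmlInit()
print("Driver version:", pynvml.nvmlSystemGetDriverVersion())
print("NVML version:  ", pynvml.nvmlSystemGetNVMLVersion())
pynvml.nvmlShutdown()

# Report the installed versions of the relevant Python packages.
for package in ("nvidia-ml-py", "nvitop"):
    print(package, version(package))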

cfroehli commented 1 year ago

Hello,

If it may help: we noticed the same behavior recently too, after we upgraded our driver version (currently on 535.86.10, Ubuntu 20.04, CUDA 12.2). The card model does not seem to be relevant. The load and the chart at the top match the nvidia-smi output, but the process list is broken. nvidia-smi itself is able to show the actual processes. The install is a basic python3 venv on the actual server; no container is involved.

Depending on the tty refresh/timing, it is possible to see an error getting printed (it often gets overwritten, so it is easy to miss):

ERROR: A FunctionNotFound error occurred while calling nvmlQuery(<function nvmlDeviceGetGraphicsRunningProcesses at 0x7f08ff962940>, *args, **kwargs). Please verify whether the nvidia-ml-py package is compatible with your NVIDIA driver version.

I guess some API changed in a recent nvidia-ml version.

Package       Version
------------- ---------
cachetools    5.3.1
nvidia-ml-py  12.535.77
nvitop        1.2.0
pip           23.2.1
pkg_resources 0.0.0
psutil        5.9.5
setuptools    68.0.0
termcolor     2.3.0
wheel         0.41.0

Downgrading nvidia-ml-py to some of the latest 11.* versions didn't help.

$ python
Python 3.8.10 (default, May 26 2023, 14:05:08) 
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from pynvml import *
>>> nvmlInit()
>>> nvmlSystemGetDriverVersion()
'535.86.10'
>>> handle = nvmlDeviceGetHandleByIndex(0)
>>> nvmlDeviceGetComputeRunningProcesses(handle)
Traceback (most recent call last):
  File "/opt/gpu-tools/lib/python3.8/site-packages/pynvml.py", line 913, in _nvmlGetFunctionPointer
    _nvmlGetFunctionPointer_cache[name] = getattr(nvmlLib, name)
  File "/usr/lib/python3.8/ctypes/__init__.py", line 386, in __getattr__
    func = self.__getitem__(name)
  File "/usr/lib/python3.8/ctypes/__init__.py", line 391, in __getitem__
    func = self._FuncPtr((name_or_ordinal, self))
AttributeError: /lib/x86_64-linux-gnu/libnvidia-ml.so.1: undefined symbol: nvmlDeviceGetComputeRunningProcesses_v3

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/gpu-tools/lib/python3.8/site-packages/pynvml.py", line 2775, in nvmlDeviceGetComputeRunningProcesses
    return nvmlDeviceGetComputeRunningProcesses_v3(handle);
  File "/opt/gpu-tools/lib/python3.8/site-packages/pynvml.py", line 2741, in nvmlDeviceGetComputeRunningProcesses_v3
    fn = _nvmlGetFunctionPointer("nvmlDeviceGetComputeRunningProcesses_v3")
  File "/opt/gpu-tools/lib/python3.8/site-packages/pynvml.py", line 916, in _nvmlGetFunctionPointer
    raise NVMLError(NVML_ERROR_FUNCTION_NOT_FOUND)
pynvml.NVMLError_FunctionNotFound: Function Not Found

$ nm -gD /lib/x86_64-linux-gnu/libnvidia-ml.so.1 | grep Running
000000000006cfc0 T nvmlDeviceGetComputeRunningProcesses
000000000006d1b0 T nvmlDeviceGetComputeRunningProcesses_v2
000000000006d3a0 T nvmlDeviceGetGraphicsRunningProcesses
000000000006d590 T nvmlDeviceGetGraphicsRunningProcesses_v2
000000000006d780 T nvmlDeviceGetMPSComputeRunningProcesses
000000000006d970 T nvmlDeviceGetMPSComputeRunningProcesses_v2

It seems the _v3 symbols are not there anymore, but the Python bindings keep using them?
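
For reference, the same check as the nm command above can also be done from Python with ctypes; a small sketch, assuming libnvidia-ml.so.1 is loadable from the default library search path:

import ctypes

# Load the driver's NVML library and probe the versioned process-listing
# entry points; a missing symbol raises AttributeError.
lib = ctypes.CDLL("libnvidia-ml.so.1")
for name in (
    "nvmlDeviceGetComputeRunningProcesses",
    "nvmlDeviceGetComputeRunningProcesses_v2",
    "nvmlDeviceGetComputeRunningProcesses_v3",
):
    try:
        getattr(lib, name)
        print(name, "-> exported")
    except AttributeError:
        print(name, "-> missing")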

XuehaiPan commented 1 year ago

@cfroehli Thanks for the feedback! This is due to poor version management for the NVML library.

The v3 APIs were introduced in the 510.39.01 driver: https://github.com/NVIDIA/nvidia-settings/commit/b2f0e7f437c42d92ed58120ec8d880f5f4b90d60

but they were removed in the 535.98 driver: https://github.com/NVIDIA/nvidia-settings/commit/0cb3beffa0cb8a1f8cb405291b11a1e2eb7a4786


Version change:

495.46 -> 510.39.01: https://github.com/NVIDIA/nvidia-settings/commit/b2f0e7f437c42d92ed58120ec8d880f5f4b90d60

530.41.03 -> 535.43.02: https://github.com/NVIDIA/nvidia-settings/commit/39c3e28e84f3ffb034abaf1ae92dbb570c207d05

535.86.05 -> 535.98: https://github.com/NVIDIA/nvidia-settings/commit/0cb3beffa0cb8a1f8cb405291b11a1e2eb7a4786

UPDATE:

535.98 -> 535.104.05: https://github.com/NVIDIA/nvidia-settings/commit/74cae7fa6a3da595a1bd87918ef0a67bb4326925
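
For anyone curious, here is a simplified sketch (not the actual code in the fix-process-api PR) of the kind of fallback needed: try the newer versioned pynvml binding first and fall back to the older one when the driver does not export the corresponding symbol.

import pynvml

def get_compute_running_processes(handle):
    # Prefer the _v3 Python binding, then the _v2 one, skipping any binding
    # that is missing from the installed nvidia-ml-py package.
    candidates = (
        getattr(pynvml, "nvmlDeviceGetComputeRunningProcesses_v3", None),
        getattr(pynvml, "nvmlDeviceGetComputeRunningProcesses_v2", None),
    )
    for query in candidates:
        if query is None:
            continue
        try:
            return query(handle)
        except pynvml.NVMLError_FunctionNotFound:
            # The driver's libnvidia-ml does not export this entry point.
            continue
    raise pynvml.NVMLError(pynvml.NVML_ERROR_FUNCTION_NOT_FOUND)

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
print(get_compute_running_processes(handle))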

XuehaiPan commented 1 year ago

Hi @cfroehli @GeekRaw, I created a new PR to resolve this. You could try:

pip3 install git+https://github.com/XuehaiPan/nvitop.git@fix-process-api

Let me know if this works for you.

cfroehli commented 1 year ago

That fixes the process listing in my case. (thanks for the quick fix and the nice tool btw)

XuehaiPan commented 1 year ago

@cfroehli Thanks for the feedback. A new version with the fix will be released soon.

XuehaiPan commented 1 year ago

Hi, the NVIDIA driver upstream re-added the v3 APIs in the latest driver release:

535.98 -> 535.104.05: https://github.com/NVIDIA/nvidia-settings/commit/74cae7fa6a3da595a1bd87918ef0a67bb4326925

nvitop 1.2.0 will work fine if you upgrade your NVIDIA driver to 535.104.05.
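
For reference, a quick sketch (not part of nvitop) to check whether the driver reported by NVML is at or above 535.104.05:

import pynvml

def as_tuple(version_string):
    # "535.104.05" -> (535, 104, 5) for a simple numeric comparison.
    return tuple(int(part) for part in version_string.split("."))

pynvml.nvmlInit()
driver_version = pynvml.nvmlSystemGetDriverVersion()
pynvml.nvmlShutdown()

print(driver_version, ">= 535.104.05:", as_tuple(driver_version) >= as_tuple("535.104.05"))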