NVIDIA / k8s-device-plugin

NVIDIA device plugin for Kubernetes
Apache License 2.0

Function not found for nvml methods #331

Open · kbkartik opened this issue 1 year ago

kbkartik commented 1 year ago

I have a GTX 1080 Ti GPU with CUDA 10.2, NVIDIA driver 440.59, and pynvml 11.4.1, running on Ubuntu 16.04.

1. I am trying to profile my PyTorch code using scalene. When I run my code as `scalene main.py`, I get the following error:

Error in program being profiled:
 Function Not Found
Traceback (most recent call last):
  File "/home/kube-admin/miniconda3/envs/temporl/lib/python3.8/site-packages/pynvml/nvml.py", line 782, in _nvmlGetFunctionPointer
    _nvmlGetFunctionPointer_cache[name] = getattr(nvmlLib, name)
  File "/home/kube-admin/miniconda3/envs/temporl/lib/python3.8/ctypes/__init__.py", line 386, in __getattr__
    func = self.__getitem__(name)
  File "/home/kube-admin/miniconda3/envs/temporl/lib/python3.8/ctypes/__init__.py", line 391, in __getitem__
    func = self._FuncPtr((name_or_ordinal, self))
AttributeError: /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1: undefined symbol: nvmlDeviceGetComputeRunningProcesses_v2

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/kube-admin/miniconda3/envs/temporl/lib/python3.8/site-packages/scalene/scalene_profiler.py", line 1612, in profile_code
    exec(code, the_globals, the_locals)
  File "./code-exp/main.py", line 1, in <module>
    import numpy as np
  File "/home/kube-admin/miniconda3/envs/temporl/lib/python3.8/site-packages/numpy/__init__.py", line 140, in <module>
    from . import core
  File "/home/kube-admin/miniconda3/envs/temporl/lib/python3.8/site-packages/numpy/core/__init__.py", line 22, in <module>
    from . import multiarray
  File "/home/kube-admin/miniconda3/envs/temporl/lib/python3.8/site-packages/numpy/core/multiarray.py", line 12, in <module>
    from . import overrides
  File "/home/kube-admin/miniconda3/envs/temporl/lib/python3.8/site-packages/numpy/core/overrides.py", line 9, in <module>
    from numpy.compat._inspect import getargspec
  File "/home/kube-admin/miniconda3/envs/temporl/lib/python3.8/site-packages/numpy/compat/__init__.py", line 14, in <module>
    from .py3k import *
  File "/home/kube-admin/miniconda3/envs/temporl/lib/python3.8/site-packages/scalene/scalene_profiler.py", line 719, in cpu_signal_handler
    (gpu_load, gpu_mem_used) = Scalene.__gpu.get_stats()
  File "/home/kube-admin/miniconda3/envs/temporl/lib/python3.8/site-packages/scalene/scalene_gpu.py", line 110, in get_stats
    mem_used = self.gpu_memory_usage(self.__pid)
  File "/home/kube-admin/miniconda3/envs/temporl/lib/python3.8/site-packages/scalene/scalene_gpu.py", line 101, in gpu_memory_usage
    for proc in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
  File "/home/kube-admin/miniconda3/envs/temporl/lib/python3.8/site-packages/pynvml/nvml.py", line 2223, in nvmlDeviceGetComputeRunningProcesses
    return nvmlDeviceGetComputeRunningProcesses_v2(handle);
  File "/home/kube-admin/miniconda3/envs/temporl/lib/python3.8/site-packages/pynvml/nvml.py", line 2191, in nvmlDeviceGetComputeRunningProcesses_v2
    fn = _nvmlGetFunctionPointer("nvmlDeviceGetComputeRunningProcesses_v2")
  File "/home/kube-admin/miniconda3/envs/temporl/lib/python3.8/site-packages/pynvml/nvml.py", line 785, in _nvmlGetFunctionPointer
    raise NVMLError(NVML_ERROR_FUNCTION_NOT_FOUND)
pynvml.nvml.NVMLError_FunctionNotFound: Function Not Found

To check whether this issue is coming from the scalene library itself, I ran the following commands:


>>> from pynvml import *
>>> nvmlInit()
>>> nvmlSystemGetDriverVersion()
b'440.59'
>>> handle = nvmlDeviceGetHandleByIndex(0)
>>> nvmlDeviceGetComputeRunningProcesses(handle)
Traceback (most recent call last):
  File "/home/kube-admin/miniconda3/envs/temporl/lib/python3.8/site-packages/pynvml/nvml.py", line 782, in _nvmlGetFunctionPointer
    _nvmlGetFunctionPointer_cache[name] = getattr(nvmlLib, name)
  File "/home/kube-admin/miniconda3/envs/temporl/lib/python3.8/ctypes/__init__.py", line 386, in __getattr__
    func = self.__getitem__(name)
  File "/home/kube-admin/miniconda3/envs/temporl/lib/python3.8/ctypes/__init__.py", line 391, in __getitem__
    func = self._FuncPtr((name_or_ordinal, self))
AttributeError: /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1: undefined symbol: nvmlDeviceGetComputeRunningProcesses_v2

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/kube-admin/miniconda3/envs/temporl/lib/python3.8/site-packages/pynvml/nvml.py", line 2223, in nvmlDeviceGetComputeRunningProcesses
    return nvmlDeviceGetComputeRunningProcesses_v2(handle);
  File "/home/kube-admin/miniconda3/envs/temporl/lib/python3.8/site-packages/pynvml/nvml.py", line 2191, in nvmlDeviceGetComputeRunningProcesses_v2
    fn = _nvmlGetFunctionPointer("nvmlDeviceGetComputeRunningProcesses_v2")
  File "/home/kube-admin/miniconda3/envs/temporl/lib/python3.8/site-packages/pynvml/nvml.py", line 785, in _nvmlGetFunctionPointer
    raise NVMLError(NVML_ERROR_FUNCTION_NOT_FOUND)
pynvml.nvml.NVMLError_FunctionNotFound: Function Not Found
>>> nvmlDeviceGetGraphicsRunningProcesses(handle)
Traceback (most recent call last):
  File "/home/kube-admin/miniconda3/envs/temporl/lib/python3.8/site-packages/pynvml/nvml.py", line 782, in _nvmlGetFunctionPointer
    _nvmlGetFunctionPointer_cache[name] = getattr(nvmlLib, name)
  File "/home/kube-admin/miniconda3/envs/temporl/lib/python3.8/ctypes/__init__.py", line 386, in __getattr__
    func = self.__getitem__(name)
  File "/home/kube-admin/miniconda3/envs/temporl/lib/python3.8/ctypes/__init__.py", line 391, in __getitem__
    func = self._FuncPtr((name_or_ordinal, self))
AttributeError: /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1: undefined symbol: nvmlDeviceGetGraphicsRunningProcesses_v2

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/kube-admin/miniconda3/envs/temporl/lib/python3.8/site-packages/pynvml/nvml.py", line 2260, in nvmlDeviceGetGraphicsRunningProcesses
    return nvmlDeviceGetGraphicsRunningProcesses_v2(handle)
  File "/home/kube-admin/miniconda3/envs/temporl/lib/python3.8/site-packages/pynvml/nvml.py", line 2228, in nvmlDeviceGetGraphicsRunningProcesses_v2
    fn = _nvmlGetFunctionPointer("nvmlDeviceGetGraphicsRunningProcesses_v2")
  File "/home/kube-admin/miniconda3/envs/temporl/lib/python3.8/site-packages/pynvml/nvml.py", line 785, in _nvmlGetFunctionPointer
    raise NVMLError(NVML_ERROR_FUNCTION_NOT_FOUND)
pynvml.nvml.NVMLError_FunctionNotFound: Function Not Found
>>> list(map(str, nvmlDeviceGetGraphicsRunningProcesses(handle)))
Traceback (most recent call last):
  File "/home/kube-admin/miniconda3/envs/temporl/lib/python3.8/site-packages/pynvml/nvml.py", line 782, in _nvmlGetFunctionPointer
    _nvmlGetFunctionPointer_cache[name] = getattr(nvmlLib, name)
  File "/home/kube-admin/miniconda3/envs/temporl/lib/python3.8/ctypes/__init__.py", line 386, in __getattr__
    func = self.__getitem__(name)
  File "/home/kube-admin/miniconda3/envs/temporl/lib/python3.8/ctypes/__init__.py", line 391, in __getitem__
    func = self._FuncPtr((name_or_ordinal, self))
AttributeError: /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1: undefined symbol: nvmlDeviceGetGraphicsRunningProcesses_v2

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/kube-admin/miniconda3/envs/temporl/lib/python3.8/site-packages/pynvml/nvml.py", line 2260, in nvmlDeviceGetGraphicsRunningProcesses
    return nvmlDeviceGetGraphicsRunningProcesses_v2(handle)
  File "/home/kube-admin/miniconda3/envs/temporl/lib/python3.8/site-packages/pynvml/nvml.py", line 2228, in nvmlDeviceGetGraphicsRunningProcesses_v2
    fn = _nvmlGetFunctionPointer("nvmlDeviceGetGraphicsRunningProcesses_v2")
  File "/home/kube-admin/miniconda3/envs/temporl/lib/python3.8/site-packages/pynvml/nvml.py", line 785, in _nvmlGetFunctionPointer
    raise NVMLError(NVML_ERROR_FUNCTION_NOT_FOUND)
pynvml.nvml.NVMLError_FunctionNotFound: Function Not Found
>>> nvmlDeviceGetComputeRunningProcesses_v2(handle)
Traceback (most recent call last):
  File "/home/kube-admin/miniconda3/envs/temporl/lib/python3.8/site-packages/pynvml/nvml.py", line 782, in _nvmlGetFunctionPointer
    _nvmlGetFunctionPointer_cache[name] = getattr(nvmlLib, name)
  File "/home/kube-admin/miniconda3/envs/temporl/lib/python3.8/ctypes/__init__.py", line 386, in __getattr__
    func = self.__getitem__(name)
  File "/home/kube-admin/miniconda3/envs/temporl/lib/python3.8/ctypes/__init__.py", line 391, in __getitem__
    func = self._FuncPtr((name_or_ordinal, self))
AttributeError: /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1: undefined symbol: nvmlDeviceGetComputeRunningProcesses_v2

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/kube-admin/miniconda3/envs/temporl/lib/python3.8/site-packages/pynvml/nvml.py", line 2191, in nvmlDeviceGetComputeRunningProcesses_v2
    fn = _nvmlGetFunctionPointer("nvmlDeviceGetComputeRunningProcesses_v2")
  File "/home/kube-admin/miniconda3/envs/temporl/lib/python3.8/site-packages/pynvml/nvml.py", line 785, in _nvmlGetFunctionPointer
    raise NVMLError(NVML_ERROR_FUNCTION_NOT_FOUND)
pynvml.nvml.NVMLError_FunctionNotFound: Function Not Found

The error therefore seems to come from pynvml itself rather than from scalene, but I am not sure why that is the case.
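A quick way to confirm that the missing piece is an export in the driver's NVML library itself, rather than anything on the Python side, is to probe the symbols directly with ctypes. The library path and symbol names below are copied from the AttributeError in the traceback above; on a driver that only provides the older API, only the un-suffixed names should resolve.

```python
# Probe which NVML symbols this driver's library actually exports.
# Path and symbol names are taken from the AttributeError in the traceback above.
import ctypes

lib = ctypes.CDLL("/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1")
for name in ("nvmlDeviceGetComputeRunningProcesses",
             "nvmlDeviceGetComputeRunningProcesses_v2",
             "nvmlDeviceGetGraphicsRunningProcesses",
             "nvmlDeviceGetGraphicsRunningProcesses_v2"):
    try:
        getattr(lib, name)   # ctypes raises AttributeError if the symbol is absent
        print(f"{name}: exported")
    except AttributeError:
        print(f"{name}: missing")
```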

2. Steps to reproduce the issue: install the driver version mentioned at the beginning of this post on Ubuntu 16.04 and call the pynvml functions shown above; scalene itself does not appear to be needed.

Common error checking (full `nvidia-smi` query output from the host):

Timestamp : Fri Sep 2 21:00:09 2022
Driver Version : 440.59
CUDA Version : 10.2

Attached GPUs : 3 GPU 00000000:02:00.0 Product Name : GeForce GTX 1080 Ti Product Brand : GeForce Display Mode : Disabled Display Active : Disabled Persistence Mode : Disabled Accounting Mode : Disabled Accounting Mode Buffer Size : 4000 Driver Model Current : N/A Pending : N/A Serial Number : N/A GPU UUID : GPU-cdeefa17-d5c4-bf51-15de-048f84c9c228 Minor Number : 0 VBIOS Version : 86.02.39.00.22 MultiGPU Board : No Board ID : 0x200 GPU Part Number : N/A Inforom Version Image Version : G001.0000.01.04 OEM Object : 1.1 ECC Object : N/A Power Management Object : N/A GPU Operation Mode Current : N/A Pending : N/A GPU Virtualization Mode Virtualization Mode : None Host VGPU Mode : N/A IBMNPU Relaxed Ordering Mode : N/A PCI Bus : 0x02 Device : 0x00 Domain : 0x0000 Device Id : 0x1B0610DE Bus Id : 00000000:02:00.0 Sub System Id : 0x85E51043 GPU Link Info PCIe Generation Max : 3 Current : 3 Link Width Max : 16x Current : 16x Bridge Chip Type : N/A Firmware : N/A Replays Since Reset : 0 Replay Number Rollovers : 0 Tx Throughput : 0 KB/s Rx Throughput : 0 KB/s Fan Speed : 18 % Performance State : P0 Clocks Throttle Reasons Idle : Not Active Applications Clocks Setting : Not Active SW Power Cap : Not Active HW Slowdown : Not Active HW Thermal Slowdown : Not Active HW Power Brake Slowdown : Not Active Sync Boost : Not Active SW Thermal Slowdown : Not Active Display Clock Setting : Not Active FB Memory Usage Total : 11177 MiB Used : 0 MiB Free : 11177 MiB BAR1 Memory Usage Total : 256 MiB Used : 2 MiB Free : 254 MiB Compute Mode : Default Utilization Gpu : 0 % Memory : 0 % Encoder : 0 % Decoder : 0 % Encoder Stats Active Sessions : 0 Average FPS : 0 Average Latency : 0 FBC Stats Active Sessions : 0 Average FPS : 0 Average Latency : 0 Ecc Mode Current : N/A Pending : N/A ECC Errors Volatile Single Bit
Device Memory : N/A Register File : N/A L1 Cache : N/A L2 Cache : N/A Texture Memory : N/A Texture Shared : N/A CBU : N/A Total : N/A Double Bit
Device Memory : N/A Register File : N/A L1 Cache : N/A L2 Cache : N/A Texture Memory : N/A Texture Shared : N/A CBU : N/A Total : N/A Aggregate Single Bit
Device Memory : N/A Register File : N/A L1 Cache : N/A L2 Cache : N/A Texture Memory : N/A Texture Shared : N/A CBU : N/A Total : N/A Double Bit
Device Memory : N/A Register File : N/A L1 Cache : N/A L2 Cache : N/A Texture Memory : N/A Texture Shared : N/A CBU : N/A Total : N/A Retired Pages Single Bit ECC : N/A Double Bit ECC : N/A Pending Page Blacklist : N/A Temperature GPU Current Temp : 43 C GPU Shutdown Temp : 96 C GPU Slowdown Temp : 93 C GPU Max Operating Temp : N/A Memory Current Temp : N/A Memory Max Operating Temp : N/A Power Readings Power Management : Supported Power Draw : 60.06 W Power Limit : 250.00 W Default Power Limit : 250.00 W Enforced Power Limit : 250.00 W Min Power Limit : 125.00 W Max Power Limit : 300.00 W Clocks Graphics : 1480 MHz SM : 1480 MHz Memory : 5508 MHz Video : 1252 MHz Applications Clocks Graphics : N/A Memory : N/A Default Applications Clocks Graphics : N/A Memory : N/A Max Clocks Graphics : 1911 MHz SM : 1911 MHz Memory : 5505 MHz Video : 1620 MHz Max Customer Boost Clocks Graphics : N/A Clock Policy Auto Boost : N/A Auto Boost Default : N/A Processes : None

GPU 00000000:03:00.0 Product Name : GeForce GTX 1080 Ti Product Brand : GeForce Display Mode : Disabled Display Active : Disabled Persistence Mode : Disabled Accounting Mode : Disabled Accounting Mode Buffer Size : 4000 Driver Model Current : N/A Pending : N/A Serial Number : N/A GPU UUID : GPU-e4f1baff-dc90-4107-e5e0-6b7bfb9e6d68 Minor Number : 1 VBIOS Version : 86.02.39.00.22 MultiGPU Board : No Board ID : 0x300 GPU Part Number : N/A Inforom Version Image Version : G001.0000.01.04 OEM Object : 1.1 ECC Object : N/A Power Management Object : N/A GPU Operation Mode Current : N/A Pending : N/A GPU Virtualization Mode Virtualization Mode : None Host VGPU Mode : N/A IBMNPU Relaxed Ordering Mode : N/A PCI Bus : 0x03 Device : 0x00 Domain : 0x0000 Device Id : 0x1B0610DE Bus Id : 00000000:03:00.0 Sub System Id : 0x85E51043 GPU Link Info PCIe Generation Max : 3 Current : 3 Link Width Max : 16x Current : 16x Bridge Chip Type : N/A Firmware : N/A Replays Since Reset : 0 Replay Number Rollovers : 0 Tx Throughput : 0 KB/s Rx Throughput : 0 KB/s Fan Speed : 18 % Performance State : P0 Clocks Throttle Reasons Idle : Not Active Applications Clocks Setting : Not Active SW Power Cap : Not Active HW Slowdown : Not Active HW Thermal Slowdown : Not Active HW Power Brake Slowdown : Not Active Sync Boost : Not Active SW Thermal Slowdown : Not Active Display Clock Setting : Not Active FB Memory Usage Total : 11178 MiB Used : 0 MiB Free : 11178 MiB BAR1 Memory Usage Total : 256 MiB Used : 2 MiB Free : 254 MiB Compute Mode : Default Utilization Gpu : 0 % Memory : 0 % Encoder : 0 % Decoder : 0 % Encoder Stats Active Sessions : 0 Average FPS : 0 Average Latency : 0 FBC Stats Active Sessions : 0 Average FPS : 0 Average Latency : 0 Ecc Mode Current : N/A Pending : N/A ECC Errors Volatile Single Bit
Device Memory : N/A Register File : N/A L1 Cache : N/A L2 Cache : N/A Texture Memory : N/A Texture Shared : N/A CBU : N/A Total : N/A Double Bit
Device Memory : N/A Register File : N/A L1 Cache : N/A L2 Cache : N/A Texture Memory : N/A Texture Shared : N/A CBU : N/A Total : N/A Aggregate Single Bit
Device Memory : N/A Register File : N/A L1 Cache : N/A L2 Cache : N/A Texture Memory : N/A Texture Shared : N/A CBU : N/A Total : N/A Double Bit
Device Memory : N/A Register File : N/A L1 Cache : N/A L2 Cache : N/A Texture Memory : N/A Texture Shared : N/A CBU : N/A Total : N/A Retired Pages Single Bit ECC : N/A Double Bit ECC : N/A Pending Page Blacklist : N/A Temperature GPU Current Temp : 43 C GPU Shutdown Temp : 96 C GPU Slowdown Temp : 93 C GPU Max Operating Temp : N/A Memory Current Temp : N/A Memory Max Operating Temp : N/A Power Readings Power Management : Supported Power Draw : 61.03 W Power Limit : 250.00 W Default Power Limit : 250.00 W Enforced Power Limit : 250.00 W Min Power Limit : 125.00 W Max Power Limit : 300.00 W Clocks Graphics : 1480 MHz SM : 1480 MHz Memory : 5508 MHz Video : 1252 MHz Applications Clocks Graphics : N/A Memory : N/A Default Applications Clocks Graphics : N/A Memory : N/A Max Clocks Graphics : 1911 MHz SM : 1911 MHz Memory : 5505 MHz Video : 1620 MHz Max Customer Boost Clocks Graphics : N/A Clock Policy Auto Boost : N/A Auto Boost Default : N/A Processes : None

GPU 00000000:83:00.0 Product Name : GeForce GTX 1080 Ti Product Brand : GeForce Display Mode : Disabled Display Active : Disabled Persistence Mode : Disabled Accounting Mode : Disabled Accounting Mode Buffer Size : 4000 Driver Model Current : N/A Pending : N/A Serial Number : N/A GPU UUID : GPU-b701a6bb-1912-6f4e-6bbb-8d1f87805edc Minor Number : 3 VBIOS Version : 86.02.39.00.22 MultiGPU Board : No Board ID : 0x8300 GPU Part Number : N/A Inforom Version Image Version : G001.0000.01.04 OEM Object : 1.1 ECC Object : N/A Power Management Object : N/A GPU Operation Mode Current : N/A Pending : N/A GPU Virtualization Mode Virtualization Mode : None Host VGPU Mode : N/A IBMNPU Relaxed Ordering Mode : N/A PCI Bus : 0x83 Device : 0x00 Domain : 0x0000 Device Id : 0x1B0610DE Bus Id : 00000000:83:00.0 Sub System Id : 0x85E51043 GPU Link Info PCIe Generation Max : 3 Current : 3 Link Width Max : 16x Current : 16x Bridge Chip Type : N/A Firmware : N/A Replays Since Reset : 0 Replay Number Rollovers : 0 Tx Throughput : 0 KB/s Rx Throughput : 0 KB/s Fan Speed : 17 % Performance State : P0 Clocks Throttle Reasons Idle : Not Active Applications Clocks Setting : Not Active SW Power Cap : Active HW Slowdown : Not Active HW Thermal Slowdown : Not Active HW Power Brake Slowdown : Not Active Sync Boost : Not Active SW Thermal Slowdown : Not Active Display Clock Setting : Not Active FB Memory Usage Total : 11178 MiB Used : 0 MiB Free : 11178 MiB BAR1 Memory Usage Total : 256 MiB Used : 2 MiB Free : 254 MiB Compute Mode : Default Utilization Gpu : 0 % Memory : 0 % Encoder : 0 % Decoder : 0 % Encoder Stats Active Sessions : 0 Average FPS : 0 Average Latency : 0 FBC Stats Active Sessions : 0 Average FPS : 0 Average Latency : 0 Ecc Mode Current : N/A Pending : N/A ECC Errors Volatile Single Bit
Device Memory : N/A Register File : N/A L1 Cache : N/A L2 Cache : N/A Texture Memory : N/A Texture Shared : N/A CBU : N/A Total : N/A Double Bit
Device Memory : N/A Register File : N/A L1 Cache : N/A L2 Cache : N/A Texture Memory : N/A Texture Shared : N/A CBU : N/A Total : N/A Aggregate Single Bit
Device Memory : N/A Register File : N/A L1 Cache : N/A L2 Cache : N/A Texture Memory : N/A Texture Shared : N/A CBU : N/A Total : N/A Double Bit
Device Memory : N/A Register File : N/A L1 Cache : N/A L2 Cache : N/A Texture Memory : N/A Texture Shared : N/A CBU : N/A Total : N/A Retired Pages Single Bit ECC : N/A Double Bit ECC : N/A Pending Page Blacklist : N/A Temperature GPU Current Temp : 40 C GPU Shutdown Temp : 96 C GPU Slowdown Temp : 93 C GPU Max Operating Temp : N/A Memory Current Temp : N/A Memory Max Operating Temp : N/A Power Readings Power Management : Supported Power Draw : 55.13 W Power Limit : 250.00 W Default Power Limit : 250.00 W Enforced Power Limit : 250.00 W Min Power Limit : 125.00 W Max Power Limit : 300.00 W Clocks Graphics : 151 MHz SM : 151 MHz Memory : 5508 MHz Video : 734 MHz Applications Clocks Graphics : N/A Memory : N/A Default Applications Clocks Graphics : N/A Memory : N/A Max Clocks Graphics : 1911 MHz SM : 1911 MHz Memory : 5505 MHz Video : 1620 MHz Max Customer Boost Clocks Graphics : N/A Clock Policy Auto Boost : N/A Auto Boost Default : N/A Processes : None


 - [ ] Your docker configuration file (e.g: `/etc/docker/daemon.json`)
 - [ ] The k8s-device-plugin container logs
 - [ ] The kubelet logs on the node (e.g: `sudo journalctl -r -u kubelet`)

Additional information that might help better understand your environment and reproduce the bug:
 - [x] Docker version from `docker version`: `Docker version 1.11.2, build b9f10c9`
 - [ ] Docker command, image and tag used
 - [x] Kernel version from `uname -a`: `4.4.0-130-generic #156-Ubuntu`
 - [ ] Any relevant kernel output lines from `dmesg`
 - [ ] NVIDIA packages version from `dpkg -l '*nvidia*'` _or_ `rpm -qa '*nvidia*'`
 - [ ] NVIDIA container library version from `nvidia-container-cli -V`
 - [ ] NVIDIA container library logs (see [troubleshooting](https://github.com/NVIDIA/nvidia-docker/wiki/Troubleshooting))
elezar commented 1 year ago

This is most likely due to how pynvml (or this package: https://pypi.org/project/nvidia-ml-py/#history) wraps the underlying NVML library. It seems to require certain `_v2` symbols that are not present in the library shipped with your driver version.

I will try to find out where one can ask questions about the bindings.
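In the meantime, a possible stopgap for callers stuck on an older driver (just a sketch, not an official recommendation) is to call the un-suffixed v1 entry point directly through ctypes, using the v1 layout of nvmlProcessInfo_t (pid and usedGpuMemory only). This assumes the driver library still exports those legacy symbols, which predate this driver release:

```python
# Sketch: list running compute processes via the pre-_v2 NVML entry point.
# Assumes libnvidia-ml.so.1 is on the loader path and still exports the
# legacy (un-suffixed) symbols.
import ctypes

class NvmlProcessInfoV1(ctypes.Structure):
    # v1 layout of nvmlProcessInfo_t: pid and used GPU memory (bytes) only
    _fields_ = [("pid", ctypes.c_uint),
                ("usedGpuMemory", ctypes.c_ulonglong)]

nvml = ctypes.CDLL("libnvidia-ml.so.1")
assert nvml.nvmlInit() == 0                       # 0 == NVML_SUCCESS

device = ctypes.c_void_p()                        # nvmlDevice_t is an opaque pointer
assert nvml.nvmlDeviceGetHandleByIndex(0, ctypes.byref(device)) == 0

count = ctypes.c_uint(64)                         # in: array capacity, out: process count
procs = (NvmlProcessInfoV1 * 64)()
ret = nvml.nvmlDeviceGetComputeRunningProcesses(device, ctypes.byref(count), procs)
if ret == 0:
    for proc in procs[:count.value]:
        print(f"pid={proc.pid} usedGpuMemory={proc.usedGpuMemory}")
else:
    print("NVML returned error code", ret)

nvml.nvmlShutdown()
```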

klueska commented 1 year ago

It's here: https://forums.developer.nvidia.com/c/development-tools/other-tools/system-management-and-monitoring-nvml/128

liuqiba commented 1 year ago

Quoting phw89: "This has now been fixed, as of the 526 driver release." I am trying it.

littletomatodonkey commented 6 months ago

You can downgrade pynvml (e.g., from 11.5.0 to 11.4.0).
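If you go that route, a quick way to confirm the pinned binding resolves against your driver (for example after `pip install "pynvml==11.4.0"`, as suggested above) is to repeat the call that originally failed:

```python
# The same call that raised NVMLError_FunctionNotFound above; if the downgrade
# helped, it returns a (possibly empty) process list instead of raising.
from pynvml import (nvmlInit, nvmlShutdown, nvmlDeviceGetHandleByIndex,
                    nvmlDeviceGetComputeRunningProcesses)

nvmlInit()
try:
    handle = nvmlDeviceGetHandleByIndex(0)
    print(nvmlDeviceGetComputeRunningProcesses(handle))  # [] when nothing is running
finally:
    nvmlShutdown()
```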