XuehaiPan / nvitop

An interactive NVIDIA-GPU process viewer and beyond, the one-stop solution for GPU process management.
https://nvitop.readthedocs.io
Apache License 2.0
4.56k stars 144 forks

[BUG] nvidia-ml-py-12.535.77 compatibility issue #90

Closed hui-zhao-1 closed 1 year ago

hui-zhao-1 commented 1 year ago

Required prerequisites

What version of nvitop are you using?

1.2.0

Operating system and version

Ubuntu 20.04.6 LTS (Focal Fossa)

NVIDIA driver version

535.86.10

NVIDIA-SMI

Thu Aug 17 16:23:29 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.10              Driver Version: 535.86.10    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1660 Ti     Off | 00000000:01:00.0 Off |                  N/A |
| 27%   46C    P2              39W / 120W |     75MiB /  6144MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    522690      C   ./nvitop-test                                72MiB |
+---------------------------------------------------------------------------------------+

Python environment

3.8.17 (default, Jul 5 2023, 21:04:15) [GCC 11.2.0] linux
nvidia-ml-py==12.535.77
nvitop==1.2.0

Problem description

The test machine uses the NVIDIA driver from https://international.download.nvidia.com/tesla/535.86.10/NVIDIA-Linux-x86_64-535.86.10.run. With this driver version, nvitop does not list the running processes (screenshots in the original issue).

Steps to Reproduce

While investigating, I found that nvitop reports the following error (screenshot in the original issue):

Traceback

ERROR: A FunctionNotFound error occurred while calling nvmlQuery(<function nvmlDeviceGetComputeRunningProcesses at 0x7f2b7be24670>, *args, **kwargs).
Please verify whether the `nvidia-ml-py` package is compatible with your NVIDIA driver version.
ERROR: A FunctionNotFound error occurred while calling nvmlQuery(<function nvmlDeviceGetGraphicsRunningProcesses at 0x7f2b7be24700>, *args, **kwargs).
Please verify whether the `nvidia-ml-py` package is compatible with your NVIDIA driver version.

Logs

No response

Expected behavior

No response

Additional context

No response

XuehaiPan commented 1 year ago

Duplicate of #88; would be fixed by #89.

hui-zhao-1 commented 1 year ago

After going through the logs, I suspect the problem is in the __determine_get_running_processes_version_suffix() function at line 590 of https://github.com/XuehaiPan/nvitop/blob/main/nvitop/api/libnvml.py. I could not work out why the version is determined by checking for 'nvmlDeviceGetConfComputeMemSizeInfo', so I forked the code and commented out that check, which resolved the problem. See: https://github.com/XuehaiPan/nvitop/commit/cc3ad6da513062cab1759267fb80a028d74c2f32
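For readers unfamiliar with the idea of a "version suffix": NVML exports several variants of the process-listing calls (for example nvmlDeviceGetComputeRunningProcesses, ..._v2 and ..._v3), and a binding has to pick one that the installed driver library actually provides. The following is only a minimal sketch of one way to do that, by probing each candidate name directly rather than inferring the version from an unrelated function. It is not nvitop's actual code; `pick_versioned_symbol`, `CANDIDATE_SUFFIXES`, and `fake_exports` are hypothetical names introduced for illustration, and the export set is made up so the example runs without a GPU.

```python
from typing import Callable, Iterable, Optional

# Suffixes ordered newest first; the unsuffixed name is the oldest variant.
CANDIDATE_SUFFIXES = ("_v3", "_v2", "")


def pick_versioned_symbol(
    base_name: str,
    has_symbol: Callable[[str], bool],
    suffixes: Iterable[str] = CANDIDATE_SUFFIXES,
) -> Optional[str]:
    """Return the first `base_name + suffix` accepted by `has_symbol`, or None."""
    for suffix in suffixes:
        candidate = base_name + suffix
        if has_symbol(candidate):
            return candidate
    return None


# Hypothetical export table standing in for an older driver library
# that does not provide the _v3 variant.
fake_exports = {
    "nvmlDeviceGetComputeRunningProcesses",
    "nvmlDeviceGetComputeRunningProcesses_v2",
}

chosen = pick_versioned_symbol(
    "nvmlDeviceGetComputeRunningProcesses", fake_exports.__contains__
)
print(chosen)  # nvmlDeviceGetComputeRunningProcesses_v2
```

Against a real shared library, `has_symbol` could be something like `lambda name: hasattr(lib, name)` for a `ctypes.CDLL` handle `lib`, since ctypes raises AttributeError when an export is missing. Whether probing each name is preferable to the check used upstream is a judgment call this sketch does not settle.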

hui-zhao-1 commented 1 year ago

Verified after installing with pip3 install git+https://github.com/XuehaiPan/nvitop.git: the problem is resolved.