XuehaiPan / nvitop

An interactive NVIDIA-GPU process viewer and beyond, the one-stop solution for GPU process management.
https://nvitop.readthedocs.io
Apache License 2.0
4.81k stars, 150 forks

[Bug] Corrupted dependency of version 0.10.0 with `pynvml` #44

Closed: lounicotra closed this issue 2 years ago

lounicotra commented 2 years ago

Runtime Environment

Current Behavior

Version 0.10.0 fails at startup with: module 'pynvml' has no attribute '_nvmlGetFunctionPointer'. The sequence of working and non-working installs is shown below. The affected servers are Dell machines with A100 GPUs, and they have the latest versions of py3nvml and pynvml installed. By contrast, on a newly built Ubuntu 20.04 system (Dell PowerEdge R720, CUDA 11.5, GTX 1080s) I was able to install 0.10.0 with no issues, and it works fine. Thanks.

root@hydra1 ~# nvt
Wed Oct 19 13:57:29 2022
╒═════════════════════════════════════════════════════════════════════════════╕
│ NVITOP 0.9.0       Driver Version: 510.47.03      CUDA Driver Version: 11.6 │
├───────────────────────────────┬──────────────────────┬──────────────────────┤
│ GPU  Name        Persistence-M│ Bus-Id        Disp.A │ MIG M.   Uncorr. ECC │
│ Fan  Temp  Perf  Pwr:Usage/Cap│         Memory-Usage │ GPU-Util  Compute M. │
╞═══════════════════════════════╪══════════════════════╪══════════════════════╪══════════════════════════════════════════════════════╕
│   0  A100-SXM4-80GB      On   │ 00000000:01:00.0 Off │ Disabled           0 │ MEM: █████████▋ 22.4%                                │
│ N/A   28C    P0    69W / 500W │  18380MiB / 80.00GiB │      0%      Default │ UTL: ▏ 0%                                            │
├───────────────────────────────┼──────────────────────┼──────────────────────┼──────────────────────────────────────────────────────┤
│   1  A100-SXM4-80GB      On   │ 00000000:41:00.0 Off │ Disabled           0 │ MEM: █████████████████████████████████████████▊ 97%  │
│ N/A   55C    P0   142W / 500W │  77.74GiB / 80.00GiB │     99%      Default │ UTL: ██████████████████████████████████████████▋ 99% │
├───────────────────────────────┼──────────────────────┼──────────────────────┼──────────────────────────────────────────────────────┤
│   2  A100-SXM4-80GB      On   │ 00000000:81:00.0 Off │ Disabled           0 │ MEM: ▍ 1.0%                                          │
│ N/A   34C    P0    61W / 500W │    850MiB / 80.00GiB │      0%      Default │ UTL: ▏ 0%                                            │
├───────────────────────────────┼──────────────────────┼──────────────────────┼──────────────────────────────────────────────────────┤
│   3  A100-SXM4-80GB      On   │ 00000000:C1:00.0 Off │ Disabled           0 │ MEM: █████▋ 13.0%                                    │
│ N/A   28C    P0    68W / 500W │  10619MiB / 80.00GiB │      0%      Default │ UTL: ▏ 0%                                            │
╘═══════════════════════════════╧══════════════════════╧══════════════════════╧══════════════════════════════════════════════════════╛
[ CPU: █▏ 1.4%                                                                                  ]  ( Load Average:  1.91  1.65  2.59 )
[ MEM: █████▏ 6.1%                                                                              ]  [ SWP: ▏ 0.0%                     ]

╒════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╕
│ Processes:                                                                                                      root@hydra1.som.ma │
│ GPU     PID      USER  GPU-MEM %SM  %CPU  %MEM       TIME  COMMAND                                                                 │
╞════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
│   0   27113 C    root 10141MiB   0   0.5   0.5  28.0 days  trver --log-verbose=0 --strict-model-config=true --model-repos.. │
│   0   59066 C snaith+  7385MiB   0   0.0   0.8   9.8 days  /home/snani/anaconda3/envs//bin/python3.8 /home/sn.. │
├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│   1   13134 C    root 76.91GiB  89  99.7   0.7    9:58:03  python ./scripts/speectext_bpe.py --config-path=/rpice/dgx.. │
├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│   3   41771 C    root  9765MiB   0   0.5   0.4   46:36:44  tritr --log-verbose=0 --strict-model-config=true --model-repos.. │
╘════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╛
root@hydra1 ~#  pip3 install --upgrade nvitop
Collecting nvitop
  Downloading nvitop-0.10.0-py3-none-any.whl (159 kB)
     |████████████████████████████████| 159 kB 1.0 MB/s
Requirement already satisfied, skipping upgrade: cachetools>=1.0.1 in /usr/local/lib/python3.8/dist-packages (from nvitop) (5.0.0)
Requirement already satisfied, skipping upgrade: nvidia-ml-py<11.516.0a0,>=11.450.51 in /usr/local/lib/python3.8/dist-packages (from nvitop) (11.450.51)
Requirement already satisfied, skipping upgrade: psutil>=5.6.6 in /usr/local/lib/python3.8/dist-packages (from nvitop) (5.9.0)
Requirement already satisfied, skipping upgrade: termcolor>=1.0.0 in /usr/local/lib/python3.8/dist-packages (from nvitop) (1.1.0)
Installing collected packages: nvitop
  Attempting uninstall: nvitop
    Found existing installation: nvitop 0.9.0
    Uninstalling nvitop-0.9.0:
      Successfully uninstalled nvitop-0.9.0
Successfully installed nvitop-0.10.0
root@hydra1 ~# nvt
Traceback (most recent call last):
  File "/usr/local/bin/nvitop", line 5, in <module>
    from nvitop.cli import main
  File "/usr/local/lib/python3.8/dist-packages/nvitop/__init__.py", line 6, in <module>
    from nvitop import core
  File "/usr/local/lib/python3.8/dist-packages/nvitop/core/__init__.py", line 6, in <module>
    from nvitop.core import host, libcuda, libnvml, utils
  File "/usr/local/lib/python3.8/dist-packages/nvitop/core/libnvml.py", line 543, in <module>
    __patch_backward_compatibility_layers()
  File "/usr/local/lib/python3.8/dist-packages/nvitop/core/libnvml.py", line 539, in __patch_backward_compatibility_layers
    with_mapped_function_name()  # patch first and only for once
  File "/usr/local/lib/python3.8/dist-packages/nvitop/core/libnvml.py", line 443, in with_mapped_function_name
    _pynvml._nvmlGetFunctionPointer  # pylint: disable=protected-access
AttributeError: module 'pynvml' has no attribute '_nvmlGetFunctionPointer'
root@hydra1 ~# pip3 install nvitop==0.9.0
Collecting nvitop==0.9.0
  Using cached nvitop-0.9.0-py3-none-any.whl (157 kB)
Requirement already satisfied: psutil>=5.6.6 in /usr/local/lib/python3.8/dist-packages (from nvitop==0.9.0) (5.9.0)
Requirement already satisfied: cachetools>=1.0.1 in /usr/local/lib/python3.8/dist-packages (from nvitop==0.9.0) (5.0.0)
Requirement already satisfied: termcolor>=1.0.0 in /usr/local/lib/python3.8/dist-packages (from nvitop==0.9.0) (1.1.0)
Requirement already satisfied: nvidia-ml-py<11.500.0a0,>=11.450.51 in /usr/local/lib/python3.8/dist-packages (from nvitop==0.9.0) (11.450.51)
Installing collected packages: nvitop
  Attempting uninstall: nvitop
    Found existing installation: nvitop 0.10.0
    Uninstalling nvitop-0.10.0:
      Successfully uninstalled nvitop-0.10.0
Successfully installed nvitop-0.9.0
root@hydra1 ~# nvt
Wed Oct 19 14:00:57 2022
╒═════════════════════════════════════════════════════════════════════════════╕
│ NVITOP 0.9.0       Driver Version: 510.47.03      CUDA Driver Version: 11.6 │
├───────────────────────────────┬──────────────────────┬──────────────────────┤
│ GPU  Name        Persistence-M│ Bus-Id        Disp.A │ MIG M.   Uncorr. ECC │
│ Fan  Temp  Perf  Pwr:Usage/Cap│         Memory-Usage │ GPU-Util  Compute M. │
╞═══════════════════════════════╪══════════════════════╪══════════════════════╪══════════════════════════════════════════════════════╕
│   0  A100-SXM4-80GB      On   │ 00000000:01:00.0 Off │ Disabled           0 │ MEM: █████████▋ 22.4%                                │
│ N/A   28C    P0    70W / 500W │  18380MiB / 80.00GiB │      0%      Default │ UTL: ▏ 0%                                            │
├───────────────────────────────┼──────────────────────┼──────────────────────┼──────────────────────────────────────────────────────┤
│   1  A100-SXM4-80GB      On   │ 00000000:41:00.0 Off │ Disabled           0 │ MEM: █████████████████████████████████████████▊ 97%  │
│ N/A   55C    P0   346W / 500W │  77.74GiB / 80.00GiB │    100%      Default │ UTL: ███████████████████████████████████████████ MAX │
├───────────────────────────────┼──────────────────────┼──────────────────────┼──────────────────────────────────────────────────────┤
│   2  A100-SXM4-80GB      On   │ 00000000:81:00.0 Off │ Disabled           0 │ MEM: ▍ 1.0%                                          │
│ N/A   34C    P0    61W / 500W │    850MiB / 80.00GiB │      0%      Default │ UTL: ▏ 0%                                            │
├───────────────────────────────┼──────────────────────┼──────────────────────┼──────────────────────────────────────────────────────┤
│   3  A100-SXM4-80GB      On   │ 00000000:C1:00.0 Off │ Disabled           0 │ MEM: █████▋ 13.0%                                    │
│ N/A   28C    P0    68W / 500W │  10619MiB / 80.00GiB │      0%      Default │ UTL: ▏ 0%                                            │
╘═══════════════════════════════╧══════════════════════╧══════════════════════╧══════════════════════════════════════════════════════╛
[ CPU: █▌ 1.8%                                                                                  ]  ( Load Average:  1.42  1.53  2.35 )
[ MEM: █████▎ 6.2%                                                                              ]  [ SWP: ▏ 0.0%                     ]

╒════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╕
│ Processes:                                                                                                      root@hydra1.som.ma │
│ GPU     PID      USER  GPU-MEM %SM  %CPU  %MEM       TIME  COMMAND                                                                 │
╞════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
│   0   27113 C    root 10141MiB   0   0.5   0.5  28.0 days  trir --log-verbose=0 --strict-model-config=true --model-repos.. │
│   0   59066 C snaith+  7385MiB   0   0.0   0.8   9.8 days  /home/snani/anaconda3/envs//home/sn.. │
├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│   1   13134 C    root 76.91GiB  88 100.6   0.7   10:01:31  python ./scripts/stext_bpe.py --config-path=/rprice/dgx.. │
├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│   3   41771 C    root  9765MiB   0   0.5   0.4   46:40:12  tritr --log-verbose=0 --strict-model-config=true --model-repos.. │
╘════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╛
root@hydra1 ~# logout

Reverting to 0.9.0 fixes the issue.

# Different server with CUDA 11.7
root@hydra4 aide#  pip3 install --upgrade nvitop
Collecting nvitop
  Downloading nvitop-0.10.0-py3-none-any.whl (159 kB)
     |████████████████████████████████| 159 kB 15.3 MB/s
Requirement already satisfied, skipping upgrade: psutil>=5.6.6 in /usr/local/lib/python3.8/dist-packages (from nvitop) (5.9.0)
Requirement already satisfied, skipping upgrade: termcolor>=1.0.0 in /usr/local/lib/python3.8/dist-packages (from nvitop) (1.1.0)
Requirement already satisfied, skipping upgrade: cachetools>=1.0.1 in /usr/local/lib/python3.8/dist-packages (from nvitop) (5.0.0)
Requirement already satisfied, skipping upgrade: nvidia-ml-py<11.516.0a0,>=11.450.51 in /usr/local/lib/python3.8/dist-packages (from nvitop) (11.450.51)
Installing collected packages: nvitop
  Attempting uninstall: nvitop
    Found existing installation: nvitop 0.9.0
    Uninstalling nvitop-0.9.0:
      Successfully uninstalled nvitop-0.9.0
Successfully installed nvitop-0.10.0
root@hydra4 aide# nvitop
Traceback (most recent call last):
  File "/usr/local/bin/nvitop", line 5, in <module>
    from nvitop.cli import main
  File "/usr/local/lib/python3.8/dist-packages/nvitop/__init__.py", line 6, in <module>
    from nvitop import core
  File "/usr/local/lib/python3.8/dist-packages/nvitop/core/__init__.py", line 6, in <module>
    from nvitop.core import host, libcuda, libnvml, utils
  File "/usr/local/lib/python3.8/dist-packages/nvitop/core/libnvml.py", line 543, in <module>
    __patch_backward_compatibility_layers()
  File "/usr/local/lib/python3.8/dist-packages/nvitop/core/libnvml.py", line 539, in __patch_backward_compatibility_layers
    with_mapped_function_name()  # patch first and only for once
  File "/usr/local/lib/python3.8/dist-packages/nvitop/core/libnvml.py", line 443, in with_mapped_function_name
    _pynvml._nvmlGetFunctionPointer  # pylint: disable=protected-access
AttributeError: module 'pynvml' has no attribute '_nvmlGetFunctionPointer'

root@hydra4 aide# pip3 install nvitop==0.9.0
Collecting nvitop==0.9.0
  Using cached nvitop-0.9.0-py3-none-any.whl (157 kB)
Requirement already satisfied: termcolor>=1.0.0 in /usr/local/lib/python3.8/dist-packages (from nvitop==0.9.0) (1.1.0)
Requirement already satisfied: nvidia-ml-py<11.500.0a0,>=11.450.51 in /usr/local/lib/python3.8/dist-packages (from nvitop==0.9.0) (11.450.51)
Requirement already satisfied: cachetools>=1.0.1 in /usr/local/lib/python3.8/dist-packages (from nvitop==0.9.0) (5.0.0)
Requirement already satisfied: psutil>=5.6.6 in /usr/local/lib/python3.8/dist-packages (from nvitop==0.9.0) (5.9.0)
Installing collected packages: nvitop
  Attempting uninstall: nvitop
    Found existing installation: nvitop 0.10.0
    Uninstalling nvitop-0.10.0:
      Successfully uninstalled nvitop-0.10.0
Successfully installed nvitop-0.9.0
root@hydra4 aide# nvt
Wed Oct 19 14:01:18 2022
╒═════════════════════════════════════════════════════════════════════════════╕
│ NVITOP 0.9.0       Driver Version: 515.43.04      CUDA Driver Version: 11.7 │
├───────────────────────────────┬──────────────────────┬──────────────────────┤
│ GPU  Name        Persistence-M│ Bus-Id        Disp.A │ MIG M.   Uncorr. ECC │
│ Fan  Temp  Perf  Pwr:Usage/Cap│         Memory-Usage │ GPU-Util  Compute M. │
╞═══════════════════════════════╪══════════════════════╪══════════════════════╪══════════════════════════════════════════════════════╕
│   0  A100-SXM4-80GB      On   │ 00000000:01:00.0 Off │  Enabled           0 │ MEM: ▍ 1.0%                                          │
│ N/A   26C    P0    51W / 500W │    854MiB / 80.00GiB │     N/A      Default │ UTL: ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ N/A │
├───────────────────────────────┼──────────────────────┼──────────────────────┼──────────────────────────────────────────────────────┤
│ 0:0      2g.20gb @ GI/CI: 3/0 │     13MiB / 19968MiB │ BAR1:    18MiB /  0% │ MEM: ▏ 0.1%                                          │
│ 0:1      2g.20gb @ GI/CI: 4/0 │     13MiB / 19968MiB │ BAR1:    22MiB /  0% │ MEM: ▏ 0.1%                                          │
│ 0:2      2g.20gb @ GI/CI: 5/0 │     13MiB / 19968MiB │ BAR1:     2MiB /  0% │ MEM: ▏ 0.1%                                          │
├───────────────────────────────┼──────────────────────┼──────────────────────┼──────────────────────────────────────────────────────┤
│   1  A100-SXM4-80GB      On   │ 00000000:41:00.0 Off │ Disabled           0 │ MEM: █████▊ 13.5%                                    │
│ N/A   38C    P0   157W / 500W │  11027MiB / 80.00GiB │     89%      Default │ UTL: ██████████████████████████████████████▎ 89%     │
├───────────────────────────────┼──────────────────────┼──────────────────────┼──────────────────────────────────────────────────────┤
│   2  A100-SXM4-80GB      On   │ 00000000:81:00.0 Off │ Disabled           0 │ MEM: █████▊ 13.5%                                    │
│ N/A   42C    P0   139W / 500W │  11043MiB / 80.00GiB │     90%      Default │ UTL: ██████████████████████████████████████▊ 90%     │
├───────────────────────────────┼──────────────────────┼──────────────────────┼──────────────────────────────────────────────────────┤
│   3  A100-SXM4-80GB      On   │ 00000000:C1:00.0 Off │ Disabled           0 │ MEM: ▍ 1.0%                                          │
│ N/A   25C    P0    55W / 500W │    815MiB / 80.00GiB │      0%      Default │ UTL: ▏ 0%                                            │
╘═══════════════════════════════╧══════════════════════╧══════════════════════╧══════════════════════════════════════════════════════╛
[ CPU: ███▎ 3.8%                                                                                ]  ( Load Average:  6.83  7.11  6.50 )
[ MEM: ███▌ 4.2%                                                                                ]  [ SWP: ▏ 0.0%                     ]

╒════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╕
│ Processes:                                                                                                 root@hydra4.rdct.som.ma │
│ GPU     PID      USER  GPU-MEM %SM  %CPU  %MEM   TIME  COMMAND                                                                     │
╞════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
│   1   25414 C aljolje 10209MiB  84 100.3   0.5  25:02  /n/redta/rc043h/PYTORCHY/miniconda2/envs/py3.9_torch1.10_cuda11.3/bin/p.. │
├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│   2   29172 C aljolje 10225MiB  86 100.0   0.5  20:35  /n/redta/rc043h/PYTORCHY/miniconda2/envs/py3.9_torch1.10_cuda11.3/bin/p.. │
╘════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╛
root@hydra4 aide# pip3 install --upgrade py3nvml
Requirement already up-to-date: py3nvml in /usr/local/lib/python3.8/dist-packages (0.2.7)
Requirement already satisfied, skipping upgrade: xmltodict in /usr/local/lib/python3.8/dist-packages (from py3nvml) (0.12.0)
root@hydra4 aide# pip3 install --upgrade pynvml
Requirement already up-to-date: pynvml in /usr/local/lib/python3.8/dist-packages (11.4.1)

XuehaiPan commented 2 years ago

Hi @lounicotra, thanks for the feedback. This happens when the dependency module pynvml.py is corrupted. I will add a more informative error message for this case in a patch release.

Please reinstall nvitop and nvidia-ml-py with:

pip3 install --force-reinstall nvitop nvidia-ml-py

or install nvitop in a new clean virtual environment.
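
To confirm whether the reinstall helped, a check along the following lines may be useful. This is only a sketch of what a more informative message could look like, not nvitop's actual implementation: the function name `check_pynvml` and the message wording are hypothetical. The only detail taken from the traceback is the attribute name `_nvmlGetFunctionPointer`.

```python
import importlib

# Private attribute that nvitop 0.10.0 looks up, per the traceback in this issue.
REQUIRED_ATTR = "_nvmlGetFunctionPointer"


def check_pynvml(module=None):
    """Return the pynvml module, or raise ImportError with a readable hint.

    `check_pynvml` is a hypothetical helper written for illustration only.
    """
    if module is None:
        module = importlib.import_module("pynvml")
    if not hasattr(module, REQUIRED_ATTR):
        # Report where the module was loaded from to help spot a conflicting install.
        location = getattr(module, "__file__", None) or "<unknown location>"
        raise ImportError(
            f"The imported 'pynvml' module ({location}) has no attribute "
            f"{REQUIRED_ATTR!r}. It may have been installed or overwritten by a "
            "package other than nvidia-ml-py. "
            "Try: pip3 install --force-reinstall nvitop nvidia-ml-py"
        )
    return module
```

Calling `check_pynvml()` with no arguments imports the `pynvml` module that Python actually resolves, so the reported file location shows which installation is being picked up.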


> Version 0.10.0 complains about 'pynvml' has no attribute '_nvmlGetFunctionPointer' Here's sequence of working/not working. Servers have the latest versions of py3nvml and pynvml.

nvidia-ml-py, nvidia-ml-py3, and pynvml all install a module named pynvml.py, so they conflict with one another. You should uninstall pynvml and force-reinstall nvidia-ml-py. Alternatively, install nvitop in a clean virtual environment (without nvidia-ml-py3 or pynvml). Then everything should work as expected.
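
One way to see which installed distributions claim the `pynvml` module is to scan their recorded file lists with the standard-library `importlib.metadata` (Python 3.8+). This is a minimal sketch: the helper name `distributions_providing` is hypothetical. It also matches a `pynvml/` package directory, on the assumption (not stated in this thread) that some distribution might ship the module in that form.

```python
from importlib import metadata


def distributions_providing(module_name="pynvml", dists=None):
    """Return sorted names of distributions that ship a top-level `module_name`.

    Hypothetical helper for illustration. Matches either `<module_name>.py`
    or a `<module_name>/` package directory at the top level of the install.
    """
    if dists is None:
        dists = metadata.distributions()
    targets = {module_name, module_name + ".py"}
    owners = set()
    for dist in dists:
        name = dist.metadata["Name"]
        # `files` is None when the distribution has no recorded file list.
        for path in dist.files or []:
            if name and path.parts and path.parts[0] in targets:
                owners.add(name)
                break
    return sorted(owners)
```

Calling `distributions_providing()` with no arguments inspects the current environment. More than one name in the result suggests the conflict described above, and the extra distributions are candidates for `pip3 uninstall`.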

lounicotra commented 2 years ago

Thanks for looking into this Xuehai!