XuehaiPan / nvitop

An interactive NVIDIA-GPU process viewer and beyond, the one-stop solution for GPU process management.
https://nvitop.readthedocs.io
Apache License 2.0

[Enhancement] Skip error gpus and show normal infos automatically #45

Closed jue-jue-zi closed 1 year ago

jue-jue-zi commented 1 year ago

Runtime Environment

Current Behavior

There are four GPUs on our server. One of them overheated for some reason, and it can no longer be recognized. Running nvidia-smi without any arguments to query all GPUs fails with the error Unable to determine the device handle for GPU 0000:0C:00.0: Unknown Error, and the info for the remaining healthy GPUs is not shown. However, if the command selects only the healthy GPUs (nvidia-smi -i 0,1,3), their info is shown as usual.

image image

If I run the nvitop command to show the GPUs' info, nvidia-ml-py throws an exception like the one below:

image image

Expected Behavior

I hope that the nvitop command can automatically skip all GPUs with errors and show the info for the healthy GPUs. If possible, the failed GPUs' info could be shown as a note below the normal output, in red for emphasis.
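
The requested behavior can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not nvitop's actual implementation: probe_devices, format_report, and fake_get_handle are hypothetical names. With nvidia-ml-py, the callback would typically be pynvml.nvmlDeviceGetHandleByIndex and the exception to catch would be pynvml.NVMLError; a generic Exception and a fake callback stand in here so the sketch runs without a GPU.

```python
from typing import Callable, Dict, List, Tuple

# ANSI escape codes for red text in a terminal.
RED = "\033[31m"
RESET = "\033[0m"


def probe_devices(
    count: int,
    get_handle: Callable[[int], object],
) -> Tuple[Dict[int, object], Dict[int, str]]:
    """Split device indices into healthy handles and error messages."""
    healthy: Dict[int, object] = {}
    failed: Dict[int, str] = {}
    for index in range(count):
        try:
            healthy[index] = get_handle(index)
        except Exception as exc:  # assumption: pynvml.NVMLError in real code
            failed[index] = str(exc)
    return healthy, failed


def format_report(healthy: Dict[int, object], failed: Dict[int, str]) -> List[str]:
    """List healthy devices first, then failed devices highlighted in red."""
    lines = [f"GPU {index}: OK" for index in sorted(healthy)]
    lines += [
        f"{RED}GPU {index}: {message}{RESET}"
        for index, message in sorted(failed.items())
    ]
    return lines


def fake_get_handle(index: int) -> object:
    """Hypothetical stand-in for the driver call: device 2 cannot be opened."""
    if index == 2:
        raise RuntimeError("Unknown Error")
    return f"handle-{index}"


healthy, failed = probe_devices(4, fake_get_handle)
report = format_report(healthy, failed)
print("\n".join(report))
```

Catching the error per device, rather than around the whole enumeration, is what lets the three healthy GPUs still be listed when one handle cannot be obtained.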

XuehaiPan commented 1 year ago

@jue-jue-zi Thanks for the feedback! I'll add a quick fix soon.

XuehaiPan commented 1 year ago

@jue-jue-zi I pushed a new commit to handle this. You can reinstall nvitop from GitHub with:

pip3 install git+https://github.com/XuehaiPan/nvitop.git#egg=nvitop

jue-jue-zi commented 1 year ago

Thanks for the quick fix, but there still seems to be a problem:

Traceback (most recent call last):
  File "/usr/local/bin/nvitop", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/nvitop/cli.py", line 336, in main
    ui = UI(
  File "/usr/local/lib/python3.8/dist-packages/nvitop/gui/ui.py", line 43, in __init__
    self.main_screen = MainScreen(
  File "/usr/local/lib/python3.8/dist-packages/nvitop/gui/screens/main/__init__.py", line 38, in __init__
    self.device_panel = DevicePanel(self.devices, compact, win=win, root=root)
  File "/usr/local/lib/python3.8/dist-packages/nvitop/gui/screens/main/device.py", line 61, in __init__
    self.snapshots = self.take_snapshots()
  File "/usr/local/lib/python3.8/dist-packages/cachetools/func.py", line 62, in wrapper
    v = func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/nvitop/gui/screens/main/device.py", line 129, in take_snapshots
    snapshots = [device.as_snapshot() for device in self.all_devices]
  File "/usr/local/lib/python3.8/dist-packages/nvitop/gui/screens/main/device.py", line 129, in <listcomp>
    snapshots = [device.as_snapshot() for device in self.all_devices]
  File "/usr/local/lib/python3.8/dist-packages/nvitop/gui/library/device.py", line 70, in as_snapshot
    self._snapshot = super().as_snapshot()
  File "/usr/local/lib/python3.8/dist-packages/nvitop/core/device.py", line 1667, in as_snapshot
    **{key: getattr(self, key)() for key in self.SNAPSHOT_KEYS},
  File "/usr/local/lib/python3.8/dist-packages/nvitop/core/device.py", line 1667, in <dictcomp>
    **{key: getattr(self, key)() for key in self.SNAPSHOT_KEYS},
  File "/usr/local/lib/python3.8/dist-packages/nvitop/core/device.py", line 878, in memory_used
    return self.memory_info().used
  File "/usr/local/lib/python3.8/dist-packages/nvitop/core/utils.py", line 702, in wrapped
    ret = self._cache[method]  # pylint: disable=protected-access
TypeError: 'function' object is not subscriptable
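
The last frame fails because the object indexed as self._cache is a function rather than a mapping. The traceback does not show how that came about, so the following is only a generic sketch of this failure mode, with hypothetical names (Holder, lookup, make_cache); it is not nvitop's code and makes no claim about the actual cause.

```python
def make_cache():
    """Factory that is supposed to be called to obtain a fresh dict."""
    return {}


class Holder:
    def __init__(self, cache):
        # The rest of the class assumes `cache` supports indexing.
        self._cache = cache

    def lookup(self, key):
        return self._cache[key]


# Correct use: the cache is a real dict.
good = Holder({"memory_info": 42})
value = good.lookup("memory_info")

# Buggy use: the factory itself is stored instead of its return value.
bad = Holder(make_cache)
try:
    bad.lookup("memory_info")
except TypeError as exc:
    message = str(exc)

print(message)
```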

XuehaiPan commented 1 year ago

Fixed by the newest commit.

jue-jue-zi commented 1 year ago

It works now! Thanks, this is a really great project.

image

jue-jue-zi commented 1 year ago

Maybe showing the errors in red would be better.