XuehaiPan / nvitop

An interactive NVIDIA-GPU process viewer and beyond, the one-stop solution for GPU process management.
https://nvitop.readthedocs.io
Apache License 2.0
4.56k stars 144 forks

[BUG] Display issue when running in Docker environment #98

Closed. GhostArtyom closed this issue 11 months ago

GhostArtyom commented 11 months ago

Required prerequisites

What version of nvitop are you using?

nvitop 1.3.0

Operating system and version

Ubuntu 22.04.3 LTS

NVIDIA driver version

537.42

NVIDIA-SMI

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.112                Driver Version: 537.42       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4070 Ti     On  | 00000000:01:00.0  On |                  N/A |
|  0%   37C    P8              13W / 285W |   1450MiB / 12282MiB |     11%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A        23      G   /Xwayland                                 N/A      |
+---------------------------------------------------------------------------------------+

Python environment

3.7.5 (default, May 29 2023, 13:54:16) [GCC 7.5.0] linux
nvidia-ml-py==12.535.108
nvitop==1.3.0

Problem description

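Running nvitop inside Docker installed on WSL produces an interface that looks garbled.

image

After quitting with q, however, the Unicode characters are displayed correctly. image

Running nvitop -U, which draws the interface with ASCII characters, displays correctly. image

Running nvitop directly in WSL works fine. image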

Steps to Reproduce

The Python snippets (if any):

Command lines:

nvitop

Traceback

No response

Logs

[DEBUG] 2023-09-29 14:23:28,011 nvitop.api.libnvml::__determine_get_memory_info_version_suffix: Found symbol `nvmlDeviceGetMemoryInfo_v2`.
[DEBUG] 2023-09-29 14:23:28,011 nvitop.api.libnvml::__determine_get_memory_info_version_suffix: NVML get memory info version 2 is available.
[DEBUG] 2023-09-29 14:23:28,022 nvitop.api.libnvml::lookup: Found symbol `nvmlDeviceGetComputeRunningProcesses_v3`.
[DEBUG] 2023-09-29 14:23:28,023 nvitop.api.libnvml::lookup: Found symbol `nvmlDeviceGetConfComputeMemSizeInfo`.
[DEBUG] 2023-09-29 14:23:28,023 nvitop.api.libnvml::lookup: Found symbol `nvmlDeviceGetRunningProcessDetailList`.
[DEBUG] 2023-09-29 14:23:28,023 nvitop.api.libnvml::__determine_get_running_processes_version_suffix: NVML get running process version 3 API with v3 type struct is not available due to incompatible NVIDIA driver. Fallback to use get running process version 3 API with v2 type struct.

Expected behavior

I expect the Unicode version of nvitop to render correctly, without garbled characters.

Additional context

I have tried every combination of setting LANG and LC_ALL to en_US.UTF-8 and C.UTF-8; the garbled output reproduces consistently in all of them.
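
To double-check what the locale variables actually resolve to inside the container, a small script can report the settings a terminal program would inherit. This is only a diagnostic sketch; the function name `describe_locale` is made up for illustration and is not part of nvitop.

```python
import locale
import os
import sys


def describe_locale():
    """Collect the locale settings that a terminal program would inherit."""
    # Adopt the locale from the environment (LANG, LC_ALL, ...), as a
    # curses-based program typically does at startup.
    try:
        locale.setlocale(locale.LC_ALL, "")
        applied = True
    except locale.Error:
        # The requested locale is not generated or installed on this system.
        applied = False
    return {
        "LANG": os.environ.get("LANG"),
        "LC_ALL": os.environ.get("LC_ALL"),
        "locale_applied": applied,
        "preferred_encoding": locale.getpreferredencoding(False),
        "stdout_encoding": sys.stdout.encoding,
    }


if __name__ == "__main__":
    for key, value in describe_locale().items():
        print(f"{key}: {value}")
```

If this reports a UTF-8 encoding and the interface is still garbled, the locale is probably not the cause and the curses build of the interpreter is worth checking next.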

XuehaiPan commented 11 months ago

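Running nvitop inside Docker installed on WSL produces an interface that looks garbled. I have tried every combination of setting LANG and LC_ALL to en_US.UTF-8 and C.UTF-8; the garbled output reproduces consistently in all of them.

@GhostArtyom Could you share details about the docker image so that I can reproduce the problem? Please also confirm whether the ncursesw library is installed, since it provides Unicode support for ncurses.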
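
Having the ncursesw package installed does not by itself show that a given Python interpreter uses it. One way to check, sketched below, is to ask `ldd` which curses libraries the interpreter's `_curses` extension module is linked against. This assumes a Linux system where `ldd` is available and `_curses` is built as a shared object; the helper name `curses_linked_libraries` is hypothetical.

```python
import importlib.util
import shutil
import subprocess


def curses_linked_libraries():
    """Return the curses-related libraries `_curses` links to, or None if unknown."""
    spec = importlib.util.find_spec("_curses")
    # `_curses` may be missing or compiled statically into the interpreter.
    if spec is None or not spec.origin or not spec.origin.endswith(".so"):
        return None
    ldd = shutil.which("ldd")
    if ldd is None:
        return None
    result = subprocess.run(
        [ldd, spec.origin],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        universal_newlines=True,
    )
    if result.returncode != 0:
        return None
    # Keep only the library names that relate to curses/terminfo.
    return [
        line.split()[0]
        for line in result.stdout.splitlines()
        if "curses" in line or "tinfo" in line
    ]


if __name__ == "__main__":
    print(curses_linked_libraries())
```

A name containing `ncursesw` in the result suggests wide-character support; a plain `ncurses` entry suggests the interpreter was built without it.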

GhostArtyom commented 11 months ago

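Running nvitop inside Docker installed on WSL produces an interface that looks garbled. I have tried every combination of setting LANG and LC_ALL to en_US.UTF-8 and C.UTF-8; the garbled output reproduces consistently in all of them.

@GhostArtyom Could you share details about the docker image so that I can reproduce the problem? Please also confirm whether the ncursesw library is installed, since it provides Unicode support for ncurses.

I installed the MindSpore 2.1.1 + CUDA 11.6 version from https://www.mindspore.cn/install/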

docker pull swr.cn-south-1.myhuaweicloud.com/mindspore/mindspore-gpu-cuda11.6:2.1.1

image

Both the ncurses and ncursesw libraries are installed:

apt list | grep ncurses

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

fizmo-ncursesw/jammy 0.7.14-2 amd64
gambas3-gb-ncurses/jammy 3.16.3-3 amd64
lib32ncurses-dev/jammy-updates,jammy-security 6.3-2ubuntu0.1 amd64
lib32ncurses6/jammy-updates,jammy-security 6.3-2ubuntu0.1 amd64
lib32ncursesw6/jammy-updates,jammy-security 6.3-2ubuntu0.1 amd64
libcunit1-ncurses/jammy 2.1-3-dfsg-2.4 amd64
libcunit1-ncurses-dev/jammy 2.1-3-dfsg-2.4 amd64
libncurses-dev/jammy-updates,jammy-security,now 6.3-2ubuntu0.1 amd64 [installed,automatic]
libncurses-gst/jammy 3.2.5-1.3ubuntu1 all
libncurses5/jammy-updates,jammy-security 6.3-2ubuntu0.1 amd64 [upgradable from: 6.1-1ubuntu1.18.04.1]
libncurses5-dev/jammy-updates,jammy-security,now 6.3-2ubuntu0.1 amd64 [installed]
libncurses6/jammy-updates,jammy-security,now 6.3-2ubuntu0.1 amd64 [installed,automatic]
libncursesada-doc/jammy 6.2.20200212-4 all
libncursesada6.2.3/jammy 6.2.20200212-4 amd64
libncursesada9-dev/jammy 6.2.20200212-4 amd64
libncursesw5/jammy-updates,jammy-security 6.3-2ubuntu0.1 amd64 [upgradable from: 6.1-1ubuntu1.18.04.1]
libncursesw5-dev/jammy-updates,jammy-security,now 6.3-2ubuntu0.1 amd64 [installed]
libncursesw6/jammy-updates,jammy-security,now 6.3-2ubuntu0.1 amd64 [installed,automatic]
librust-ncurses-dev/jammy 5.99.0-3 amd64
ncurses-base/jammy-updates,jammy-security,now 6.3-2ubuntu0.1 all [installed]
ncurses-bin/jammy-updates,jammy-security 6.3-2ubuntu0.1 amd64 [upgradable from: 6.1-1ubuntu1.18.04]
ncurses-doc/jammy-updates,jammy-security 6.3-2ubuntu0.1 all
ncurses-examples/jammy-updates,jammy-security 6.3-2ubuntu0.1 amd64
ncurses-hexedit/jammy 0.9.7+orig-7.1 amd64
ncurses-term/jammy-updates,jammy-security 6.3-2ubuntu0.1 all
ruby-ncurses/jammy 1.4.9-1build7 amd64
wordgrinder-ncurses/jammy 0.8-1 amd64

XuehaiPan commented 11 months ago

I installed the MindSpore 2.1.1 + CUDA 11.6 version from mindspore.cn/install

docker pull swr.cn-south-1.myhuaweicloud.com/mindspore/mindspore-gpu-cuda11.6:2.1.1

@GhostArtyom Thanks for the information. I can reproduce the problem by running pip3 install nvitop immediately after entering the docker container:

$ docker run --gpus=all --rm -it -h ubuntu swr.cn-south-1.myhuaweicloud.com/mindspore/mindspore-gpu-cuda11.6:2.1.1

==========
== CUDA ==
==========

CUDA Version 11.6.2

Container image Copyright (c) 2016-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

root@ubuntu:/# pip3 install nvitop
root@ubuntu:/# nvitop
image

I found the cause: the default Python interpreter on PATH inside the docker image was built without ncurses:

root@ubuntu:/# which -a python3
/usr/local/python-3.7.5/bin/python3
root@ubuntu:/# which -a python
/usr/local/bin/python
/usr/local/bin/python
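
To compare the interpreters on PATH, each one can be asked whether its curses module exposes the wide-character API. The sketch below mimics `which -a python3` and runs a one-line probe in every interpreter it finds. It rests on an assumption: that CPython only defines `curses.unget_wch` when its `_curses` extension was compiled against the wide-character library, so treat the result as a hint rather than proof. The helper names are illustrative.

```python
import os
import subprocess

# Assumption: `unget_wch` is only present in wide-character (ncursesw) builds.
PROBE = "import curses; print(hasattr(curses, 'unget_wch'))"


def find_interpreters(name="python3"):
    """List every executable called `name` on PATH, similar to `which -a`."""
    found = []
    for directory in os.environ.get("PATH", "").split(os.pathsep):
        if not directory:
            continue
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            if candidate not in found:
                found.append(candidate)
    return found


def probe(interpreter):
    """Run the probe in `interpreter`; return True, False, or None on failure."""
    try:
        result = subprocess.run(
            [interpreter, "-c", PROBE],
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            universal_newlines=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    if result.returncode != 0:
        return None  # e.g. the curses module is missing entirely
    return result.stdout.strip() == "True"


if __name__ == "__main__":
    for path in find_interpreters():
        print(path, probe(path))
```

An interpreter that reports `False` or `None` here is a likely candidate for the garbled display, while one that reports `True` is a reasonable choice for installing nvitop.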

The solution is as follows:

apt update
apt install python3-dev python3-pip
/usr/bin/python3 -m pip install --upgrade pip setuptools
/usr/bin/python3 -m pip install nvitop
image

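Note: NVML inside docker calls the NVIDIA driver of the host system, so the PIDs it returns are PIDs on the host. This is what causes the No Such Process errors inside docker shown above. To display correct process information, add the --pid=host flag to the docker run command.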
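
The effect of the PID namespace can be illustrated with a small check of whether a given PID is visible from the current process. This is a sketch using only the standard library; it is not how nvitop itself looks up processes, and the function names are hypothetical. In practice the PIDs would come from NVML; here they are simply passed in as a list.

```python
import os


def pid_visible(pid):
    """Return True if `pid` exists in the current PID namespace."""
    if pid <= 0:
        raise ValueError("pid must be a positive integer")
    try:
        os.kill(pid, 0)  # signal 0 only performs the existence/permission check
    except ProcessLookupError:
        return False  # no such process in this namespace
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def split_by_visibility(pids):
    """Split PIDs into those visible here and those that are not."""
    visible = [pid for pid in pids if pid_visible(pid)]
    hidden = [pid for pid in pids if not pid_visible(pid)]
    return visible, hidden


if __name__ == "__main__":
    print(split_by_visibility([os.getpid()]))
```

Inside a container started without --pid=host, host PIDs reported by the driver would generally land in the hidden list, which matches the No Such Process entries.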

GhostArtyom commented 11 months ago

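@XuehaiPan Thanks for the solution 👍 nvitop works really well, and I have already recommended it to many people.

Also, the No Such Process error may be caused by WSL being unable to access the hardware 🤔 I get the same No Such Process when running nvitop directly in WSL.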

image

XuehaiPan commented 11 months ago

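Also, the No Such Process error may be caused by WSL being unable to access the hardware 🤔 I get the same No Such Process when running nvitop directly in WSL.

That problem is caused upstream by WSL; see issue #49: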