XuehaiPan / nvitop

An interactive NVIDIA-GPU process viewer and beyond, the one-stop solution for GPU process management.
https://nvitop.readthedocs.io
Apache License 2.0

[BUG] nvidia-smi pmon and nvitop -o report inconsistent SM % values with a large discrepancy #83

Closed hui-zhao-1 closed 1 year ago

hui-zhao-1 commented 1 year ago

Required prerequisites

What version of nvitop are you using?

1.2.0

Operating system and version

CentOS Linux 7 (Core)

NVIDIA driver version

470.129.06

NVIDIA-SMI

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.129.06   Driver Version: 470.129.06   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:1A:00.0 Off |                    0 |
| N/A   53C    P0   254W / 300W |  17231MiB / 32510MiB |     99%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:1B:00.0 Off |                    0 |
| N/A   54C    P0   192W / 300W |  15995MiB / 32510MiB |     97%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:3D:00.0 Off |                    0 |
| N/A   41C    P0    70W / 300W |  10499MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:3E:00.0 Off |                    0 |
| N/A   37C    P0    69W / 300W |  10981MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  On   | 00000000:88:00.0 Off |                    0 |
| N/A   50C    P0   273W / 300W |  18073MiB / 32510MiB |     98%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  On   | 00000000:89:00.0 Off |                    0 |
| N/A   49C    P0   241W / 300W |  10141MiB / 32510MiB |     99%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  On   | 00000000:B2:00.0 Off |                    0 |
| N/A   41C    P0    71W / 300W |  10499MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  On   | 00000000:B3:00.0 Off |                    0 |
| N/A   36C    P0    70W / 300W |   6493MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Python environment

3.10.12 (main, Jul 5 2023, 18:54:27) [GCC 11.2.0] linux gpustat==1.1 nvidia-ml-py==11.470.66 nvitop==1.2.0

Problem description

Running nvidia-smi pmon -i 7 and nvitop -o 7 side by side and watching the SM % of GPU 7, I found a large discrepancy between the two commands.

nvidia-smi pmon -i 7 shows SM % at 0 most of the time, with an occasional non-zero point, each of which exceeds 10% (screenshot).

nvitop -o 7 shows utilization below 10% most of the time, with an occasional 0% (screenshot).

Looking at the source code, I suspect this is related to https://github.com/XuehaiPan/nvitop/blob/main/nvitop/api/device.py line 1706, where the timestamp is computed with a - 2_000_000 adjustment.

I extracted that part of the code into a standalone test case and found that the - 2_000_000 does affect the SM % returned by the query.

Steps to Reproduce

import schedule
import time
import pynvml

timestamp = 0  # lastSeenTimeStamp passed to NVML (microseconds); 0 means "return everything in the buffer"

def test():
    global timestamp
    gpu_device_count = pynvml.nvmlDeviceGetCount()
    for gpu_index in range(gpu_device_count):
        if gpu_index != 7:  # only look at GPU 7
            continue
        handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
        try:
            processes_util = pynvml.nvmlDeviceGetProcessUtilization(handle, timestamp)
            for process in processes_util:
                print(gpu_index, str(process.pid), process.smUtil, process.timeStamp)
                # Mimic nvitop: start the next query 2 seconds before the last sample's timestamp
                timestamp = process.timeStamp - 2_000_000
                local_time = time.localtime(timestamp / 1000 / 1000)
                time_format = time.strftime('%Y-%m-%d %H:%M:%S', local_time)
            print("==============================", time_format, len(processes_util))
        except pynvml.NVMLError_NotFound:
            # NVML reports no samples for this GPU since the given timestamp
            continue

if __name__ == "__main__":
    pynvml.nvmlInit()
    schedule.every(2).seconds.do(test)
    while True:
        schedule.run_pending()
        time.sleep(2)

Traceback

With - 2_000_000 kept, the printed timestamps contain many duplicate entries with large gaps between them.
With - 2_000_000 removed, the printed timestamps look normal.

Logs

# Log with - 2_000_000 kept:
7 212485 0 1690960204701376
============================== 2023-08-02 15:10:02 1
7 212485 0 1690960204701376
============================== 2023-08-02 15:10:02 1
7 212485 0 1690960204701376
============================== 2023-08-02 15:10:02 1
7 212485 0 1690960204702492
============================== 2023-08-02 15:10:02 1
7 212485 0 1690960221427376
============================== 2023-08-02 15:10:19 1
7 212485 0 1690960221427376
============================== 2023-08-02 15:10:19 1
7 212485 0 1690960221427375
============================== 2023-08-02 15:10:19 1
7 212485 0 1690960221427375
============================== 2023-08-02 15:10:19 1
-------------------------------------------------
# Log with - 2_000_000 removed:
7 212485 0 1690960372104743
============================== 2023-08-02 15:12:52 1
7 212485 0 1690960381974198
============================== 2023-08-02 15:13:01 1
7 212485 0 1690960383981202
============================== 2023-08-02 15:13:03 1
7 212485 0 1690960385988421
============================== 2023-08-02 15:13:05 1
7 212485 0 1690960387995655
============================== 2023-08-02 15:13:07 1
7 212485 0 1690960388832086
============================== 2023-08-02 15:13:08 1
7 212485 0 1690960392010563
============================== 2023-08-02 15:13:12 1

Expected behavior

I do not understand why the timestamp computation at https://github.com/XuehaiPan/nvitop/blob/main/nvitop/api/device.py line 1706 subtracts 2_000_000, so I am not sure whether my understanding is correct, nor whether this depends on the system or versions. I only observed that the outputs of nvidia-smi pmon -i 7 and nvitop -o 7 disagree; I would expect them to be consistent.

Additional context

Page 155 of https://www.clear.rice.edu/comp422/resources/cuda/pdf/nvml.pdf explains the meaning of the timestamp argument to nvmlDeviceGetProcessUtilization. My guess is that NVIDIA keeps a buffer of SM % samples for the last n seconds. When queried with a timestamp of 0, it returns the record with the smallest timestamp in the buffer; when queried with a timestamp x, it returns the records with the smallest timestamps that are >= x. So if every query subtracts 2_000_000 from the returned timestamp, the next query keeps returning the same records, until the buffer fills up and the previously seen records are flushed out, at which point the latest data in the buffer is returned.
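To make the guess concrete, here is a minimal Python model of that hypothesized buffer. The class and its behavior are purely an assumption for illustration, not documented NVML behavior:

# Hypothetical model of the guessed sample-buffer semantics (pure assumption).
from collections import deque, namedtuple

Sample = namedtuple('Sample', ['pid', 'smUtil', 'timeStamp'])  # timeStamp in microseconds

class HypotheticalSampleBuffer:
    def __init__(self, capacity):
        self.samples = deque(maxlen=capacity)  # oldest samples are flushed once the buffer is full

    def push(self, sample):
        self.samples.append(sample)

    def query(self, last_seen_timestamp):
        # Guessed semantics: return the buffered samples whose timeStamp >= last_seen_timestamp.
        # Querying with a timestamp 2 s in the past therefore keeps re-reading old samples
        # until they age out of the buffer.
        return [s for s in self.samples if s.timeStamp >= last_seen_timestamp]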

This is pure speculation, so I would really appreciate your help clarifying it. Many thanks, and thank you for building such a useful tool!

XuehaiPan commented 1 year ago

@2581543189 Thanks for the question!

I do not understand why the timestamp computation at main/nvitop/api/device.py line 1706 subtracts 2_000_000, so I am not sure whether my understanding is correct, nor whether this depends on the system or versions

https://github.com/XuehaiPan/nvitop/blob/ec53de75b4579c319eb6e6b5c1e906d5cb90b561/nvitop/api/device.py#L1699-L1714

The extra subtraction of 2_000_000 (i.e. 2 seconds) here is meant to keep every API call returning samples whenever possible. This does mean the utilization rate cannot fully reflect the instantaneous value. Also, if a pid gets no sample back, line 1714 sets all of its utilization rates to 0. Setting the timeStamp too high could lead to the situation where the GPU has collected samples but the call returns nothing.
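Roughly, the logic in that span can be pictured as the sketch below. This is a paraphrase of the behavior described above, not the actual nvitop source; only the nvmlDeviceGetProcessUtilization query and the - 2_000_000 offset are taken from the repository.

# Paraphrased sketch of the timestamp bookkeeping (illustration, not the real code):
samples = libnvml.nvmlQuery(
    'nvmlDeviceGetProcessUtilization',
    self.handle,
    self._timestamp,  # lastSeenTimeStamp remembered from the previous query
    default=(),
)
if samples:
    # Rewind the next query's starting point by 2 s so the following call
    # is likely to find at least one sample still in the buffer.
    self._timestamp = max(sample.timeStamp for sample in samples) - 2_000_000
# Any process that got no sample back has its utilization rates reset to 0 (line 1714).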

Note: according to the man nvidia-smi documentation, GPU utilization is sampled over a period between 1 second and 1/6 second; the sampling rate of Process Utilization is presumably of a similar order.

NVIDIA NVML documentation: GRID Virtualization APIs, nvmlDeviceGetProcessUtilization

hui-zhao-1 commented 1 year ago

The extra subtraction of 2_000_000 (i.e. 2 seconds) here is meant to keep every API call returning samples whenever possible

Let me use an example to show the downside of this approach:

// nvitop-test.cu
//
// nvcc nvitop-test.cu -o nvitop-test -std=c++11
//
#include<stdio.h>
#include<thread>
#include<chrono>
#include<iostream>
#include<cuda_runtime.h>

void sleep(int milliseconds) {
        std::cout << "start sleep()" << milliseconds << " ms" << std::endl;
        auto start = std::chrono::high_resolution_clock::now();
        std::this_thread::sleep_for(std::chrono::milliseconds(milliseconds));
        auto end = std::chrono::high_resolution_clock::now();
        std::chrono::duration<double, std::milli> elapsed = end - start;
        std::cout << "stop sleep(): " << elapsed.count() << " ms" << std::endl;
}

void initialData(float* ip, int size) {
        // generate different seed for random number
        time_t t;
        srand((unsigned)time(&t));
        for (int i = 0; i < size; i++) {
                ip[i] = (float)(rand() & 0xFF) / 10.0f;
        }
}

__global__ void testMaxFlopsKernel(float* pData, long nRepeats, float v1, float v2)
{
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        float s = pData[tid], s2 = 10.0f - s, s3 = 9.0f - s, s4 = 9.0f - s2;
        for (long i = 0; i < nRepeats; i++)
        {
                s = v1 - s * v2;
                s2 = v1 - s * v2;
                s3 = v1 - s2 * v2;
                s4 = v1 - s3 * v2;
        }
        pData[tid] = ((s + s2) + (s3 + s4));
}

int main(int argc, char** argv) {
        // set up device
        int dev = 0;
        cudaSetDevice(dev);

        // set up data size of vectors
        int nElem = 1;
        printf("Vector size %d\n", nElem);
        long nRepeats = 1000000000;
        printf("nRepeats %ld\n", nRepeats);

        // malloc host memory
        size_t nBytes = nElem * sizeof(float);
        float* h_pData;
        h_pData = (float*)malloc(nBytes);

        // initialize data at host side
        initialData(h_pData, nElem);

        // malloc device global memory
        float* d_pData;
        cudaMalloc((float**)&d_pData, nBytes);

        // transfer data from host to device
        cudaMemcpy(d_pData, h_pData, nBytes, cudaMemcpyHostToDevice);

        // invoke kernel at host side
        dim3 block(1, 1, 1);
        dim3 grid(1, 1, 1);

        int index = 0;
        for (index = 0; index <= 1000000; index++) {

                std::cout << "start testMaxFlopsKernel()" << std::endl;
                auto start = std::chrono::steady_clock::now();
                testMaxFlopsKernel<<<grid, block>>>(d_pData, nRepeats, 1.0f, 2.0f);
                cudaMemcpy(h_pData, d_pData, nBytes, cudaMemcpyDeviceToHost);
                auto end = std::chrono::steady_clock::now();
                auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
                double time = ms.count();
                std::cout << "stop testMaxFlopsKernel(): " << time << " ms" << std::endl;
                sleep(10000);
        }
        cudaFree(d_pData);
        free(h_pData);
        return(0);
}

The above is a CUDA program that can be compiled with nvcc nvitop-test.cu -o nvitop-test -std=c++11. Its logic is: sleep for 10 s, then submit a kernel to the GPU; on my V100 test machine the kernel takes about 2 s to run. The corresponding log is:

start sleep()10000 ms
stop sleep(): 10000.1 ms
start testMaxFlopsKernel()
stop testMaxFlopsKernel(): 2286 ms
start sleep()10000 ms
stop sleep(): 10000.1 ms
start testMaxFlopsKernel()
stop testMaxFlopsKernel(): 2317 ms
start sleep()10000 ms
stop sleep(): 10000.1 ms
start testMaxFlopsKernel()
stop testMaxFlopsKernel(): 2299 ms
start sleep()10000 ms
stop sleep(): 10000.1 ms

Then I wrote a Python program that collects nvitop's SM % readings into Prometheus and plots the curve with Grafana to illustrate the problem. The collection code is as follows:

cat <<EOF | tee /etc/apt/sources.list
deb http://mirrors.tuna.tsinghua.edu.cn/ubuntu/ focal main restricted universe multiverse
deb http://mirrors.tuna.tsinghua.edu.cn/ubuntu/ focal-updates main restricted universe multiverse
deb http://mirrors.tuna.tsinghua.edu.cn/ubuntu/ focal-backports main restricted universe multiverse
EOF

export http_proxy=http://opst:2C8nt8fVEn@10.1.8.50:33128
export https_proxy=http://opst:2C8nt8fVEn@10.1.8.50:33128
export no_proxy=localhost,127.0.0.1,.sensetime.com,.pjlab.org.cn,

pip install flask
pip install schedule
pip install nvitop
# prometheus.py

import os
import re
import threading
from flask import Response, Flask
import prometheus_client
from prometheus_client import Gauge,CollectorRegistry
import asyncio
import schedule
import time
import socket
from nvitop.api import libnvml
from nvitop.gui import Device, colored
import sys
import json

ip_addr=socket.gethostbyname(socket.gethostname())
registry = CollectorRegistry(auto_describe=False)
device_count = 0

def doUpdateMetrics():
    global registry
    newRegistry = CollectorRegistry(auto_describe=False)
    gpu_pid_sm_util =  Gauge("gpu_pid_sm_util", "gpu_pid_sm_util",["ip_addr","gpu_index","pid"], registry=newRegistry)
    gpu_pid_mem_used =  Gauge("gpu_pid_mem_used", "gpu_pid_mem_used",["ip_addr","gpu_index","pid"], registry=newRegistry)
    gpu_pid_mem_total =  Gauge("gpu_pid_mem_total", "gpu_pid_mem_total",["ip_addr","gpu_index","pid"], registry=newRegistry)

    mem_total={}
    sm_util={}
    mem_used={}
    indices = set(range(device_count))
    devices = Device.from_indices(sorted(indices))

    for device in devices:
        mem_total[str(device.index)] = int(device.memory_total() / 1024 / 1024)
        processes = device.processes().values()
        for process in processes:
            sm_util[(str(device.index),str(process.pid))] = process.gpu_sm_utilization()
            mem_used[(str(device.index),str(process.pid))] = int(process._gpu_memory / 1024 / 1024)

    for key in sm_util:
        pid = key[1]
        gpu_index = key[0]
        util = sm_util[key]
        gpu_pid_sm_util.labels(ip_addr,gpu_index,pid).set(util)

    for key in mem_used:
        pid = key[1]
        gpu_index = key[0]
        if gpu_index not in mem_total:
            continue
        total = mem_total[gpu_index]
        used = mem_used[key]
        gpu_pid_mem_total.labels(ip_addr,gpu_index,pid).set(total)
        gpu_pid_mem_used.labels(ip_addr,gpu_index,pid).set(used)
    registry = newRegistry

def updateMetrics():
    global device_count
    loop =  asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    try:
        device_count = Device.count()
    except libnvml.NVMLError_LibraryNotFound:
        print("libnvml.NVMLError_LibraryNotFound")
        return
    except libnvml.NVMLError as ex:
        print(
            '{} {}'.format(colored('NVML ERROR:', color='red', attrs=('bold',)), ex),
            file=sys.stderr,
        )
        return
    schedule.every(2).seconds.do(doUpdateMetrics)
    while True:
        schedule.run_pending()
        time.sleep(2)

app = Flask(__name__)
@app.route("/metrics")
def metrics():
    return Response(prometheus_client.generate_latest(registry),mimetype="text/plain")

if __name__ == "__main__":
    thread1 = threading.Thread(target=updateMetrics)
    thread1.start()
    app.run(host="0.0.0.0",port=5000)

After collecting the metrics, the SM utilization of this process looks like the chart below:

image

hui-zhao-1 commented 1 year ago

This program clearly sleeps for 10 s and then works for 2 s, and while it is working the GPU utilization is 100%. Yet nvitop reports that the GPU is working the whole time with no idle periods, at roughly 20% utilization, which does not reflect the real GPU usage.

hui-zhao-1 commented 1 year ago

The figure below shows the result from nvidia-smi pmon: (screenshot)

The figure below shows the data collected by https://github.com/NVIDIA/dcgm-exporter. Since its sampling interval is 10 s it is not precise, but the overall shape of the curve is correct:

image

XuehaiPan commented 1 year ago

@2581543189 Thanks for providing such a detailed reproduction script! (I updated the Markdown formatting of your comment to improve readability.)

Does removing - 2_000_000 or reducing the value fix the problem you described? I will also test it locally.

This program clearly sleeps for 10 s and then works for 2 s, and while it is working the GPU utilization is 100%. Yet nvitop reports that the GPU is working the whole time with no idle periods, at roughly 20% utilization, which does not reflect the real GPU usage.

The averaging here is an internal NVML mechanism; I have not found detailed documentation on it. For use cases that need a high sampling rate, the extra 2 s of smoothing can indeed cause problems.


Also, I noticed that your reproduction script uses:

from nvitop.gui import Device, colored

The APIs in the nvitop.gui submodule are not part of the public interface and are licensed under GPL-3.0 (nvitop.api is Apache-2.0). You can change it to:

from nvitop import Device, colored

hui-zhao-1 commented 1 year ago

You can verify it with the code below:

# test.py
import schedule
import time
import pynvml

timestamp = 0

def test():
    global timestamp
    gpu_device_count = pynvml.nvmlDeviceGetCount()
    for gpu_index in range(gpu_device_count):
        handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
        try:
            processes_util = pynvml.nvmlDeviceGetProcessUtilization(handle, timestamp)
            for process in processes_util:
                print(gpu_index, str(process.pid), process.smUtil, process.timeStamp)
                timestamp = process.timeStamp - 2_000_000
                local_time = time.localtime(timestamp / 1000 / 1000)
                time_format = time.strftime('%Y-%m-%d %H:%M:%S', local_time)
            print("==============================", time_format, processes_util[0].smUtil)
        except pynvml.NVMLError_NotFound:
            continue

if __name__ == "__main__":
    pynvml.nvmlInit()
    schedule.every(1).seconds.do(test)
    while True:
        schedule.run_pending()
        time.sleep(1)

With the earlier CUDA program running, if - 2_000_000 is kept the output is:

============================== 2023-08-03 17:07:26 0
============================== 2023-08-03 17:07:26 0
============================== 2023-08-03 17:07:35 0
============================== 2023-08-03 17:07:34 40
============================== 2023-08-03 17:07:34 44
============================== 2023-08-03 17:07:34 39
============================== 2023-08-03 17:07:34 33
============================== 2023-08-03 17:07:34 29
============================== 2023-08-03 17:07:42 26
============================== 2023-08-03 17:07:41 0
============================== 2023-08-03 17:07:41 0
============================== 2023-08-03 17:07:41 0
============================== 2023-08-03 17:07:41 0
============================== 2023-08-03 17:07:41 0
============================== 2023-08-03 17:07:41 0
============================== 2023-08-03 17:07:49 0
============================== 2023-08-03 17:07:48 64
============================== 2023-08-03 17:07:48 46
============================== 2023-08-03 17:07:48 38
============================== 2023-08-03 17:07:48 33
============================== 2023-08-03 17:07:48 29
============================== 2023-08-03 17:07:48 26
============================== 2023-08-03 17:07:48 23
============================== 2023-08-03 17:07:57 21
============================== 2023-08-03 17:07:56 0

If - 2_000_000 is removed, the output is:

============================== 2023-08-03 17:08:37 0
============================== 2023-08-03 17:08:38 0
============================== 2023-08-03 17:08:39 0
============================== 2023-08-03 17:08:40 85
============================== 2023-08-03 17:08:41 100
============================== 2023-08-03 17:08:42 45
============================== 2023-08-03 17:08:43 0
============================== 2023-08-03 17:08:44 0
============================== 2023-08-03 17:08:45 0
============================== 2023-08-03 17:08:46 0
============================== 2023-08-03 17:08:47 0
============================== 2023-08-03 17:08:48 0
============================== 2023-08-03 17:08:49 0
============================== 2023-08-03 17:08:50 0
============================== 2023-08-03 17:08:51 0
============================== 2023-08-03 17:08:52 57
============================== 2023-08-03 17:08:53 100
============================== 2023-08-03 17:08:54 70
============================== 2023-08-03 17:08:55 0
============================== 2023-08-03 17:08:56 0
============================== 2023-08-03 17:08:57 0
============================== 2023-08-03 17:08:58 0
hui-zhao-1 commented 1 year ago

image

XuehaiPan commented 1 year ago

@2581543189 I have opened a PR that removes this extra -2 s offset. You can try it with:

pip3 install git+https://github.com/XuehaiPan/nvitop.git@process-utilization

Note that if device.processes() is called at time t, then device._timestamp = t. When device.processes() is called again δt later, the timestamp used is t rather than a value near t + δt. When δt is large (e.g. > 2 s), the data will still be smoothed.
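As an illustrative calculation under the averaging assumption discussed above: with the CUDA test program (10 s sleep, 2 s of 100% SM work), a query window that spans a full sleep-plus-kernel cycle averages out to roughly 2 / 12 ≈ 17%, close to the flat ~20% curve reported earlier.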

Also, using nvidia-smi pmon or starting the daemon process changes NVML's sampling rate; I am not yet sure whether that affects the results.

hui-zhao-1 commented 1 year ago

I reinstalled nvitop with

pip3 install git+https://github.com/XuehaiPan/nvitop.git@process-utilization

Then, following the CUDA program example from my earlier reply, I collected monitoring data again and saw no change. The collected metrics are shown below: (screenshot)

Then I thought: if each process query passed the current time as _timestamp instead of the previous sample's time, the effect of δt could be avoided. So I forked the repository and made the corresponding change:

https://github.com/XuehaiPan/nvitop/compare/main...2581543189:nvitop:now-timestamp

After installing with the command below, the collected statistics match my expectation:

pip3 install git+https://github.com/2581543189/nvitop.git@now-timestamp

The monitoring data is shown in the figure below: (screenshot)

XuehaiPan commented 1 year ago

@2581543189 Thanks for the new feedback. I updated the implementation in PR #85 to always call the NVML API with the current epoch timestamp:

samples = libnvml.nvmlQuery(
    'nvmlDeviceGetProcessUtilization',
    self.handle,
-   self._timestamp,
+   time.time_ns() // 1000,
    default=(),
    )
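(Note: time.time_ns() // 1000 yields the current epoch time in microseconds, matching the microsecond unit that NVML's lastSeenTimeStamp parameter expects; the reproduction scripts above likewise divide these timestamps by 1000 / 1000 to convert back to seconds.)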

You can try it with:

pip3 install git+https://github.com/XuehaiPan/nvitop.git@process-utilization

It is not yet clear whether such an aggressive timestamp strategy (always using the timestamp of the moment of the call) leaves the sample buffer empty all of the time, or most of the time. nvtop's internal implementation uses the largest timestamp among the samples obtained in the previous query as the timestamp for the next call:

https://github.com/Syllo/nvtop/blob/be47f8c560487efc6e6a419d59c69bfbdb819324/src/extract_gpuinfo_nvidia.c#L571-L608
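A minimal Python sketch of that strategy, paraphrased from the linked C code and using the same pynvml calls as the reproduction scripts in this thread (the polling loop and variable names are illustrative):

import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
last_seen_timestamp = 0  # 0 on the first call: take whatever is already in the buffer
while True:
    try:
        samples = pynvml.nvmlDeviceGetProcessUtilization(handle, last_seen_timestamp)
    except pynvml.NVMLError_NotFound:
        samples = []
    for sample in samples:
        print(sample.pid, sample.smUtil, sample.timeStamp)
    if samples:
        # nvtop-style: remember the newest sample's timestamp so the next call
        # only returns samples newer than what has already been seen.
        last_seen_timestamp = max(sample.timeStamp for sample in samples)
    time.sleep(1)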


Update: my local tests show that using timestamp = time.time_ns() // 1000 directly leaves the buffer empty most of the time, with no samples returned. The latest commit adds an extra 1/4-second margin:

samples = libnvml.nvmlQuery(
    'nvmlDeviceGetProcessUtilization',
    self.handle,
-   self._timestamp,
+   time.time_ns() // 1000 - 250_000,
    default=(),
    )
hui-zhao-1 commented 1 year ago

My tests hit the same problem. In the change https://github.com/XuehaiPan/nvitop/compare/main...2581543189:nvitop:now-timestamp, the int(datetime.datetime.now().timestamp()) operation effectively subtracts a random 0-999 ms from the current time; after monitoring over a longer period, I also saw many empty samples:

image
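A small illustration of that truncation (standard library only; the actual code in the fork is not reproduced here):

import datetime

now = datetime.datetime.now()          # e.g. 15:13:07.842000
whole_seconds = int(now.timestamp())   # truncates to 15:13:07, dropping the fractional part
dropped_ms = (now.timestamp() - whole_seconds) * 1000
print(dropped_ms)                      # somewhere between 0 and 999 ms, depending on when it runs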

Testing with pip3 install git+https://github.com/XuehaiPan/nvitop.git@process-utilization, the empty-sample behavior is even more pronounced:

image

I am curious why nvidia-smi pmon does not have this problem and wanted to see how it is implemented, but after searching I found that nvidia-smi is not open source.

At this point it feels like, no matter how lastSeenTimeStamp is passed, nvmlDeviceGetProcessUtilization cannot accurately reflect the real SM usage of the nvitop-test program.

XuehaiPan commented 1 year ago

Testing with pip3 install git+https://github.com/XuehaiPan/nvitop.git@process-utilization, the empty-sample behavior is even more pronounced

@2581543189 I added an extra 1/4-second margin in the new commit, and my local tests look quite good. Below is a run of the test program you provided in https://github.com/XuehaiPan/nvitop/issues/83#issuecomment-1663404181 (with some parameters changed). The left side is PR #85, the right side is main (v1.2.0):

image

Compared with v1.2.0, the modified version has noticeably lower latency. The curve is much closer to a square wave, and the slopes on both sides of the peaks are narrower.

hui-zhao-1 commented 1 year ago

The code with the extra 1/4-second margin also works correctly in my testing. (screenshot) The earlier screenshots showing no samples at all were caused by my own operational mistake; the tool itself is fine.

XuehaiPan commented 1 year ago

The code with the extra 1/4-second margin also works correctly in my testing

Great! Thanks for the feedback!