NVIDIA / dcgm-exporter

NVIDIA GPU metrics exporter for Prometheus leveraging DCGM
Apache License 2.0

DCGM_FI_DEV_MEM_COPY_UTIL not correct always 1 or 2 #345

Closed — xuchenCN closed this issue 3 months ago

xuchenCN commented 3 months ago

What is the version?

3.3.6-3.4.2

What happened?

nvidia-smi

Mon Jun 24 15:48:31 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-PCIE-32GB           Off | 00000000:2F:00.0 Off |                    0 |
| N/A   53C    P0              47W / 250W |  31566MiB / 32768MiB |     10%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE-32GB           Off | 00000000:86:00.0 Off |                    0 |
| N/A   49C    P0              31W / 250W |      0MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A   3348219      C   python                                    31562MiB |
+---------------------------------------------------------------------------------------+

Metrics:

DCGM_FI_DEV_MEM_COPY_UTIL{gpu="0",UUID="GPU-597e0abb-006d-64d5-5244-c7bc889e02d6",device="nvidia0",modelName="Tesla V100-PCIE-32GB",Hostname="120-gpu-c28",DCGM_FI_DRIVER_VERSION="535.161.08",container="xxxx",namespace="xxxx",pod="xxxxx"} 1

The value is always 1 or 2.

What did you expect to happen?

The metric should show the correct memory utilization in %.

What is the GPU model?

No response

What is the environment?

No response

How did you deploy the dcgm-exporter and what is the configuration?

No response

How to reproduce the issue?

No response

Anything else we need to know?

2024/06/24 07:30:37 maxprocs: Leaving GOMAXPROCS=48: CPU quota undefined
time="2024-06-24T07:30:37Z" level=info msg="Starting dcgm-exporter"
time="2024-06-24T07:30:37Z" level=info msg="DCGM successfully initialized!"
time="2024-06-24T07:30:37Z" level=info msg="Collecting DCP Metrics"
time="2024-06-24T07:30:37Z" level=info msg="Falling back to metric file '/etc/dcgm-exporter/default-counters.csv'"
time="2024-06-24T07:30:37Z" level=info msg="Initializing system entities of type: GPU"
time="2024-06-24T07:30:37Z" level=info msg="Not collecting NvSwitch metrics; no fields to watch for device type: 3"
time="2024-06-24T07:30:37Z" level=info msg="Not collecting NvLink metrics; no fields to watch for device type: 6"
time="2024-06-24T07:30:37Z" level=info msg="Not collecting CPU metrics; no fields to watch for device type: 7"
time="2024-06-24T07:30:37Z" level=info msg="Not collecting CPU Core metrics; no fields to watch for device type: 8"
time="2024-06-24T07:30:37Z" level=info msg="Kubernetes metrics collection enabled!"
time="2024-06-24T07:30:37Z" level=info msg="Pipeline starting"
time="2024-06-24T07:30:37Z" level=info msg="Starting webserver"
time="2024-06-24T07:30:37Z" level=info msg="Listening on" address="[::]:9400"
time="2024-06-24T07:30:37Z" level=info msg="TLS is disabled." address="[::]:9400" http2=false
nvvfedorov commented 3 months ago

Hi @xuchenCN , Could you please explain why you think the values 1 or 2 might not be accurate?

xuchenCN commented 3 months ago

> Hi @xuchenCN , Could you please explain why you think the values 1 or 2 might not be accurate?

As the documentation says, DCGM_FI_DEV_MEM_COPY_UTIL means Memory Utilization.

As the metric descriptor says:

# HELP DCGM_FI_DEV_MEM_COPY_UTIL Memory utilization (in %).
# TYPE DCGM_FI_DEV_MEM_COPY_UTIL gauge

nvidia-smi shows memory usage of 31566MiB / 32768MiB, so I expected the memory utilization in % to reflect that. I don't know why dcgm-exporter reports "Memory utilization (in %)" as 1 or 2.
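An editorial note for context: the percentage the reporter expects can be computed directly from the nvidia-smi figures quoted above. The mismatch is likely because DCGM_FI_DEV_MEM_COPY_UTIL samples how busy the memory interface is over time, not how full the framebuffer is. A minimal sketch of the arithmetic (the numbers are taken from the nvidia-smi table in this thread):

```python
# Framebuffer occupancy the reporter expects, from the nvidia-smi
# output above (GPU 0: 31566MiB used of 32768MiB total).
used_mib, total_mib = 31566, 32768

occupancy_pct = 100.0 * used_mib / total_mib
print(f"framebuffer occupancy: {occupancy_pct:.1f}%")  # prints 96.3%

# DCGM_FI_DEV_MEM_COPY_UTIL instead reports the fraction of time the
# memory was being read or written during the sample interval, so a
# mostly idle GPU can legitimately report 1-2% even with a nearly
# full framebuffer.
```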

xuchenCN commented 3 months ago

OK, using the DCGM_FI_DEV_FB_USED_PERCENT metric resolved my issue, thx.
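For anyone verifying the same fix, one way is to scrape the exporter's /metrics endpoint and pull out the replacement metric from the Prometheus exposition text. A minimal sketch; the sample text and its value below are fabricated for illustration, only the metric name comes from this thread:

```python
import re

# Fabricated exposition-format sample, shaped like dcgm-exporter output.
sample = """\
# HELP DCGM_FI_DEV_FB_USED_PERCENT Percentage used of Frame Buffer.
# TYPE DCGM_FI_DEV_FB_USED_PERCENT gauge
DCGM_FI_DEV_FB_USED_PERCENT{gpu="0",UUID="GPU-xxxx"} 0.963195
"""

def gauge_values(text: str, name: str) -> list[float]:
    """Return every sample value for the given metric name,
    with or without a {label="..."} set after the name."""
    pattern = re.compile(rf'^{re.escape(name)}(?:{{[^}}]*}})?\s+(\S+)$', re.M)
    return [float(m.group(1)) for m in pattern.finditer(text)]

values = gauge_values(sample, "DCGM_FI_DEV_FB_USED_PERCENT")
print(values)  # -> [0.963195]
```

In a real setup you would fetch the text from the exporter (by default it listens on port 9400, as the startup log above shows) instead of using a hard-coded sample.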