NVIDIA / gpu-operator

NVIDIA GPU Operator creates/configures/manages GPUs atop Kubernetes
Apache License 2.0

The DCGM has a memory leak? #340

Open happy2048 opened 2 years ago

happy2048 commented 2 years ago

Environment

● Kubernetes: 1.20.11
● OS: CentOS 7 (3.10.0-1160.15.2.el7.x86_64)
● Docker: 19.03.15
● NVIDIA Driver Version: 470.57.02
● GPU: 4 * Tesla V100-SXM2-32GB
● GPU Operator Chart: v1.0.0-devel
● DCGM Docker Image: nvcr.io/nvidia/cloud-native/dcgm:2.2.3-ubuntu20.04
● DCGM Exporter Docker Image: nvcr.io/nvidia/k8s/dcgm-exporter:2.2.9-2.4.0-ubuntu20.04

Issue description

My Kubernetes cluster has 5 GPU nodes:

# kubectl get nodes
NAME                        STATUS   ROLES    AGE   VERSION
cn-hongkong.192.168.1.115   Ready    <none>   64d   v1.20.11-aliyun.1
cn-hongkong.192.168.3.71    Ready    <none>   42h   v1.20.11-aliyun.1
cn-hongkong.192.168.3.72    Ready    <none>   64d   v1.20.11-aliyun.1
cn-hongkong.192.168.3.73    Ready    <none>   64d   v1.20.11-aliyun.1
cn-hongkong.192.168.3.74    Ready    <none>   64d   v1.20.11-aliyun.1

I deployed the gpu-operator components in the nvidia namespace:

# kubectl get po -n nvidia
NAME                                                          READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-8n7zp                                   1/1     Running     0          42h
gpu-feature-discovery-94pz2                                   1/1     Running     0          42h
gpu-feature-discovery-c6pxp                                   1/1     Running     0          42h
gpu-feature-discovery-fvhrw                                   1/1     Running     0          42h
gpu-feature-discovery-jfqgp                                   1/1     Running     0          42h
gpu-operator-75d64c8b9f-prrdb                                 1/1     Running     0          42h
gpu-operator-node-feature-discovery-master-58d884d5cc-x9f54   1/1     Running     1          42h
gpu-operator-node-feature-discovery-worker-6qmhf              1/1     Running     0          42h
gpu-operator-node-feature-discovery-worker-8skp6              1/1     Running     0          42h
gpu-operator-node-feature-discovery-worker-dgwjz              1/1     Running     0          42h
gpu-operator-node-feature-discovery-worker-qknfs              1/1     Running     0          42h
gpu-operator-node-feature-discovery-worker-slmd8              1/1     Running     0          42h
nvidia-container-toolkit-daemonset-5kj7z                      1/1     Running     0          42h
nvidia-container-toolkit-daemonset-t84ns                      1/1     Running     0          42h
nvidia-container-toolkit-daemonset-tk6hg                      1/1     Running     0          42h
nvidia-container-toolkit-daemonset-vb84m                      1/1     Running     0          42h
nvidia-container-toolkit-daemonset-xskw9                      1/1     Running     0          42h
nvidia-cuda-validator-9br2f                                   0/1     Completed   0          42h
nvidia-cuda-validator-9dl2q                                   0/1     Completed   0          42h
nvidia-cuda-validator-wcxnp                                   0/1     Completed   0          42h
nvidia-cuda-validator-zfwxc                                   0/1     Completed   0          42h
nvidia-cuda-validator-znzrh                                   0/1     Completed   0          42h
nvidia-dcgm-6pqml                                             1/1     Running     0          42h
nvidia-dcgm-8fqlm                                             1/1     Running     0          42h
nvidia-dcgm-exporter-4499t                                    1/1     Running     0          42h
nvidia-dcgm-exporter-f829p                                    1/1     Running     0          42h
nvidia-dcgm-exporter-nlvtj                                    1/1     Running     0          42h
nvidia-dcgm-exporter-svmbs                                    1/1     Running     0          42h
nvidia-dcgm-exporter-zgnzd                                    1/1     Running     4          42h
nvidia-dcgm-fkzlc                                             1/1     Running     0          42h
nvidia-dcgm-gqpm5                                             1/1     Running     0          42h
nvidia-dcgm-gsdct                                             1/1     Running     0          42h
nvidia-device-plugin-daemonset-7fd6l                          1/1     Running     0          42h
nvidia-device-plugin-daemonset-fm265                          1/1     Running     0          42h
nvidia-device-plugin-daemonset-jc6gb                          1/1     Running     0          42h
nvidia-device-plugin-daemonset-n7jpw                          1/1     Running     0          42h
nvidia-device-plugin-daemonset-szqmc                          1/1     Running     0          42h
nvidia-device-plugin-validator-92g2g                          0/1     Completed   0          42h
nvidia-device-plugin-validator-ch6bb                          0/1     Completed   0          42h
nvidia-device-plugin-validator-ngfc5                          0/1     Completed   0          42h
nvidia-device-plugin-validator-qfvhx                          0/1     Completed   0          42h
nvidia-device-plugin-validator-s9wmk                          0/1     Completed   0          42h
nvidia-driver-daemonset-jgg2p                                 1/1     Running     0          42h
nvidia-driver-daemonset-p5jrm                                 1/1     Running     0          42h
nvidia-driver-daemonset-plzfg                                 1/1     Running     0          42h
nvidia-driver-daemonset-qzlgh                                 1/1     Running     0          42h
nvidia-driver-daemonset-spk8w                                 1/1     Running     0          42h
nvidia-operator-validator-6rhx8                               1/1     Running     0          42h
nvidia-operator-validator-8lqkk                               1/1     Running     0          42h
nvidia-operator-validator-gtghz                               1/1     Running     0          42h
nvidia-operator-validator-jcdhw                               1/1     Running     0          42h
nvidia-operator-validator-kc7b6                               1/1     Running     0          42h

Then I deployed some pods requesting GPUs in the default namespace. They are running on node cn-hongkong.192.168.3.71 and run https://github.com/tensorflow/benchmarks/tree/cnn_tf_v2.1_compatible inside the pod to exercise the GPUs:

# kubectl get po -o wide 
NAME                                  READY   STATUS    RESTARTS   AGE   IP          NODE                       NOMINATED NODE   READINESS GATES
tensorflow-benchmark-gpushare-2qldj   1/1     Running   0          42h   10.2.1.85   cn-hongkong.192.168.3.71   <none>           <none>
tensorflow-benchmark-gpushare-4w89g   1/1     Running   0          42h   10.2.1.75   cn-hongkong.192.168.3.71   <none>           <none>
tensorflow-benchmark-gpushare-8lvc9   1/1     Running   0          42h   10.2.1.78   cn-hongkong.192.168.3.71   <none>           <none>
tensorflow-benchmark-gpushare-9fd2f   1/1     Running   0          42h   10.2.1.81   cn-hongkong.192.168.3.71   <none>           <none>
tensorflow-benchmark-gpushare-9lsfl   1/1     Running   0          42h   10.2.1.79   cn-hongkong.192.168.3.71   <none>           <none>
tensorflow-benchmark-gpushare-nxjnl   1/1     Running   0          42h   10.2.1.74   cn-hongkong.192.168.3.71   <none>           <none>
tensorflow-benchmark-gpushare-pg4j7   1/1     Running   0          42h   10.2.1.76   cn-hongkong.192.168.3.71   <none>           <none>
tensorflow-benchmark-gpushare-s2sp6   1/1     Running   0          42h   10.2.1.77   cn-hongkong.192.168.3.71   <none>           <none>
tensorflow-benchmark-gpushare-shnrd   1/1     Running   0          42h   10.2.1.80   cn-hongkong.192.168.3.71   <none>           <none>
tensorflow-benchmark-gpushare-sxzkq   1/1     Running   0          42h   10.2.1.82   cn-hongkong.192.168.3.71   <none>           <none>
tensorflow-benchmark-gpushare-vvzgl   1/1     Running   0          42h   10.2.1.83   cn-hongkong.192.168.3.71   <none>           <none>
tensorflow-benchmark-gpushare-wprqt   1/1     Running   0          42h   10.2.1.84   cn-hongkong.192.168.3.71   <none>           <none>
tensorflow-benchmark-w7dh4            1/1     Running   0          42h   10.2.1.86   cn-hongkong.192.168.3.71   <none>           <none>

The DCGM pod nvidia-dcgm-gsdct is also running on node cn-hongkong.192.168.3.71:

# kubectl get po -n nvidia nvidia-dcgm-gsdct -o wide
NAME                READY   STATUS    RESTARTS   AGE   IP             NODE                       NOMINATED NODE   READINESS GATES
nvidia-dcgm-gsdct   1/1     Running   0          45h   192.168.3.71   cn-hongkong.192.168.3.71   <none>           <none>

After that I did nothing else in the cluster. Below is the memory usage I recorded for the DCGM pods:
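
For reference, a loop like the hypothetical sketch below records the same kind of sample (the one-hour interval and the log file name are arbitrary):

# Sketch: periodically record CPU/memory usage of the DCGM pods
while true; do
    date -u
    kubectl top po -n nvidia | grep nvidia-dcgm
    echo
    sleep 3600
done >> dcgm-memory-usage.log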

Tue Apr 19 08:49:56 UTC 2022
# kubectl top po -n nvidia | grep nvidia-dcgm
nvidia-dcgm-6pqml                                             30m          108Mi
nvidia-dcgm-8fqlm                                             15m          104Mi
nvidia-dcgm-exporter-4499t                                    1m           23Mi
nvidia-dcgm-exporter-f829p                                    1m           26Mi
nvidia-dcgm-exporter-nlvtj                                    1m           39Mi
nvidia-dcgm-exporter-svmbs                                    0m           27Mi
nvidia-dcgm-exporter-zgnzd                                    1m           35Mi
nvidia-dcgm-fkzlc                                             28m          107Mi
nvidia-dcgm-gqpm5                                             29m          108Mi
nvidia-dcgm-gsdct                                             32m          109Mi

Wed Apr 20 02:43:59 UTC 2022
# kubectl top po -n nvidia | grep nvidia-dcgm
nvidia-dcgm-6pqml                                             31m          111Mi
nvidia-dcgm-8fqlm                                             14m          120Mi
nvidia-dcgm-exporter-4499t                                    2m           41Mi
nvidia-dcgm-exporter-f829p                                    2m           44Mi
nvidia-dcgm-exporter-nlvtj                                    1m           43Mi
nvidia-dcgm-exporter-svmbs                                    0m           28Mi
nvidia-dcgm-exporter-zgnzd                                    2m           42Mi
nvidia-dcgm-fkzlc                                             29m          110Mi
nvidia-dcgm-gqpm5                                             29m          111Mi
nvidia-dcgm-gsdct                                             46m          230Mi

Wed Apr 20 09:58:26 UTC 2022
# kubectl top po -n nvidia | grep nvidia-dcgm
nvidia-dcgm-6pqml                                             28m          112Mi
nvidia-dcgm-8fqlm                                             15m          126Mi
nvidia-dcgm-exporter-4499t                                    2m           42Mi
nvidia-dcgm-exporter-f829p                                    1m           43Mi
nvidia-dcgm-exporter-nlvtj                                    2m           43Mi
nvidia-dcgm-exporter-svmbs                                    1m           28Mi
nvidia-dcgm-exporter-zgnzd                                    1m           43Mi
nvidia-dcgm-fkzlc                                             28m          111Mi
nvidia-dcgm-gqpm5                                             30m          112Mi
nvidia-dcgm-gsdct                                             45m          282Mi

Thu Apr 21 02:39:44 UTC 2022
# kubectl top po -n nvidia | grep nvidia-dcgm
nvidia-dcgm-6pqml                                             29m          114Mi
nvidia-dcgm-8fqlm                                             14m          141Mi
nvidia-dcgm-exporter-4499t                                    1m           44Mi
nvidia-dcgm-exporter-f829p                                    1m           46Mi
nvidia-dcgm-exporter-nlvtj                                    1m           44Mi
nvidia-dcgm-exporter-svmbs                                    0m           29Mi
nvidia-dcgm-exporter-zgnzd                                    0m           45Mi
nvidia-dcgm-fkzlc                                             27m          113Mi
nvidia-dcgm-gqpm5                                             26m          114Mi
nvidia-dcgm-gsdct                                             47m          398Mi

Thu Apr 21 06:17:53 UTC 2022
# kubectl top po -n nvidia | grep nvidia-dcgm
nvidia-dcgm-6pqml                                             29m          115Mi
nvidia-dcgm-8fqlm                                             14m          144Mi
nvidia-dcgm-exporter-4499t                                    2m           43Mi
nvidia-dcgm-exporter-f829p                                    2m           46Mi
nvidia-dcgm-exporter-nlvtj                                    1m           44Mi
nvidia-dcgm-exporter-svmbs                                    1m           28Mi
nvidia-dcgm-exporter-zgnzd                                    2m           45Mi
nvidia-dcgm-fkzlc                                             30m          113Mi
nvidia-dcgm-gqpm5                                             24m          115Mi
nvidia-dcgm-gsdct                                             43m          423Mi

As you can see:

● The memory usage of pod nvidia-dcgm-gsdct increased from 109Mi to 423Mi. Why?
● If no process is using the GPUs, there is no significant change in the DCGM pod's memory usage, e.g. nvidia-dcgm-6pqml.

@shivamerla

shivamerla commented 2 years ago

@happy2048 Can you try the following and verify memory usage at each step to help narrow this down further?

  1. Edit the clusterpolicy with kubectl edit clusterpolicy and change dcgm.version to 2.3.4-1-ubuntu20.04. This aligns with the most recent version from here.
  2. Re-install with --set dcgm.enabled=false, so that dcgm-exporter uses its embedded DCGM engine instead. (Both steps are sketched as commands below.)
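
For reference, the two steps above correspond roughly to the commands below. This is a sketch: the ClusterPolicy name cluster-policy, the release name gpu-operator, the chart reference nvidia/gpu-operator, and the namespace are chart defaults and may differ in a given install.

# Step 1: point the standalone DCGM container at the newer tag
kubectl edit clusterpolicy cluster-policy
#   -> under spec.dcgm, set version: 2.3.4-1-ubuntu20.04

# Step 2: re-install so that dcgm-exporter runs its embedded DCGM engine
helm upgrade --install gpu-operator nvidia/gpu-operator \
    -n gpu-operator \
    --set dcgm.enabled=false
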
happy2048 commented 2 years ago

Ok, I will test it and report the result.

happy2048 commented 2 years ago

@shivamerla I updated dcgm-exporter to 2.3.5-2.6.5-ubuntu20.04, removed the env DCGM_REMOTE_HOSTENGINE_INFO to enable embedded mode, and set the GPU metrics collection interval to 6000 (to generate metrics more quickly).

image
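
For reference, the change amounts to roughly the following (a sketch only; the DaemonSet is managed by the operator, so the env is normally set through the chart/ClusterPolicy values rather than by editing the DaemonSet directly, and direct edits may be reverted):

# Sketch: drop the remote-hostengine env and shorten the collection interval
kubectl -n nvidia set env daemonset/nvidia-dcgm-exporter \
    DCGM_REMOTE_HOSTENGINE_INFO- \
    DCGM_EXPORTER_INTERVAL=6000

# Verify the resulting container env
kubectl -n nvidia get daemonset nvidia-dcgm-exporter \
    -o jsonpath='{.spec.template.spec.containers[0].env}'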

The CPU and memory usage:

Fri Apr 22 08:22:41 UTC 2022
# kubectl top po -n nvidia | grep nvidia-dcgm
nvidia-dcgm-2wvkz                                             1m           3Mi
nvidia-dcgm-dcmsh                                             1m           3Mi
nvidia-dcgm-exporter-cdxbn                                    8m           155Mi
nvidia-dcgm-exporter-hk25c                                    8m           155Mi
nvidia-dcgm-exporter-jrw4b                                    3m           135Mi
nvidia-dcgm-exporter-t72r6                                    9m           153Mi
nvidia-dcgm-exporter-wvbtm                                    5m           154Mi
nvidia-dcgm-g6b8d                                             1m           3Mi
nvidia-dcgm-jl52k                                             1m           5Mi
nvidia-dcgm-nbv5b                                             1m           3Mi

Sun Apr 24 10:57:13 UTC 2022
# kubectl top po -n nvidia | grep nvidia-dcgm
nvidia-dcgm-2wvkz                                             1m           3Mi
nvidia-dcgm-dcmsh                                             1m           3Mi
nvidia-dcgm-exporter-cdxbn                                    8m           194Mi
nvidia-dcgm-exporter-hk25c                                    8m           194Mi
nvidia-dcgm-exporter-jrw4b                                    3m           157Mi
nvidia-dcgm-exporter-t72r6                                    7m           201Mi
nvidia-dcgm-exporter-wvbtm                                    9m           198Mi
nvidia-dcgm-g6b8d                                             1m           3Mi
nvidia-dcgm-jl52k                                             1m           5Mi
nvidia-dcgm-nbv5b                                             1m           3Mi

Mon Apr 25 02:43:14 UTC 2022
# kubectl top po -n nvidia | grep nvidia-dcgm
nvidia-dcgm-2wvkz                                             1m           3Mi
nvidia-dcgm-dcmsh                                             1m           3Mi
nvidia-dcgm-exporter-cdxbn                                    8m           204Mi
nvidia-dcgm-exporter-hk25c                                    8m           204Mi
nvidia-dcgm-exporter-jrw4b                                    3m           163Mi
nvidia-dcgm-exporter-t72r6                                    8m           209Mi
nvidia-dcgm-exporter-wvbtm                                    10m          207Mi
nvidia-dcgm-g6b8d                                             1m           3Mi
nvidia-dcgm-jl52k                                             1m           5Mi
nvidia-dcgm-nbv5b                                             1m           3Mi

Mon Apr 25 11:26:01 UTC 2022
# kubectl top po -n nvidia | grep nvidia-dcgm
nvidia-dcgm-2wvkz                                             1m           3Mi
nvidia-dcgm-dcmsh                                             1m           3Mi
nvidia-dcgm-exporter-cdxbn                                    7m           208Mi
nvidia-dcgm-exporter-hk25c                                    10m          210Mi
nvidia-dcgm-exporter-jrw4b                                    4m           166Mi
nvidia-dcgm-exporter-t72r6                                    9m           212Mi
nvidia-dcgm-exporter-wvbtm                                    9m           211Mi
nvidia-dcgm-g6b8d                                             1m           3Mi
nvidia-dcgm-jl52k                                             1m           5Mi
nvidia-dcgm-nbv5b                                             1m           3Mi

Tue Apr 26 02:43:44 UTC 2022
# kubectl top po -n nvidia | grep nvidia-dcgm
nvidia-dcgm-2wvkz                                             1m           3Mi
nvidia-dcgm-dcmsh                                             1m           3Mi
nvidia-dcgm-exporter-cdxbn                                    7m           217Mi
nvidia-dcgm-exporter-hk25c                                    9m           220Mi
nvidia-dcgm-exporter-jrw4b                                    3m           172Mi
nvidia-dcgm-exporter-t72r6                                    9m           223Mi
nvidia-dcgm-exporter-wvbtm                                    8m           220Mi
nvidia-dcgm-g6b8d                                             1m           3Mi
nvidia-dcgm-jl52k                                             1m           5Mi
nvidia-dcgm-nbv5b                                             1m           3Mi

Wed Apr 27 03:04:26 UTC 2022
# kubectl top po -n nvidia | grep nvidia-dcgm
nvidia-dcgm-2wvkz                                             1m           3Mi
nvidia-dcgm-dcmsh                                             1m           3Mi
nvidia-dcgm-exporter-cdxbn                                    7m           231Mi
nvidia-dcgm-exporter-hk25c                                    9m           233Mi
nvidia-dcgm-exporter-jrw4b                                    5m           183Mi
nvidia-dcgm-exporter-t72r6                                    11m          236Mi
nvidia-dcgm-exporter-wvbtm                                    10m          234Mi
nvidia-dcgm-g6b8d                                             1m           3Mi
nvidia-dcgm-jl52k                                             1m           5Mi
nvidia-dcgm-nbv5b                                             1m           3Mi

Fri Apr 29 02:21:40 UTC 2022
# kubectl top po -n nvidia | grep nvidia-dcgm
nvidia-dcgm-2wvkz                                             1m           3Mi
nvidia-dcgm-dcmsh                                             1m           3Mi
nvidia-dcgm-exporter-cdxbn                                    8m           258Mi
nvidia-dcgm-exporter-hk25c                                    10m          261Mi
nvidia-dcgm-exporter-jrw4b                                    3m           202Mi
nvidia-dcgm-exporter-t72r6                                    10m          264Mi
nvidia-dcgm-exporter-wvbtm                                    8m           262Mi
nvidia-dcgm-g6b8d                                             1m           3Mi
nvidia-dcgm-jl52k                                             1m           5Mi
nvidia-dcgm-nbv5b                                             1m           3Mi

Thu May  5 02:08:31 UTC 2022
# kubectl top po -n nvidia | grep nvidia-dcgm
nvidia-dcgm-2wvkz                                             1m           3Mi
nvidia-dcgm-dcmsh                                             1m           3Mi
nvidia-dcgm-exporter-cdxbn                                    7m           339Mi
nvidia-dcgm-exporter-hk25c                                    10m          341Mi
nvidia-dcgm-exporter-jrw4b                                    4m           261Mi
nvidia-dcgm-exporter-t72r6                                    10m          343Mi
nvidia-dcgm-exporter-wvbtm                                    8m           341Mi
nvidia-dcgm-g6b8d                                             1m           3Mi
nvidia-dcgm-jl52k                                             1m           5Mi
nvidia-dcgm-nbv5b                                             1m           3Mi
happy2048 commented 2 years ago

@shivamerla In https://github.com/NVIDIA/dcgm-exporter/blob/main/pkg/dcgmexporter/dcgm.go#L83, maxKeepAge is set to 0.0. Is that correct? Does 0.0 mean no limit?

happy2048 commented 2 years ago

@shivamerla Is there a conclusion?

shivamerla commented 2 years ago

@happy2048 We are trying to reproduce this internally. I have tried the latest 510 and 470 drivers with the above-mentioned DCGM version, but couldn't reproduce it. I will try to test on a CentOS system and verify.

ubuntu@ip-172-31-46-38:~$ helm ls -n gpu-operator
NAME            NAMESPACE       REVISION        UPDATED                                 STATUS          CHART                   APP VERSION
gpu-operator    gpu-operator    1               2022-05-10 20:00:33.728500491 +0000 UTC deployed        gpu-operator-v1.10.1    v1.10.1  

$ sudo chroot /run/nvidia/driver nvidia-smi
Tue May 10 21:24:47 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   30C    P8    14W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

$ kubectl  top pods -n gpu-operator
NAME                                                          CPU(cores)   MEMORY(bytes)   
gpu-feature-discovery-f27g2                                   0m           15Mi            
gpu-operator-798c6ddc97-j6lx6                                 2m           14Mi            
gpu-operator-node-feature-discovery-master-6c65c99969-ccmv8   3m           9Mi             
gpu-operator-node-feature-discovery-worker-49x7z              4m           9Mi             
nvidia-container-toolkit-daemonset-xl9p2                      0m           8Mi             
nvidia-dcgm-exporter-9nfc7                                    3m           141Mi           
nvidia-device-plugin-daemonset-2pknr                          1m           15Mi            
nvidia-operator-validator-ccktf                               0m           1Mi 

$ curl http://10.110.55.189:9400/metrics
# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
# TYPE DCGM_FI_DEV_SM_CLOCK gauge
DCGM_FI_DEV_SM_CLOCK{gpu="0",UUID="GPU-e1200d64-e918-4277-cad9-75da90b9f618",device="nvidia0",modelName="Tesla T4",Hostname="nvidia-dcgm-exporter-ml2wh",container="dcgmproftester11",namespace="default",pod="dcgmproftester"} 300
# HELP DCGM_FI_DEV_MEM_CLOCK Memory clock frequency (in MHz).
# TYPE DCGM_FI_DEV_MEM_CLOCK gauge
DCGM_FI_DEV_MEM_CLOCK{gpu="0",UUID="GPU-e1200d64-e918-4277-cad9-75da90b9f618",device="nvidia0",modelName="Tesla T4",Hostname="nvidia-dcgm-exporter-ml2wh",container="dcgmproftester11",namespace="default",pod="dcgmproftester"} 405
# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="GPU-e1200d64-e918-4277-cad9-75da90b9f618",device="nvidia0",modelName="Tesla T4",Hostname="nvidia-dcgm-exporter-ml2wh",container="dcgmproftester11",namespace="default",pod="dcgmproftester"} 29
# HELP DCGM_FI_DEV_POWER_USAGE Power draw (in W).
# TYPE DCGM_FI_DEV_POWER_USAGE gauge
DCGM_FI_DEV_POWER_USAGE{gpu="0",UUID="GPU-e1200d64-e918-4277-cad9-75da90b9f618",device="nvidia0",modelName="Tesla T4",Hostname="nvidia-dcgm-exporter-ml2wh",container="dcgmproftester11",namespace="default",pod="dcgmproftester"} 14.685000
# HELP DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION Total energy consumption since boot (in mJ).
# TYPE DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION counter
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION{gpu="0",UUID="GPU-e1200d64-e918-4277-cad9-75da90b9f618",device="nvidia0",modelName="Tesla T4",Hostname="nvidia-dcgm-exporter-ml2wh",container="dcgmproftester11",namespace="default",pod="dcgmproftester"} 19425766
# HELP DCGM_FI_DEV_PCIE_REPLAY_COUNTER Total number of PCIe retries.
# TYPE DCGM_FI_DEV_PCIE_REPLAY_COUNTER counter
DCGM_FI_DEV_PCIE_REPLAY_COUNTER{gpu="0",UUID="GPU-e1200d64-e918-4277-cad9-75da90b9f618",device="nvidia0",modelName="Tesla T4",Hostname="nvidia-dcgm-exporter-ml2wh",container="dcgmproftester11",namespace="default",pod="dcgmproftester"} 0
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-e1200d64-e918-4277-cad9-75da90b9f618",device="nvidia0",modelName="Tesla T4",Hostname="nvidia-dcgm-exporter-ml2wh",container="dcgmproftester11",namespace="default",pod="dcgmproftester"} 0
# HELP DCGM_FI_DEV_MEM_COPY_UTIL Memory utilization (in %).
# TYPE DCGM_FI_DEV_MEM_COPY_UTIL gauge
DCGM_FI_DEV_MEM_COPY_UTIL{gpu="0",UUID="GPU-e1200d64-e918-4277-cad9-75da90b9f618",device="nvidia0",modelName="Tesla T4",Hostname="nvidia-dcgm-exporter-ml2wh",container="dcgmproftester11",namespace="default",pod="dcgmproftester"} 0
# HELP DCGM_FI_DEV_ENC_UTIL Encoder utilization (in %).
# TYPE DCGM_FI_DEV_ENC_UTIL gauge
DCGM_FI_DEV_ENC_UTIL{gpu="0",UUID="GPU-e1200d64-e918-4277-cad9-75da90b9f618",device="nvidia0",modelName="Tesla T4",Hostname="nvidia-dcgm-exporter-ml2wh",container="dcgmproftester11",namespace="default",pod="dcgmproftester"} 0
# HELP DCGM_FI_DEV_DEC_UTIL Decoder utilization (in %).
# TYPE DCGM_FI_DEV_DEC_UTIL gauge
DCGM_FI_DEV_DEC_UTIL{gpu="0",UUID="GPU-e1200d64-e918-4277-cad9-75da90b9f618",device="nvidia0",modelName="Tesla T4",Hostname="nvidia-dcgm-exporter-ml2wh",container="dcgmproftester11",namespace="default",pod="dcgmproftester"} 0
# HELP DCGM_FI_DEV_XID_ERRORS Value of the last XID error encountered.
# TYPE DCGM_FI_DEV_XID_ERRORS gauge
DCGM_FI_DEV_XID_ERRORS{gpu="0",UUID="GPU-e1200d64-e918-4277-cad9-75da90b9f618",device="nvidia0",modelName="Tesla T4",Hostname="nvidia-dcgm-exporter-ml2wh",container="dcgmproftester11",namespace="default",pod="dcgmproftester"} 0
# HELP DCGM_FI_DEV_FB_FREE Framebuffer memory free (in MiB).
# TYPE DCGM_FI_DEV_FB_FREE gauge
DCGM_FI_DEV_FB_FREE{gpu="0",UUID="GPU-e1200d64-e918-4277-cad9-75da90b9f618",device="nvidia0",modelName="Tesla T4",Hostname="nvidia-dcgm-exporter-ml2wh",container="dcgmproftester11",namespace="default",pod="dcgmproftester"} 15109
# HELP DCGM_FI_DEV_FB_USED Framebuffer memory used (in MiB).
# TYPE DCGM_FI_DEV_FB_USED gauge
DCGM_FI_DEV_FB_USED{gpu="0",UUID="GPU-e1200d64-e918-4277-cad9-75da90b9f618",device="nvidia0",modelName="Tesla T4",Hostname="nvidia-dcgm-exporter-ml2wh",container="dcgmproftester11",namespace="default",pod="dcgmproftester"} 0
# HELP DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL Total number of NVLink bandwidth counters for all lanes.
# TYPE DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL counter
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL{gpu="0",UUID="GPU-e1200d64-e918-4277-cad9-75da90b9f618",device="nvidia0",modelName="Tesla T4",Hostname="nvidia-dcgm-exporter-ml2wh",container="dcgmproftester11",namespace="default",pod="dcgmproftester"} 0
# HELP DCGM_FI_DEV_VGPU_LICENSE_STATUS vGPU License status
# TYPE DCGM_FI_DEV_VGPU_LICENSE_STATUS gauge
DCGM_FI_DEV_VGPU_LICENSE_STATUS{gpu="0",UUID="GPU-e1200d64-e918-4277-cad9-75da90b9f618",device="nvidia0",modelName="Tesla T4",Hostname="nvidia-dcgm-exporter-ml2wh",container="dcgmproftester11",namespace="default",pod="dcgmproftester"} 0
# HELP DCGM_FI_PROF_GR_ENGINE_ACTIVE Ratio of time the graphics engine is active (in %).
# TYPE DCGM_FI_PROF_GR_ENGINE_ACTIVE gauge
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-e1200d64-e918-4277-cad9-75da90b9f618",device="nvidia0",modelName="Tesla T4",Hostname="nvidia-dcgm-exporter-ml2wh",container="dcgmproftester11",namespace="default",pod="dcgmproftester"} 0.000000
# HELP DCGM_FI_PROF_PIPE_TENSOR_ACTIVE Ratio of cycles the tensor (HMMA) pipe is active (in %).
# TYPE DCGM_FI_PROF_PIPE_TENSOR_ACTIVE gauge
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE{gpu="0",UUID="GPU-e1200d64-e918-4277-cad9-75da90b9f618",device="nvidia0",modelName="Tesla T4",Hostname="nvidia-dcgm-exporter-ml2wh",container="dcgmproftester11",namespace="default",pod="dcgmproftester"} 0.000000
# HELP DCGM_FI_PROF_DRAM_ACTIVE Ratio of cycles the device memory interface is active sending or receiving data (in %).
# TYPE DCGM_FI_PROF_DRAM_ACTIVE gauge
DCGM_FI_PROF_DRAM_ACTIVE{gpu="0",UUID="GPU-e1200d64-e918-4277-cad9-75da90b9f618",device="nvidia0",modelName="Tesla T4",Hostname="nvidia-dcgm-exporter-ml2wh",container="dcgmproftester11",namespace="default",pod="dcgmproftester"} 0.000001
# HELP DCGM_FI_PROF_PCIE_TX_BYTES The number of bytes of active pcie tx data including both header and payload.
# TYPE DCGM_FI_PROF_PCIE_TX_BYTES counter
DCGM_FI_PROF_PCIE_TX_BYTES{gpu="0",UUID="GPU-e1200d64-e918-4277-cad9-75da90b9f618",device="nvidia0",modelName="Tesla T4",Hostname="nvidia-dcgm-exporter-ml2wh",container="dcgmproftester11",namespace="default",pod="dcgmproftester"} 31570
# HELP DCGM_FI_PROF_PCIE_RX_BYTES The number of bytes of active pcie rx data including both header and payload.
# TYPE DCGM_FI_PROF_PCIE_RX_BYTES counter
DCGM_FI_PROF_PCIE_RX_BYTES{gpu="0",UUID="GPU-e1200d64-e918-4277-cad9-75da90b9f618",device="nvidia0",modelName="Tesla T4",Hostname="nvidia-dcgm-exporter-ml2wh",container="dcgmproftester11",namespace="default",pod="dcgmproftester"} 42321

$ kubectl  top pods -n gpu-operator
NAME                                                          CPU(cores)   MEMORY(bytes)   
gpu-feature-discovery-d88z9                                   1m           7Mi             
gpu-operator-db9b746c6-59d98                                  2m           16Mi            
gpu-operator-node-feature-discovery-master-6c65c99969-2gl4q   4m           10Mi            
gpu-operator-node-feature-discovery-worker-h5ldb              1m           9Mi             
nvidia-container-toolkit-daemonset-j88dd                      0m           30Mi            
nvidia-dcgm-exporter-ml2wh                                    5m           132Mi           
nvidia-device-plugin-daemonset-4slfv                          1m           15Mi            
nvidia-driver-daemonset-qvz54                                 0m           1300Mi          
nvidia-operator-validator-wv8jz                               0m           0Mi 
happy2048 commented 2 years ago

@shivamerla Was a sample GPU application running when you tested this case? It may not be reproducible if no program is using the GPU, and it can take a few days to see the results (change the env DCGM_EXPORTER_INTERVAL to generate metrics more quickly). My sample GPU application is https://github.com/tensorflow/benchmarks/tree/cnn_tf_v2.1_compatible and the YAML is:

apiVersion: batch/v1
kind: Job
metadata:
  name: tensorflow-benchmark
spec:
  parallelism: 1
  template:
    metadata:
      labels:
        app: tensorflow-benchmark
    spec:
      containers:
      - name: tensorflow-benchmark
        image: registry.cn-hongkong.aliyuncs.com/ai-samples/gpushare-sample:benchmark-tensorflow-2.2.3
        command:
        - bash
        - run.sh
        - --num_batches=500000000
        - --batch_size=8
        resources:
          limits:
            nvidia.com/gpu: 1
        workingDir: /root
      restartPolicy: Never
shivamerla commented 2 years ago

I had run a Jupyter notebook during my tests, and now I am running the same workload you are. I have changed the collection interval too. Memory went up a bit, but has been stable since then. I deployed it multiple times. I will keep monitoring this and check again, and I will raise an internal bug to track this and update you if I see the same issue.


ubuntu@ip-172-31-42-254:~$ kubectl  get pods
NAME                         READY   STATUS    RESTARTS   AGE
tensorflow-benchmark-z6kkr   1/1     Running   0          75s

ubuntu@ip-172-31-42-254:~$ sudo chroot /run/nvidia/driver nvidia-smi
Wed May 11 04:56:17 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   61C    P0    69W /  70W |   8308MiB / 15109MiB |     98%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   1976478      C   python                           8305MiB |
+-----------------------------------------------------------------------------+
ubuntu@ip-172-31-42-254:~$ 
ubuntu@ip-172-31-42-254:~$ kubectl  top pods -n gpu-operator
NAME                                                          CPU(cores)   MEMORY(bytes)   
gpu-feature-discovery-d88z9                                   0m           7Mi             
gpu-operator-db9b746c6-59d98                                  2m           18Mi            
gpu-operator-node-feature-discovery-master-6c65c99969-2gl4q   4m           13Mi            
gpu-operator-node-feature-discovery-worker-h5ldb              1m           10Mi            
nvidia-container-toolkit-daemonset-j88dd                      0m           30Mi            
nvidia-dcgm-exporter-ps8rq                                    5m           145Mi           
nvidia-device-plugin-daemonset-4slfv                          1m           16Mi            
nvidia-driver-daemonset-qvz54                                 0m           1259Mi          
nvidia-operator-validator-wv8jz                               0m           0Mi   
      containers:
      - env:
        - name: DCGM_EXPORTER_LISTEN
          value: :9400
        - name: DCGM_EXPORTER_KUBERNETES
          value: "true"
        - name: DCGM_EXPORTER_COLLECTORS
          value: /etc/dcgm-exporter/dcp-metrics-included.csv
        - name: DCGM_EXPORTER_INTERVAL
          value: "6000"
        image: nvcr.io/nvidia/k8s/dcgm-exporter:2.3.5-2.6.5-ubuntu20.04
        imagePullPolicy: IfNotPresent
        name: nvidia-dcgm-exporter
        ports:
shivamerla commented 2 years ago

@happy2048 Can you try with the UBI image on CentOS and verify whether this still happens? nvcr.io/nvidia/k8s/dcgm-exporter:2.3.5-2.6.5-ubi8
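
One way to switch the exporter image is through the ClusterPolicy, for example (a sketch; the ClusterPolicy name and the dcgmExporter field layout follow the chart defaults and may differ per install):

# Sketch: point dcgm-exporter at the UBI8 image
kubectl edit clusterpolicy cluster-policy
#   -> under spec.dcgmExporter, set:
#        repository: nvcr.io/nvidia/k8s
#        image: dcgm-exporter
#        version: 2.3.5-2.6.5-ubi8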

So far no luck reproducing on Ubuntu systems, so I am going to try CentOS to match your setup.

happy2048 commented 2 years ago

OK, I will test with the UBI image.

happy2048 commented 2 years ago

@shivamerla I have tested the image nvcr.io/nvidia/k8s/dcgm-exporter:2.3.5-2.6.5-ubi8 and the result is:

Sat May 14 04:19:02 UTC 2022
# kubectl top po -n nvidia | grep nvidia-dcgm
nvidia-dcgm-exporter-2rt62                                    3m           145Mi
nvidia-dcgm-exporter-745n2                                    7m           154Mi

Mon May 16 02:15:17 UTC 2022
# kubectl top po -n nvidia | grep nvidia-dcgm
nvidia-dcgm-exporter-2rt62                                    3m           171Mi
nvidia-dcgm-exporter-745n2                                    9m           188Mi

Wed May 18 11:04:48 UTC 2022
# kubectl top po -n nvidia | grep nvidia-dcgm
nvidia-dcgm-exporter-2rt62                                    3m           195Mi
nvidia-dcgm-exporter-745n2                                    7m           223Mi

Mon May 23 03:04:21 UTC 2022
# kubectl top po -n nvidia | grep nvidia-dcgm
nvidia-dcgm-exporter-2rt62                                    2m           242Mi
nvidia-dcgm-exporter-745n2                                    10m          286Mi

Wed May 25 12:33:29 UTC 2022
# kubectl top po -n nvidia | grep nvidia-dcgm
nvidia-dcgm-exporter-2rt62                                    3m           265Mi
nvidia-dcgm-exporter-745n2                                    9m           320Mi

Thu May 26 07:06:28 UTC 2022
# kubectl top po -n nvidia | grep nvidia-dcgm
nvidia-dcgm-exporter-2rt62                                    4m           273Mi
nvidia-dcgm-exporter-745n2                                    9m           328Mi

image

shivamerla commented 2 years ago

@glowkey @dualvtable Any additional information we can gather to reproduce this internally?

glowkey commented 2 years ago

Which metrics are being watched? What is DCGM_EXPORTER_INTERVAL set to? In general, what are all the changes from a default installation? This information would help us.
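
The commands below are one way to gather those details (the nvidia namespace and DaemonSet name are taken from this thread and may differ in other installs):

# Sketch: dump the exporter env and the metrics CSV actually in use
kubectl -n nvidia get ds nvidia-dcgm-exporter \
    -o jsonpath='{.spec.template.spec.containers[0].env}'
kubectl -n nvidia exec ds/nvidia-dcgm-exporter -- \
    cat /etc/dcgm-exporter/dcp-metrics-included.csv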

happy2048 commented 2 years ago

@glowkey I used the default csv file (/etc/dcgm-exporter/dcp-metrics-included.csv, not changed), and the env DCGM_EXPORTER_INTERVAL is set to 6000.

image