happy2048 opened this issue 2 years ago (Open)
@happy2048 Can you try the following and verify memory usage with each step to help narrow this down further?
● Run kubectl edit clusterpolicy and change dcgm.version to 2.3.4-1-ubuntu20.04. This aligns with the recent version from here.
● --set dcgm.enabled=false (dcgm-exporter will then use its embedded DCGM engine instead).
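For example, a rough sketch of the two options (assuming the chart was installed as release gpu-operator from a Helm repo named nvidia; adjust the release name, namespace, and ClusterPolicy name to your installation):

# Option 1: pin the standalone DCGM version in the ClusterPolicy
kubectl edit clusterpolicy cluster-policy        # set dcgm.version: 2.3.4-1-ubuntu20.04
# Option 2: disable the standalone DCGM pods so dcgm-exporter uses its embedded engine
helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator --reuse-values --set dcgm.enabled=false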
Ok, I will test it and report the result.
@shivamerla I updated dcgm-exporter to 2.3.5-2.6.5-ubuntu20.04, removed the env DCGM_REMOTE_HOSTENGINE_INFO to enable embedded mode, and set the GPU metrics collection interval to 6000 (to generate metrics quickly).
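For reference, a sketch of how those two changes could be applied to the exporter daemonset (assuming it is named nvidia-dcgm-exporter in the nvidia namespace; the operator may reconcile manual edits back, so the ClusterPolicy/Helm values are the durable place to set them):

kubectl -n nvidia set env daemonset/nvidia-dcgm-exporter \
  DCGM_REMOTE_HOSTENGINE_INFO- \
  DCGM_EXPORTER_INTERVAL=6000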
The CPU and memory usage over time:
Fri Apr 22 08:22:41 UTC 2022
# kubectl top po -n nvidia | grep nvidia-dcgm
nvidia-dcgm-2wvkz 1m 3Mi
nvidia-dcgm-dcmsh 1m 3Mi
nvidia-dcgm-exporter-cdxbn 8m 155Mi
nvidia-dcgm-exporter-hk25c 8m 155Mi
nvidia-dcgm-exporter-jrw4b 3m 135Mi
nvidia-dcgm-exporter-t72r6 9m 153Mi
nvidia-dcgm-exporter-wvbtm 5m 154Mi
nvidia-dcgm-g6b8d 1m 3Mi
nvidia-dcgm-jl52k 1m 5Mi
nvidia-dcgm-nbv5b 1m 3Mi
Sun Apr 24 10:57:13 UTC 2022
# kubectl top po -n nvidia | grep nvidia-dcgm
nvidia-dcgm-2wvkz 1m 3Mi
nvidia-dcgm-dcmsh 1m 3Mi
nvidia-dcgm-exporter-cdxbn 8m 194Mi
nvidia-dcgm-exporter-hk25c 8m 194Mi
nvidia-dcgm-exporter-jrw4b 3m 157Mi
nvidia-dcgm-exporter-t72r6 7m 201Mi
nvidia-dcgm-exporter-wvbtm 9m 198Mi
nvidia-dcgm-g6b8d 1m 3Mi
nvidia-dcgm-jl52k 1m 5Mi
nvidia-dcgm-nbv5b 1m 3Mi
Mon Apr 25 02:43:14 UTC 2022
# kubectl top po -n nvidia | grep nvidia-dcgm
nvidia-dcgm-2wvkz 1m 3Mi
nvidia-dcgm-dcmsh 1m 3Mi
nvidia-dcgm-exporter-cdxbn 8m 204Mi
nvidia-dcgm-exporter-hk25c 8m 204Mi
nvidia-dcgm-exporter-jrw4b 3m 163Mi
nvidia-dcgm-exporter-t72r6 8m 209Mi
nvidia-dcgm-exporter-wvbtm 10m 207Mi
nvidia-dcgm-g6b8d 1m 3Mi
nvidia-dcgm-jl52k 1m 5Mi
nvidia-dcgm-nbv5b 1m 3Mi
Mon Apr 25 11:26:01 UTC 2022
# kubectl top po -n nvidia | grep nvidia-dcgm
nvidia-dcgm-2wvkz 1m 3Mi
nvidia-dcgm-dcmsh 1m 3Mi
nvidia-dcgm-exporter-cdxbn 7m 208Mi
nvidia-dcgm-exporter-hk25c 10m 210Mi
nvidia-dcgm-exporter-jrw4b 4m 166Mi
nvidia-dcgm-exporter-t72r6 9m 212Mi
nvidia-dcgm-exporter-wvbtm 9m 211Mi
nvidia-dcgm-g6b8d 1m 3Mi
nvidia-dcgm-jl52k 1m 5Mi
nvidia-dcgm-nbv5b 1m 3Mi
Tue Apr 26 02:43:44 UTC 2022
# kubectl top po -n nvidia | grep nvidia-dcgm
nvidia-dcgm-2wvkz 1m 3Mi
nvidia-dcgm-dcmsh 1m 3Mi
nvidia-dcgm-exporter-cdxbn 7m 217Mi
nvidia-dcgm-exporter-hk25c 9m 220Mi
nvidia-dcgm-exporter-jrw4b 3m 172Mi
nvidia-dcgm-exporter-t72r6 9m 223Mi
nvidia-dcgm-exporter-wvbtm 8m 220Mi
nvidia-dcgm-g6b8d 1m 3Mi
nvidia-dcgm-jl52k 1m 5Mi
nvidia-dcgm-nbv5b 1m 3Mi
Wed Apr 27 03:04:26 UTC 2022
# kubectl top po -n nvidia | grep nvidia-dcgm
nvidia-dcgm-2wvkz 1m 3Mi
nvidia-dcgm-dcmsh 1m 3Mi
nvidia-dcgm-exporter-cdxbn 7m 231Mi
nvidia-dcgm-exporter-hk25c 9m 233Mi
nvidia-dcgm-exporter-jrw4b 5m 183Mi
nvidia-dcgm-exporter-t72r6 11m 236Mi
nvidia-dcgm-exporter-wvbtm 10m 234Mi
nvidia-dcgm-g6b8d 1m 3Mi
nvidia-dcgm-jl52k 1m 5Mi
nvidia-dcgm-nbv5b 1m 3Mi
Fri Apr 29 02:21:40 UTC 2022
# kubectl top po -n nvidia | grep nvidia-dcgm
nvidia-dcgm-2wvkz 1m 3Mi
nvidia-dcgm-dcmsh 1m 3Mi
nvidia-dcgm-exporter-cdxbn 8m 258Mi
nvidia-dcgm-exporter-hk25c 10m 261Mi
nvidia-dcgm-exporter-jrw4b 3m 202Mi
nvidia-dcgm-exporter-t72r6 10m 264Mi
nvidia-dcgm-exporter-wvbtm 8m 262Mi
nvidia-dcgm-g6b8d 1m 3Mi
nvidia-dcgm-jl52k 1m 5Mi
nvidia-dcgm-nbv5b 1m 3Mi
Thu May 5 02:08:31 UTC 2022
# kubectl top po -n nvidia | grep nvidia-dcgm
nvidia-dcgm-2wvkz 1m 3Mi
nvidia-dcgm-dcmsh 1m 3Mi
nvidia-dcgm-exporter-cdxbn 7m 339Mi
nvidia-dcgm-exporter-hk25c 10m 341Mi
nvidia-dcgm-exporter-jrw4b 4m 261Mi
nvidia-dcgm-exporter-t72r6 10m 343Mi
nvidia-dcgm-exporter-wvbtm 8m 341Mi
nvidia-dcgm-g6b8d 1m 3Mi
nvidia-dcgm-jl52k 1m 5Mi
nvidia-dcgm-nbv5b 1m 3Mi
@shivamerla In https://github.com/NVIDIA/dcgm-exporter/blob/main/pkg/dcgmexporter/dcgm.go#L83, is it correct that maxKeepAge is set to 0.0? Does 0.0 mean no limit?
@shivamerla Is there any conclusion yet?
@happy2048 We are trying to reproduce this internally. I have tried the latest 510 and 470 drivers with the above-mentioned DCGM version, but couldn't reproduce it. I will try to test on a CentOS system and verify.
ubuntu@ip-172-31-46-38:~$ helm ls -n gpu-operator
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
gpu-operator gpu-operator 1 2022-05-10 20:00:33.728500491 +0000 UTC deployed gpu-operator-v1.10.1 v1.10.1
$ sudo chroot /run/nvidia/driver nvidia-smi
Tue May 10 21:24:47 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 Off | 00000000:00:1E.0 Off | 0 |
| N/A 30C P8 14W / 70W | 0MiB / 15360MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
$ kubectl top pods -n gpu-operator
NAME CPU(cores) MEMORY(bytes)
gpu-feature-discovery-f27g2 0m 15Mi
gpu-operator-798c6ddc97-j6lx6 2m 14Mi
gpu-operator-node-feature-discovery-master-6c65c99969-ccmv8 3m 9Mi
gpu-operator-node-feature-discovery-worker-49x7z 4m 9Mi
nvidia-container-toolkit-daemonset-xl9p2 0m 8Mi
nvidia-dcgm-exporter-9nfc7 3m 141Mi
nvidia-device-plugin-daemonset-2pknr 1m 15Mi
nvidia-operator-validator-ccktf 0m 1Mi
$ curl http://10.110.55.189:9400/metrics
# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
# TYPE DCGM_FI_DEV_SM_CLOCK gauge
DCGM_FI_DEV_SM_CLOCK{gpu="0",UUID="GPU-e1200d64-e918-4277-cad9-75da90b9f618",device="nvidia0",modelName="Tesla T4",Hostname="nvidia-dcgm-exporter-ml2wh",container="dcgmproftester11",namespace="default",pod="dcgmproftester"} 300
# HELP DCGM_FI_DEV_MEM_CLOCK Memory clock frequency (in MHz).
# TYPE DCGM_FI_DEV_MEM_CLOCK gauge
DCGM_FI_DEV_MEM_CLOCK{gpu="0",UUID="GPU-e1200d64-e918-4277-cad9-75da90b9f618",device="nvidia0",modelName="Tesla T4",Hostname="nvidia-dcgm-exporter-ml2wh",container="dcgmproftester11",namespace="default",pod="dcgmproftester"} 405
# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="GPU-e1200d64-e918-4277-cad9-75da90b9f618",device="nvidia0",modelName="Tesla T4",Hostname="nvidia-dcgm-exporter-ml2wh",container="dcgmproftester11",namespace="default",pod="dcgmproftester"} 29
# HELP DCGM_FI_DEV_POWER_USAGE Power draw (in W).
# TYPE DCGM_FI_DEV_POWER_USAGE gauge
DCGM_FI_DEV_POWER_USAGE{gpu="0",UUID="GPU-e1200d64-e918-4277-cad9-75da90b9f618",device="nvidia0",modelName="Tesla T4",Hostname="nvidia-dcgm-exporter-ml2wh",container="dcgmproftester11",namespace="default",pod="dcgmproftester"} 14.685000
# HELP DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION Total energy consumption since boot (in mJ).
# TYPE DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION counter
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION{gpu="0",UUID="GPU-e1200d64-e918-4277-cad9-75da90b9f618",device="nvidia0",modelName="Tesla T4",Hostname="nvidia-dcgm-exporter-ml2wh",container="dcgmproftester11",namespace="default",pod="dcgmproftester"} 19425766
# HELP DCGM_FI_DEV_PCIE_REPLAY_COUNTER Total number of PCIe retries.
# TYPE DCGM_FI_DEV_PCIE_REPLAY_COUNTER counter
DCGM_FI_DEV_PCIE_REPLAY_COUNTER{gpu="0",UUID="GPU-e1200d64-e918-4277-cad9-75da90b9f618",device="nvidia0",modelName="Tesla T4",Hostname="nvidia-dcgm-exporter-ml2wh",container="dcgmproftester11",namespace="default",pod="dcgmproftester"} 0
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-e1200d64-e918-4277-cad9-75da90b9f618",device="nvidia0",modelName="Tesla T4",Hostname="nvidia-dcgm-exporter-ml2wh",container="dcgmproftester11",namespace="default",pod="dcgmproftester"} 0
# HELP DCGM_FI_DEV_MEM_COPY_UTIL Memory utilization (in %).
# TYPE DCGM_FI_DEV_MEM_COPY_UTIL gauge
DCGM_FI_DEV_MEM_COPY_UTIL{gpu="0",UUID="GPU-e1200d64-e918-4277-cad9-75da90b9f618",device="nvidia0",modelName="Tesla T4",Hostname="nvidia-dcgm-exporter-ml2wh",container="dcgmproftester11",namespace="default",pod="dcgmproftester"} 0
# HELP DCGM_FI_DEV_ENC_UTIL Encoder utilization (in %).
# TYPE DCGM_FI_DEV_ENC_UTIL gauge
DCGM_FI_DEV_ENC_UTIL{gpu="0",UUID="GPU-e1200d64-e918-4277-cad9-75da90b9f618",device="nvidia0",modelName="Tesla T4",Hostname="nvidia-dcgm-exporter-ml2wh",container="dcgmproftester11",namespace="default",pod="dcgmproftester"} 0
# HELP DCGM_FI_DEV_DEC_UTIL Decoder utilization (in %).
# TYPE DCGM_FI_DEV_DEC_UTIL gauge
DCGM_FI_DEV_DEC_UTIL{gpu="0",UUID="GPU-e1200d64-e918-4277-cad9-75da90b9f618",device="nvidia0",modelName="Tesla T4",Hostname="nvidia-dcgm-exporter-ml2wh",container="dcgmproftester11",namespace="default",pod="dcgmproftester"} 0
# HELP DCGM_FI_DEV_XID_ERRORS Value of the last XID error encountered.
# TYPE DCGM_FI_DEV_XID_ERRORS gauge
DCGM_FI_DEV_XID_ERRORS{gpu="0",UUID="GPU-e1200d64-e918-4277-cad9-75da90b9f618",device="nvidia0",modelName="Tesla T4",Hostname="nvidia-dcgm-exporter-ml2wh",container="dcgmproftester11",namespace="default",pod="dcgmproftester"} 0
# HELP DCGM_FI_DEV_FB_FREE Framebuffer memory free (in MiB).
# TYPE DCGM_FI_DEV_FB_FREE gauge
DCGM_FI_DEV_FB_FREE{gpu="0",UUID="GPU-e1200d64-e918-4277-cad9-75da90b9f618",device="nvidia0",modelName="Tesla T4",Hostname="nvidia-dcgm-exporter-ml2wh",container="dcgmproftester11",namespace="default",pod="dcgmproftester"} 15109
# HELP DCGM_FI_DEV_FB_USED Framebuffer memory used (in MiB).
# TYPE DCGM_FI_DEV_FB_USED gauge
DCGM_FI_DEV_FB_USED{gpu="0",UUID="GPU-e1200d64-e918-4277-cad9-75da90b9f618",device="nvidia0",modelName="Tesla T4",Hostname="nvidia-dcgm-exporter-ml2wh",container="dcgmproftester11",namespace="default",pod="dcgmproftester"} 0
# HELP DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL Total number of NVLink bandwidth counters for all lanes.
# TYPE DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL counter
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL{gpu="0",UUID="GPU-e1200d64-e918-4277-cad9-75da90b9f618",device="nvidia0",modelName="Tesla T4",Hostname="nvidia-dcgm-exporter-ml2wh",container="dcgmproftester11",namespace="default",pod="dcgmproftester"} 0
# HELP DCGM_FI_DEV_VGPU_LICENSE_STATUS vGPU License status
# TYPE DCGM_FI_DEV_VGPU_LICENSE_STATUS gauge
DCGM_FI_DEV_VGPU_LICENSE_STATUS{gpu="0",UUID="GPU-e1200d64-e918-4277-cad9-75da90b9f618",device="nvidia0",modelName="Tesla T4",Hostname="nvidia-dcgm-exporter-ml2wh",container="dcgmproftester11",namespace="default",pod="dcgmproftester"} 0
# HELP DCGM_FI_PROF_GR_ENGINE_ACTIVE Ratio of time the graphics engine is active (in %).
# TYPE DCGM_FI_PROF_GR_ENGINE_ACTIVE gauge
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-e1200d64-e918-4277-cad9-75da90b9f618",device="nvidia0",modelName="Tesla T4",Hostname="nvidia-dcgm-exporter-ml2wh",container="dcgmproftester11",namespace="default",pod="dcgmproftester"} 0.000000
# HELP DCGM_FI_PROF_PIPE_TENSOR_ACTIVE Ratio of cycles the tensor (HMMA) pipe is active (in %).
# TYPE DCGM_FI_PROF_PIPE_TENSOR_ACTIVE gauge
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE{gpu="0",UUID="GPU-e1200d64-e918-4277-cad9-75da90b9f618",device="nvidia0",modelName="Tesla T4",Hostname="nvidia-dcgm-exporter-ml2wh",container="dcgmproftester11",namespace="default",pod="dcgmproftester"} 0.000000
# HELP DCGM_FI_PROF_DRAM_ACTIVE Ratio of cycles the device memory interface is active sending or receiving data (in %).
# TYPE DCGM_FI_PROF_DRAM_ACTIVE gauge
DCGM_FI_PROF_DRAM_ACTIVE{gpu="0",UUID="GPU-e1200d64-e918-4277-cad9-75da90b9f618",device="nvidia0",modelName="Tesla T4",Hostname="nvidia-dcgm-exporter-ml2wh",container="dcgmproftester11",namespace="default",pod="dcgmproftester"} 0.000001
# HELP DCGM_FI_PROF_PCIE_TX_BYTES The number of bytes of active pcie tx data including both header and payload.
# TYPE DCGM_FI_PROF_PCIE_TX_BYTES counter
DCGM_FI_PROF_PCIE_TX_BYTES{gpu="0",UUID="GPU-e1200d64-e918-4277-cad9-75da90b9f618",device="nvidia0",modelName="Tesla T4",Hostname="nvidia-dcgm-exporter-ml2wh",container="dcgmproftester11",namespace="default",pod="dcgmproftester"} 31570
# HELP DCGM_FI_PROF_PCIE_RX_BYTES The number of bytes of active pcie rx data including both header and payload.
# TYPE DCGM_FI_PROF_PCIE_RX_BYTES counter
DCGM_FI_PROF_PCIE_RX_BYTES{gpu="0",UUID="GPU-e1200d64-e918-4277-cad9-75da90b9f618",device="nvidia0",modelName="Tesla T4",Hostname="nvidia-dcgm-exporter-ml2wh",container="dcgmproftester11",namespace="default",pod="dcgmproftester"} 42321
$ kubectl top pods -n gpu-operator
NAME CPU(cores) MEMORY(bytes)
gpu-feature-discovery-d88z9 1m 7Mi
gpu-operator-db9b746c6-59d98 2m 16Mi
gpu-operator-node-feature-discovery-master-6c65c99969-2gl4q 4m 10Mi
gpu-operator-node-feature-discovery-worker-h5ldb 1m 9Mi
nvidia-container-toolkit-daemonset-j88dd 0m 30Mi
nvidia-dcgm-exporter-ml2wh 5m 132Mi
nvidia-device-plugin-daemonset-4slfv 1m 15Mi
nvidia-driver-daemonset-qvz54 0m 1300Mi
nvidia-operator-validator-wv8jz 0m 0Mi
@shivamerla Was a sample GPU application running when you tested this case? The issue may not reproduce if no program is using the GPU, and it can take a few days to see results (I changed the env DCGM_EXPORTER_INTERVAL to generate metrics more quickly). My sample GPU application is https://github.com/tensorflow/benchmarks/tree/cnn_tf_v2.1_compatible and the YAML is:
apiVersion: batch/v1
kind: Job
metadata:
  name: tensorflow-benchmark
spec:
  parallelism: 1
  template:
    metadata:
      labels:
        app: tensorflow-benchmark
    spec:
      containers:
      - name: tensorflow-benchmark
        image: registry.cn-hongkong.aliyuncs.com/ai-samples/gpushare-sample:benchmark-tensorflow-2.2.3
        command:
        - bash
        - run.sh
        - --num_batches=500000000
        - --batch_size=8
        resources:
          limits:
            nvidia.com/gpu: 1
        workingDir: /root
      restartPolicy: Never
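To drive the GPU and sample the exporter's memory over time in the same way as the readings above, a loop like the following can be used (a sketch; it assumes the Job manifest is saved as tensorflow-benchmark.yaml and that the gpu-operator pods run in the nvidia namespace):

kubectl apply -f tensorflow-benchmark.yaml
while true; do
  date -u
  kubectl top po -n nvidia | grep nvidia-dcgm
  sleep 3600    # record a snapshot every hour
done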
I had run a Jupyter notebook during my tests, and am now running the same workload as you. I have changed the collection interval too. Memory went up a bit, but has been stable since that point. I did deploy it multiple times. I will keep monitoring this and check again, and will raise an internal bug to track this and update you if I see the same issue.
ubuntu@ip-172-31-42-254:~$ kubectl get pods
NAME READY STATUS RESTARTS AGE
tensorflow-benchmark-z6kkr 1/1 Running 0 75s
ubuntu@ip-172-31-42-254:~$ sudo chroot /run/nvidia/driver nvidia-smi
Wed May 11 04:56:17 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01 Driver Version: 470.103.01 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 On | 00000000:00:1E.0 Off | 0 |
| N/A 61C P0 69W / 70W | 8308MiB / 15109MiB | 98% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1976478 C python 8305MiB |
+-----------------------------------------------------------------------------+
ubuntu@ip-172-31-42-254:~$
ubuntu@ip-172-31-42-254:~$ kubectl top pods -n gpu-operator
NAME CPU(cores) MEMORY(bytes)
gpu-feature-discovery-d88z9 0m 7Mi
gpu-operator-db9b746c6-59d98 2m 18Mi
gpu-operator-node-feature-discovery-master-6c65c99969-2gl4q 4m 13Mi
gpu-operator-node-feature-discovery-worker-h5ldb 1m 10Mi
nvidia-container-toolkit-daemonset-j88dd 0m 30Mi
nvidia-dcgm-exporter-ps8rq 5m 145Mi
nvidia-device-plugin-daemonset-4slfv 1m 16Mi
nvidia-driver-daemonset-qvz54 0m 1259Mi
nvidia-operator-validator-wv8jz 0m 0Mi
containers:
- env:
  - name: DCGM_EXPORTER_LISTEN
    value: :9400
  - name: DCGM_EXPORTER_KUBERNETES
    value: "true"
  - name: DCGM_EXPORTER_COLLECTORS
    value: /etc/dcgm-exporter/dcp-metrics-included.csv
  - name: DCGM_EXPORTER_INTERVAL
    value: "6000"
  image: nvcr.io/nvidia/k8s/dcgm-exporter:2.3.5-2.6.5-ubuntu20.04
  imagePullPolicy: IfNotPresent
  name: nvidia-dcgm-exporter
  ports:
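Note that DCGM_EXPORTER_INTERVAL is the collection interval in milliseconds, so 6000 here means metrics are gathered every 6 seconds instead of the default 30000 ms. The value actually in effect can be double-checked on the daemonset, for example (adjust the namespace to your installation):

kubectl -n gpu-operator get ds nvidia-dcgm-exporter -o yaml | grep -A1 DCGM_EXPORTER_INTERVAL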
@happy2048 Can you try the UBI image on CentOS and verify whether this happens there as well: nvcr.io/nvidia/k8s/dcgm-exporter:2.3.5-2.6.5-ubi8.
So far no luck reproducing on Ubuntu systems, so I am going to try CentOS to match your system.
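If the exporter is managed by the operator, the image can be switched through the chart values rather than by editing the daemonset directly; a sketch, assuming the dcgmExporter.* value names used by recent gpu-operator charts:

helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator --reuse-values \
  --set dcgmExporter.version=2.3.5-2.6.5-ubi8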
Ok, I will use the UBI image to test.
@shivamerla I have tested the image nvcr.io/nvidia/k8s/dcgm-exporter:2.3.5-2.6.5-ubi8 and the result is:
Sat May 14 04:19:02 UTC 2022
# kubectl top po -n nvidia | grep nvidia-dcgm
nvidia-dcgm-exporter-2rt62 3m 145Mi
nvidia-dcgm-exporter-745n2 7m 154Mi
Mon May 16 02:15:17 UTC 2022
# kubectl top po -n nvidia | grep nvidia-dcgm
nvidia-dcgm-exporter-2rt62 3m 171Mi
nvidia-dcgm-exporter-745n2 9m 188Mi
Wed May 18 11:04:48 UTC 2022
# kubectl top po -n nvidia | grep nvidia-dcgm
nvidia-dcgm-exporter-2rt62 3m 195Mi
nvidia-dcgm-exporter-745n2 7m 223Mi
Mon May 23 03:04:21 UTC 2022
# kubectl top po -n nvidia | grep nvidia-dcgm
nvidia-dcgm-exporter-2rt62 2m 242Mi
nvidia-dcgm-exporter-745n2 10m 286Mi
Wed May 25 12:33:29 UTC 2022
# kubectl top po -n nvidia | grep nvidia-dcgm
nvidia-dcgm-exporter-2rt62 3m 265Mi
nvidia-dcgm-exporter-745n2 9m 320Mi
Thu May 26 07:06:28 UTC 2022
# kubectl top po -n nvidia | grep nvidia-dcgm
nvidia-dcgm-exporter-2rt62 4m 273Mi
nvidia-dcgm-exporter-745n2 9m 328Mi
@glowkey @dualvtable Any additional information we can gather to reproduce this internally?
Which metrics are being watched? What is DCGM_EXPORTER_INTERVAL set to? In general, what are all the changes from a default installation? This information could help us.
@glowkey I used the default CSV file (/etc/dcgm-exporter/dcp-metrics-included.csv, unchanged) and the env DCGM_EXPORTER_INTERVAL is set to 6000.
Environment
● Kubernetes: 1.20.11
● OS: CentOS 7 (3.10.0-1160.15.2.el7.x86_64)
● Docker: 19.03.15
● NVIDIA Driver Version: 470.57.02
● GPU: 4 x Tesla V100-SXM2-32GB
● GPU Operator Chart: v1.0.0-devel
● DCGM Docker Image: nvcr.io/nvidia/cloud-native/dcgm:2.2.3-ubuntu20.04
● DCGM Exporter Docker Image: nvcr.io/nvidia/k8s/dcgm-exporter:2.2.9-2.4.0-ubuntu20.04
Issue description
My k8s cluster has 5 GPU nodes:
and I deployed the gpu-operator components in the nvidia namespace:
Then I deployed some pods requesting GPUs in the default namespace; they are running on node cn-hongkong.192.168.3.71 and use https://github.com/tensorflow/benchmarks/tree/cnn_tf_v2.1_compatible inside the pod to exercise the GPU,
and I found that the dcgm pod nvidia-dcgm-gsdct is running on node cn-hongkong.192.168.3.71.
I then did nothing further in the cluster; below is the memory usage I recorded for the dcgm pods.
As you can see:
● the memory usage of pod nvidia-dcgm-gsdct increased from 109Mi to 423Mi. Why?
● if no process is using the GPUs, there is no significant change in the dcgm pod, e.g. nvidia-dcgm-6pqml.
@shivamerla