Dimss opened this issue 4 years ago
I have exactly the same issue right now.
Same issue in EKS cluster now.
Same with on-premise cluster
Same with on-premise OKD 4.4 cluster
Well, I think I solved the problem.
@tanrobotix can you share your solution?
In my case, the /etc/docker/daemon.json is:
{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
The default-runtime is not set, and the runtime path should be absolute:
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
Then reload the systemd daemon and restart the Docker service:
systemctl daemon-reload
systemctl restart docker
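To confirm the change took effect, one quick check (not from the original comment, just a standard Docker query) is to ask Docker for its default runtime:

docker info --format '{{.DefaultRuntime}}'
# should print: nvidia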
I am having the same issue with an on-premise cluster (1 VM-based master node, 2 DGX Station GPU nodes) set up via Ansible and DeepOps:
https://github.com/NVIDIA/deepops/tree/master/docs/k8s-cluster
$:~/deepops# kubectl get nodes
NAME     STATUS                        ROLES    AGE   VERSION
gpu01    NotReady,SchedulingDisabled   <none>   2d    v1.18.9
gpu02    Ready                         <none>   23h   v1.18.9
mgmt01   Ready                         master   2d    v1.18.9
$~/deepops# kubectl get pods
NAME                                                              READY   STATUS             RESTARTS   AGE
dcgm-exporter-1608298867-jstgm                                    0/1     CrashLoopBackOff   242        19h
dcgm-exporter-1608298867-n52g2                                    1/1     Running            0          19h
gpu-operator-774ff7994c-gdpdl                                     1/1     Running            29         29h
gpu-test                                                          0/1     Terminating        0          23h
ingress-nginx-controller-6b4fdfdcf7-sb5hs                         1/1     Running            0          28h
nvidia-gpu-operator-node-feature-discovery-master-7d88b984j9grb   1/1     Running            3          29h
nvidia-gpu-operator-node-feature-discovery-worker-jpc24           1/1     Running            49         29h
nvidia-gpu-operator-node-feature-discovery-worker-lzn58           1/1     Running            26         29h
nvidia-gpu-operator-node-feature-discovery-worker-wfqgn           0/1     CrashLoopBackOff   195        19h
$~/deepops# kubectl logs pod/dcgm-exporter-1608298867-jstgm
time="2020-12-19T09:21:20Z" level=info msg="Starting dcgm-exporter"
Error: Failed to initialize NVML
time="2020-12-19T09:21:20Z" level=fatal msg="Error starting nv-hostengine: DCGM initialization error"
The /etc/docker/daemon.json, however, looks fine (and is identical) on both GPU nodes. Not sure whether the persistent NotReady status of one of the nodes is related to this dcgm-exporter issue or not:
{
  "default-runtime": "nvidia",
  "default-shm-size": "1G",
  "default-ulimits": {
    "memlock": {
      "hard": -1,
      "name": "memlock",
      "soft": -1
    },
    "stack": {
      "hard": 67108864,
      "name": "stack",
      "soft": 67108864
    }
  },
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
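One check I can still run on each GPU node is starting a container directly with the nvidia runtime, outside of Kubernetes (the CUDA image tag below is just an example and may need adjusting):

# run on each GPU node; should print the nvidia-smi table if the runtime works
docker run --rm --runtime=nvidia nvidia/cuda:11.0-base nvidia-smi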
Many thanks in advance for any advice :)
Hi. Based on this Stack Overflow question, we solved it using:
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: dcgm-exporter
  namespace: monitoring
  labels:
    app.kubernetes.io/name: dcgm-exporter
    app.kubernetes.io/instance: dcgm-exporter
    app.kubernetes.io/version: "2.1.1"
    app.kubernetes.io/component: "dcgm-exporter"
---
apiVersion: v1
kind: Service
metadata:
  name: dcgm-exporter
  namespace: monitoring
  labels:
    app.kubernetes.io/name: dcgm-exporter
    app.kubernetes.io/instance: dcgm-exporter
    app.kubernetes.io/version: "2.1.1"
    app.kubernetes.io/component: "dcgm-exporter"
spec:
  type: ClusterIP
  ports:
  - name: "metrics"
    port: 9400
    targetPort: 9400
    protocol: TCP
  selector:
    app.kubernetes.io/name: dcgm-exporter
    app.kubernetes.io/instance: dcgm-exporter
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: monitoring
  labels:
    app.kubernetes.io/name: dcgm-exporter
    app.kubernetes.io/instance: dcgm-exporter
    app.kubernetes.io/version: "2.1.1"
    app.kubernetes.io/component: "dcgm-exporter"
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      app.kubernetes.io/name: dcgm-exporter
      app.kubernetes.io/instance: dcgm-exporter
      app.kubernetes.io/component: "dcgm-exporter"
  template:
    metadata:
      labels:
        app.kubernetes.io/name: dcgm-exporter
        app.kubernetes.io/instance: dcgm-exporter
        app.kubernetes.io/component: "dcgm-exporter"
    spec:
      serviceAccountName: dcgm-exporter
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cloud.google.com/gke-accelerator
                operator: Exists
      containers:
      - env:
        - name: DCGM_EXPORTER_LISTEN
          value: ":9400"
        - name: DCGM_EXPORTER_KUBERNETES
          value: "true"
        image: nvidia/dcgm-exporter:2.0.13-2.1.1-ubuntu18.04
        imagePullPolicy: IfNotPresent
        name: dcgm-exporter
        ports:
        - containerPort: 9400
          name: metrics
          protocol: TCP
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/lib/kubelet/pod-resources
          name: pod-gpu-resources
          readOnly: true
        - mountPath: /usr/local/nvidia
          name: nvidia-install-dir-host
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        operator: Exists
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      volumes:
      - hostPath:
          path: /var/lib/kubelet/pod-resources
          type: ""
        name: pod-gpu-resources
      - hostPath:
          path: /home/kubernetes/bin/nvidia
          type: ""
        name: nvidia-install-dir-host
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: monitoring
  labels:
    app.kubernetes.io/name: dcgm-exporter
    app.kubernetes.io/instance: dcgm-exporter
    app.kubernetes.io/version: "2.1.1"
    app.kubernetes.io/component: "dcgm-exporter"
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: dcgm-exporter
      app.kubernetes.io/instance: dcgm-exporter
      app.kubernetes.io/component: "dcgm-exporter"
  endpoints:
  - port: "metrics"
    path: "/metrics"
    interval: "15s"
To make this work with all cloud providers, use a node affinity based on labels (this assumes you have at least one user-defined label on the GPU nodes):
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: type
          operator: In
          values:
          - label-key1
          - label-key2
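Once the DaemonSet is running, a quick sanity check that the exporter is actually serving metrics (not part of the original answer, just assuming the manifest above with the monitoring namespace and port 9400):

kubectl -n monitoring get pods -l app.kubernetes.io/name=dcgm-exporter
kubectl -n monitoring port-forward svc/dcgm-exporter 9400:9400 &
curl -s http://localhost:9400/metrics | head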
Trying the suggested solution here of:
* adding `securityContext.privileged=true`
* adding the `nvidia-install-dir-host` hostPath volume + volumeMount

We've seen that this resolved the issue for GKE using `nvcr.io/nvidia/k8s/dcgm-exporter:2.0.13-2.1.1-ubuntu18.04`, but with the more recent `nvcr.io/nvidia/k8s/dcgm-exporter:2.1.4-2.3.1-ubuntu18.04` it's still broken. We've opted to downgrade for now, of course.

Tried downgrading dcgm-exporter from `2.2.9-2.4.1-ubuntu20.04` to `2.0.13-2.1.1-ubuntu18.04`, and setting `securityContext.privileged=true` - it keeps failing.
Events:
Type      Reason     Age                  From                 Message
----      ------     ----                 ----                 -------
Normal    Scheduled  48s                  default-scheduler    Successfully assigned monitoring-system/dcgm-exporter-8z8n5 to gke-np-epo-sentalign-euwe4a-gke--gpus-0fd942e9-zom5
Normal    Pulling    48s                  kubelet              Pulling image "nvcr.io/nvidia/k8s/dcgm-exporter:2.0.13-2.1.1-ubuntu18.04"
Normal    Pulled     37s                  kubelet              Successfully pulled image "nvcr.io/nvidia/k8s/dcgm-exporter:2.0.13-2.1.1-ubuntu18.04" in 10.688555255s
Normal    Created    16s (x3 over 34s)    kubelet              Created container exporter
Normal    Started    16s (x3 over 34s)    kubelet              Started container exporter
Normal    Pulled     16s (x2 over 33s)    kubelet              Container image "nvcr.io/nvidia/k8s/dcgm-exporter:2.0.13-2.1.1-ubuntu18.04" already present on machine
Warning   BackOff    12s (x5 over 32s)    kubelet              Back-off restarting failed container
❯ kubectl -n monitoring-system logs -p dcgm-exporter-8z8n5
time="2021-09-14T07:03:10Z" level=info msg="Starting dcgm-exporter"
Error: Failed to initialize NVML
time="2021-09-14T07:03:10Z" level=fatal msg="Error starting nv-hostengine: DCGM initialization error"
Also tried adding all volumes and volumeMounts from the nvidia-gpu-device-plugin DaemonSet, which is added by GKE - that didn't fix the problem.
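(For anyone trying the same thing: the volumes and volumeMounts were copied from the GKE-managed DaemonSet roughly like this; the nvidia-gpu-device-plugin name in kube-system is what our GKE cluster uses and may differ on yours:)

kubectl -n kube-system get daemonset nvidia-gpu-device-plugin \
  -o jsonpath='{.spec.template.spec.volumes}'
kubectl -n kube-system get daemonset nvidia-gpu-device-plugin \
  -o jsonpath='{.spec.template.spec.containers[0].volumeMounts}'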
> Trying the suggested solution here of:
> * Adding `securityContext.privileged=true`
> * Adding `nvidia-install-dir-host` hostPath volume + volumeMount
>
> We've seen that this resolved the issue for GKE using `nvcr.io/nvidia/k8s/dcgm-exporter:2.0.13-2.1.1-ubuntu18.04`, but with the more recent `nvcr.io/nvidia/k8s/dcgm-exporter:2.1.4-2.3.1-ubuntu18.04` it's still broken. We've opted to downgrade for now, of course
It works for me.
I've got a GKE cluster with a GPU node pool. The GPU nodes have valid labels, the nvidia device plugin pods are running on each GPU node, and the nvidia driver daemon set was deployed as well. Kubernetes detects 1 allocatable GPU. However, when I deploy
kubectl create -f https://raw.githubusercontent.com/NVIDIA/gpu-monitoring-tools/2.0.0-rc.12/dcgm-exporter.yaml
the dcgm-exporter pod goes into CrashLoopBackOff with the same "Failed to initialize NVML" error. The exact same setup works on AKS and EKS clusters without any issue.
Is there any limitation to using dcgm-exporter on GKE?
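For context, the allocatable GPU count was checked roughly like this (the node name is a placeholder):

kubectl get node <gpu-node-name> \
  -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'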