NVIDIA / gpu-monitoring-tools

Tools for monitoring NVIDIA GPUs on Linux
Apache License 2.0

dcgm-exporter failed to start on GKE cluster (v1.16.11-gke.5) #96

Open Dimss opened 4 years ago

Dimss commented 4 years ago

I have a GKE cluster with a GPU node pool. The GPU nodes have valid labels, the NVIDIA device plugin pods are running on each GPU node, and the NVIDIA driver DaemonSet was deployed as well. Kubernetes detects 1 allocatable GPU. However, when I deploy it with kubectl create -f https://raw.githubusercontent.com/NVIDIA/gpu-monitoring-tools/2.0.0-rc.12/dcgm-exporter.yaml, the dcgm-exporter pod goes into CrashLoopBackOff with the following error:

time="2020-07-21T18:49:33Z" level=info msg="Starting dcgm-exporter"
Error: Failed to initialize NVML
time="2020-07-21T18:49:33Z" level=fatal msg="Error starting nv-hostengine: DCGM initialization error"

The exact same setup works on AKS and EKS clusters without any issue.

Is there any limitation to using dcgm-exporter on GKE?
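
For context, the GPU shows up as allocatable when checked along these lines (the node name below is a placeholder):

kubectl describe node <gpu-node-name> | grep -i "nvidia.com/gpu"
kubectl -n kube-system get pods -o wide | grep -i nvidia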

Morishiri commented 4 years ago

I have exactly the same issue right now.

tamizhgeek commented 4 years ago

Same issue in EKS cluster now.

tanrobotix commented 4 years ago

Same with on-premise cluster

vizgin commented 3 years ago

Same with on-premise OKD 4.4 cluster

tanrobotix commented 3 years ago

Well, I think I solved the problem.

Dimss commented 3 years ago

@tanrobotix can you share your solution?

tanrobotix commented 3 years ago

In my case, /etc/docker/daemon.json was:

{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

The default-runtime is not set. It should be set, and the runtime path should be an absolute path:

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

Then reload the systemd daemon and restart the Docker service:

systemctl daemon-reload
systemctl restart docker
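
To confirm the change took effect (assuming Docker is the container runtime), something like

docker info | grep -i runtime

should now report "Default Runtime: nvidia" alongside the registered runtimes.
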
aaroncnb commented 3 years ago

I am having the same issue with an on-premise cluster (1 VM-based master node, 2 DGX Station GPU nodes) set up via Ansible and DeepOps:

https://github.com/NVIDIA/deepops/tree/master/docs/k8s-cluster

$:~/deepops# kubectl get nodes
NAME     STATUS                        ROLES    AGE   VERSION
gpu01    NotReady,SchedulingDisabled   <none>   2d    v1.18.9
gpu02    Ready                         <none>   23h   v1.18.9
mgmt01   Ready                         master   2d    v1.18.9
$~/deepops# kubectl get pods
NAME                                                              READY   STATUS             RESTARTS   AGE
dcgm-exporter-1608298867-jstgm                                    0/1     CrashLoopBackOff   242        19h
dcgm-exporter-1608298867-n52g2                                    1/1     Running            0          19h
gpu-operator-774ff7994c-gdpdl                                     1/1     Running            29         29h
gpu-test                                                          0/1     Terminating        0          23h
ingress-nginx-controller-6b4fdfdcf7-sb5hs                         1/1     Running            0          28h
nvidia-gpu-operator-node-feature-discovery-master-7d88b984j9grb   1/1     Running            3          29h
nvidia-gpu-operator-node-feature-discovery-worker-jpc24           1/1     Running            49         29h
nvidia-gpu-operator-node-feature-discovery-worker-lzn58           1/1     Running            26         29h
nvidia-gpu-operator-node-feature-discovery-worker-wfqgn           0/1     CrashLoopBackOff   195        19h
$~/deepops# kubectl logs pod/dcgm-exporter-1608298867-jstgm
time="2020-12-19T09:21:20Z" level=info msg="Starting dcgm-exporter"
Error: Failed to initialize NVML
time="2020-12-19T09:21:20Z" level=fatal msg="Error starting nv-hostengine: DCGM initialization error"

/etc/docker/daemon.json, however, looks fine (and identical) on both GPU nodes. I'm not sure whether the persistent NotReady status of one of the nodes is related to this dcgm-exporter issue or not:

{
    "default-runtime": "nvidia",
    "default-shm-size": "1G",
    "default-ulimits": {
        "memlock": {
            "hard": -1,
            "name": "memlock",
            "soft": -1
        },
        "stack": {
            "hard": 67108864,
            "name": "stack",
            "soft": 67108864
        }
    },
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
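
For reference, a minimal node-level sanity check (assuming SSH access to the GPU node and Docker 19.03+ with the NVIDIA container toolkit installed) would be to verify NVML outside of Kubernetes first:

# on the GPU node itself
nvidia-smi
# and through Docker, to rule out a runtime misconfiguration
# (pick a CUDA image tag that matches the installed driver)
docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi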

Many thanks in advance for any advice :)

andre-lx commented 3 years ago

Hi. Based on this Stack Overflow question, we solved it using:

apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: dcgm-exporter
  namespace: monitoring
  labels:
    app.kubernetes.io/name: dcgm-exporter
    app.kubernetes.io/instance: dcgm-exporter
    app.kubernetes.io/version: "2.1.1"
    app.kubernetes.io/component: "dcgm-exporter"
---
apiVersion: v1
kind: Service
metadata:
  name: dcgm-exporter
  namespace: monitoring
  labels:
    app.kubernetes.io/name: dcgm-exporter
    app.kubernetes.io/instance: dcgm-exporter
    app.kubernetes.io/version: "2.1.1"
    app.kubernetes.io/component: "dcgm-exporter"
spec:
  type: ClusterIP
  ports:
  - name: "metrics"
    port: 9400
    targetPort: 9400
    protocol: TCP
  selector:
    app.kubernetes.io/name: dcgm-exporter
    app.kubernetes.io/instance: dcgm-exporter
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: monitoring
  labels:
    app.kubernetes.io/name: dcgm-exporter
    app.kubernetes.io/instance: dcgm-exporter
    app.kubernetes.io/version: "2.1.1"
    app.kubernetes.io/component: "dcgm-exporter"
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      app.kubernetes.io/name: dcgm-exporter
      app.kubernetes.io/instance: dcgm-exporter
      app.kubernetes.io/component: "dcgm-exporter"
  template:
    metadata:
      labels:
        app.kubernetes.io/name: dcgm-exporter
        app.kubernetes.io/instance: dcgm-exporter
        app.kubernetes.io/component: "dcgm-exporter"
    spec:
      serviceAccountName: dcgm-exporter
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cloud.google.com/gke-accelerator
                operator: Exists
      containers:
      - env:
        - name: DCGM_EXPORTER_LISTEN
          value: :9400
        - name: DCGM_EXPORTER_KUBERNETES
          value: "true"
        image: nvidia/dcgm-exporter:2.0.13-2.1.1-ubuntu18.04
        imagePullPolicy: IfNotPresent
        name: dcgm-exporter
        ports:
        - containerPort: 9400
          name: metrics
          protocol: TCP
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/lib/kubelet/pod-resources
          name: pod-gpu-resources
          readOnly: true
        - mountPath: /usr/local/nvidia
          name: nvidia-install-dir-host
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        operator: Exists
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      volumes:
      - hostPath:
          path: /var/lib/kubelet/pod-resources
          type: ""
        name: pod-gpu-resources
      - hostPath:
          path: /home/kubernetes/bin/nvidia
          type: ""
        name: nvidia-install-dir-host
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: monitoring
  labels:
    app.kubernetes.io/name: dcgm-exporter
    app.kubernetes.io/instance: dcgm-exporter
    app.kubernetes.io/version: "2.1.1"
    app.kubernetes.io/component: "dcgm-exporter"
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: dcgm-exporter
      app.kubernetes.io/instance: dcgm-exporter
      app.kubernetes.io/component: "dcgm-exporter"
  endpoints:
  - port: "metrics"
    path: "/metrics"
    interval: "15s"
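
A minimal way to roll this out and check it (assuming the manifest above is saved as dcgm-exporter.yaml):

kubectl apply -f dcgm-exporter.yaml
kubectl -n monitoring rollout status ds/dcgm-exporter
kubectl -n monitoring port-forward svc/dcgm-exporter 9400:9400 &
curl -s localhost:9400/metrics | head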

To make this work across all cloud providers, use a node affinity based on your own labels (assuming you have at least one user-defined label on the GPU nodes), for example:

      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: type
                operator: In
                values:
                - label-key1
                - label-key2
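
For example (hypothetical label key and values, matching the matchExpressions above), the GPU nodes would be labelled with:

kubectl label node <gpu-node-name> type=label-key1
kubectl get nodes -l type=label-key1
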
omesser commented 3 years ago

Trying the suggested solution here of:

* Adding `securityContext.privileged=true`
* Adding `nvidia-install-dir-host` hostPath volume + volumeMount

We've seen that this resolved the issue for GKE using `nvcr.io/nvidia/k8s/dcgm-exporter:2.0.13-2.1.1-ubuntu18.04`, but with the more recent `nvcr.io/nvidia/k8s/dcgm-exporter:2.1.4-2.3.1-ubuntu18.04` it's still broken.
We've opted to downgrade for now, of course.

sadovnikov commented 3 years ago

Tried downgrading dcgm-exporter from 2.2.9-2.4.1-ubuntu20.04 to 2.0.13-2.1.1-ubuntu18.04, and setting securityContext.privileged=true - keeps failing.

Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  48s                default-scheduler  Successfully assigned monitoring-system/dcgm-exporter-8z8n5 to gke-np-epo-sentalign-euwe4a-gke--gpus-0fd942e9-zom5
  Normal   Pulling    48s                kubelet            Pulling image "nvcr.io/nvidia/k8s/dcgm-exporter:2.0.13-2.1.1-ubuntu18.04"
  Normal   Pulled     37s                kubelet            Successfully pulled image "nvcr.io/nvidia/k8s/dcgm-exporter:2.0.13-2.1.1-ubuntu18.04" in 10.688555255s
  Normal   Created    16s (x3 over 34s)  kubelet            Created container exporter
  Normal   Started    16s (x3 over 34s)  kubelet            Started container exporter
  Normal   Pulled     16s (x2 over 33s)  kubelet            Container image "nvcr.io/nvidia/k8s/dcgm-exporter:2.0.13-2.1.1-ubuntu18.04" already present on machine
  Warning  BackOff    12s (x5 over 32s)  kubelet            Back-off restarting failed container

❯ kubectl -n monitoring-system logs -p dcgm-exporter-8z8n5
time="2021-09-14T07:03:10Z" level=info msg="Starting dcgm-exporter"
Error: Failed to initialize NVML
time="2021-09-14T07:03:10Z" level=fatal msg="Error starting nv-hostengine: DCGM initialization error"

Also tried adding all volumes and volumeMounts from the nvidia-gpu-device-plugin DaemonSet that GKE adds, but that didn't fix the problem either.
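
For reference, those volumes can be listed with something like the following (the exact DaemonSet name may differ between GKE versions):

kubectl -n kube-system get daemonset nvidia-gpu-device-plugin -o yaml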

ciiiii commented 2 years ago

> Trying the suggested solution here of:
>
> * Adding `securityContext.privileged=true`
> * Adding `nvidia-install-dir-host` hostPath volume + volumeMount
>
> We've seen that this resolved the issue for GKE using `nvcr.io/nvidia/k8s/dcgm-exporter:2.0.13-2.1.1-ubuntu18.04`, but with the more recent `nvcr.io/nvidia/k8s/dcgm-exporter:2.1.4-2.3.1-ubuntu18.04` it's still broken.
> We've opted to downgrade for now, of course.

It works for me.