NVIDIA / gpu-monitoring-tools

Tools for monitoring NVIDIA GPUs on Linux
Apache License 2.0

dcgm-exporter pod is crash-looping #161

Open anaconda2196 opened 3 years ago

anaconda2196 commented 3 years ago

I am using Helm version 3.5.2 and Kubernetes cluster 1.19.5.

kubectl logs "pod"

time="2021-03-05T07:35:54Z" level=info msg="Starting dcgm-exporter" time="2021-03-05T07:35:54Z" level=info msg="DCGM successfully initialized!" time="2021-03-05T07:35:54Z" level=info msg="Not collecting DCP metrics: Error getting supported metrics: Profiling is not supported for this group of GPUs or GPU" time="2021-03-05T07:35:54Z" level=warning msg="Skipping line 55 ('DCGM_FI_PROF_GR_ENGINE_ACTIVE'): DCP metrics not enabled" time="2021-03-05T07:35:54Z" level=warning msg="Skipping line 58 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): DCP metrics not enabled" time="2021-03-05T07:35:54Z" level=warning msg="Skipping line 59 ('DCGM_FI_PROF_DRAM_ACTIVE'): DCP metrics not enabled" time="2021-03-05T07:35:54Z" level=warning msg="Skipping line 63 ('DCGM_FI_PROF_PCIE_TX_BYTES'): DCP metrics not enabled" time="2021-03-05T07:35:54Z" level=warning msg="Skipping line 64 ('DCGM_FI_PROF_PCIE_RX_BYTES'): DCP metrics not enabled" time="2021-03-05T07:35:54Z" level=info msg="Kubernetes metrics collection enabled!" time="2021-03-05T07:35:54Z" level=info msg="Starting webserver" time="2021-03-05T07:35:54Z" level=info msg="Pipeline starting"

kubectl describe pod "xyz":

Type     Reason     Age                From               Message
----     ------     ----               ----               -------
Normal   Scheduled  83s                default-scheduler  Successfully assigned default/dcgm-exporter-xyz-xyz to "my SERVER"
Warning  Unhealthy  48s (x3 over 68s)  kubelet            Readiness probe failed: HTTP probe failed with statuscode: 503
Normal   Killing    41s (x2 over 61s)  kubelet            Container exporter failed liveness probe, will be restarted
Normal   Pulling    40s (x3 over 83s)  kubelet            Pulling image "nvcr.io/nvidia/k8s/dcgm-exporter:2.1.4-2.3.0-ubuntu18.04"
Normal   Pulled     38s (x3 over 81s)  kubelet            Successfully pulled image "nvcr.io/nvidia/k8s/dcgm-exporter:2.1.4-2.3.0-ubuntu18.04"
Normal   Created    38s (x3 over 81s)  kubelet            Created container exporter
Normal   Started    38s (x3 over 80s)  kubelet            Started container exporter
Warning  Unhealthy  38s                kubelet            Readiness probe failed: Get http://x.x.x.x:9400/health: dial tcp x.x.x.x:9400: connect: connection refused
Warning  Unhealthy  31s (x7 over 71s)  kubelet            Liveness probe failed: HTTP probe failed with statuscode: 503

nick-oconnor commented 3 years ago

@anaconda2196 I'm having the exact same issue. The last version that works with the helm chart is 2.1.2. I get:

Readiness probe failed: HTTP probe failed with statuscode: 503
nick-oconnor commented 3 years ago

@anaconda2196 Changing initialDelaySeconds to 30 for the daemon set liveness and readiness probes fixes the issue. I'm not sure if the increased startup time is a bug or a byproduct of the updates. If it's expected, initialDelaySeconds should be updated in the daemon set template for the helm chart.
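
For reference, a minimal sketch of what that change looks like in the DaemonSet's container spec (the /health path and port 9400 are taken from the probe errors above; exact defaults vary by chart version, so treat this as illustrative only):

livenessProbe:
  httpGet:
    path: /health
    port: 9400
  initialDelaySeconds: 30
  periodSeconds: 5
readinessProbe:
  httpGet:
    path: /health
    port: 9400
  initialDelaySeconds: 30
  periodSeconds: 5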

anaconda2196 commented 3 years ago

Hi @nick-oconnor

It seems like the latest updates in https://github.com/NVIDIA/gpu-monitoring-tools/tree/master/deployment/dcgm-exporter have some issues.

Pod is crash-looping: [screenshot]

https://nvidia.github.io/gpu-monitoring-tools/helm-charts/dcgm-exporter-2.2.0.tgz (This worked for me; I just changed initialDelaySeconds and periodSeconds to 59.)

Also, I am following the https://docs.nvidia.com/datacenter/cloud-native/kubernetes/dcgme2e.html#setting-up-dcgm documentation. It needs to be updated.

Thank you.

nick-oconnor commented 3 years ago

@anaconda2196 I noticed that 2nd error too. Looks like v2.3.1 was never pushed to nvidia's repo.

bhaktatejas922 commented 3 years ago

@anaconda2196 I noticed that 2nd error too. Looks like v2.3.1 was never pushed to nvidia's repo.

Yep, noticed this as well. @RenaudWasTaken can you or someone push v2.3.1 to the repo? This is breaking Kubernetes builds that follow the NVIDIA instructions.

Legion2 commented 3 years ago

Same here: the image nvcr.io/nvidia/k8s/dcgm-exporter:2.1.4-2.3.1-ubuntu18.04 does not exist but is referenced in the helm chart.

Legion2 commented 3 years ago

On Docker Hub the image is available as nvidia/dcgm-exporter:2.1.4-2.3.1-ubuntu18.04.
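
As a temporary workaround, the chart can be pointed at the Docker Hub image instead. A minimal values sketch, assuming the chart exposes the usual image.repository and image.tag keys (check your chart version's values.yaml):

image:
  repository: nvidia/dcgm-exporter
  tag: "2.1.4-2.3.1-ubuntu18.04"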

elezar commented 3 years ago

Hi, @Legion2 @anaconda2196 @nick-oconnor. We have pushed the images to nvcr.io. Please check if the problem has been resolved.

Legion2 commented 3 years ago

@elezar perfect it works now

elezar commented 3 years ago

Great! @anaconda2196 please check if your original issue has been resolved and close the issue accordingly.

nvtkaszpir commented 3 years ago

Tested today and my pod was crash-looping. After some debugging, it looks like dcgm-exporter collects metrics every 30s, which is much longer than the probe timings (on some pretty ancient hardware, though).

I've passed this in the helm values file:

# needs to be low enough for the readiness/liveness probes to succeed
# (DCGM_EXPORTER_INTERVAL is in milliseconds, so 5000 = 5s)
extraEnv:
  - name: "DCGM_EXPORTER_INTERVAL"
    value: "5000"

and now it works.

nick-oconnor commented 3 years ago

@nvtkaszpir good find! I'll experiment. FYI DCGM_EXPORTER_INTERVAL is in milliseconds. 10 seems very low.

nvtkaszpir commented 3 years ago

Oh right, this is in milliseconds; yeah, 10ms is overkill. I edited the initial comment to make it 10s.

nick-oconnor commented 3 years ago

@nvtkaszpir it seems happy on my setup at 5000 and not happy at 10000. That makes sense given the period of the check is 5s.

nvtkaszpir commented 3 years ago

OK, well, for me it works with 10s, but I guess the probes run every 5s and you can sometimes get unlucky, so maybe DCGM_EXPORTER_INTERVAL=5000 is OK?

anaconda2196 commented 3 years ago

Great! @anaconda2196 please check if your original issue has been resolved and close the issue accordingly.

Just a small issue that I also faced previously with the 2.2.0 release: I had to change initialDelaySeconds and periodSeconds to 59, otherwise I get liveness and readiness probe failure errors.

m-brgs commented 3 years ago

Can we change this in the actual source of this repo? Can someone test whether 30s is enough and report back, or whether we need to go up to 59s, and maybe make it a helm value?

https://gitlab.com/nvidia/container-toolkit/gpu-monitoring-tools/-/merge_requests/58
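
A hypothetical sketch of what exposing those timings as chart values could look like (these keys are illustrative only and may not exist in the current chart):

livenessProbe:
  initialDelaySeconds: 30
  periodSeconds: 5
readinessProbe:
  initialDelaySeconds: 30
  periodSeconds: 5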

anaconda2196 commented 3 years ago

I have a quick question/doubt: I have created a K8s cluster and added 1 GPU machine; the dcgm-exporter pod is running properly and collecting GPU metrics, but in values.yaml I have:

nodeSelector: {}
  #node: gpu

so how does it know to deploy the pod only on the GPU node?

nvtkaszpir commented 3 years ago

@anaconda2196 you need gpu-feature-discovery; it will add labels to the nodes: https://github.com/NVIDIA/gpu-feature-discovery

Then redeploy nvidia-device-plugin and dcgm-exporter with a node selector; this will deploy them only on the nodes where an NVIDIA card is detected (NVIDIA's PCI vendor ID is 10de):

nodeSelector:
  feature.node.kubernetes.io/pci-10de.present: 'true'
anaconda2196 commented 3 years ago

@anaconda2196 you need gpu-feature-discovery; it will add labels to the nodes: https://github.com/NVIDIA/gpu-feature-discovery

Then redeploy nvidia-device-plugin and dcgm-exporter with a node selector; this will deploy them only on the nodes where an NVIDIA card is detected (NVIDIA's PCI vendor ID is 10de):

nodeSelector:
  feature.node.kubernetes.io/pci-10de.present: 'true'

Thanks for the reply, but my question/doubt is that in my K8s cluster the nvidia-device-plugin pod is already running before deploying dcgm-exporter.

My K8s cluster has 2 nodes: 1 master (non-GPU machine) | 1 worker (machine with 4 Tesla GPUs)

Now, after deploying kube-prometheus-stack and then dcgm-exporter, the dcgm-exporter pod is running properly and collecting GPU metrics. (Only one dcgm-exporter pod is deployed.)

But when I checked values.yaml

nodeSelector: {}
  #node: gpu

So, without a node selector set here, how does it currently recognize that it has to deploy only on the GPU node?

nvtkaszpir commented 3 years ago

If you have only two nodes, one of which is master-only, then the master node probably has a taint preventing pods from being scheduled on it. As a result, the only other node left will run the pods.

anaconda2196 commented 3 years ago

If you have only two nodes, one of which is master-only, then the master node probably has a taint preventing pods from being scheduled on it. As a result, the only other node left will run the pods.

[Note: the nvidia-device-plugin pod is deployed whenever I create a K8s cluster; deploying dcgm-exporter or not is my choice.]

So, what changes do I have to make in values.yaml if I create a K8s cluster with:
Case 1: 1 master (GPU machine) | 1 worker (GPU machine)
Case 2: 1 master (non-GPU machine) | 2 workers (GPU machines)

nvtkaszpir commented 3 years ago

  1. In this case you need to remove the taint from the master node or add extra taint tolerations to the relevant deployments/daemonsets so that the pods can be scheduled on the master: https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/
  2. This can be done as I said before, by using NVIDIA gpu-feature-discovery and a nodeSelector to limit GPU-specific pods to nodes with GPUs; this is usually combined with the master-node taint handling if required. See the sketch below.
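
A minimal sketch covering both cases, assuming the chart exposes nodeSelector and tolerations values and that the master carries the common node-role.kubernetes.io/master taint (adjust the key to whatever taint your cluster actually uses):

nodeSelector:
  feature.node.kubernetes.io/pci-10de.present: 'true'
# tolerations are only needed for Case 1, where the master itself has a GPU
tolerations:
  - key: node-role.kubernetes.io/master
    operator: Exists
    effect: NoSchedule
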
anaconda2196 commented 3 years ago

Great! @anaconda2196 please check if your original issue has been resolved and close the issue accordingly.

Hi @elezar

For the last few hours I have been getting an ImagePullBackOff issue. I am using dcgm-exporter-2.2.0.

`Normal   Scheduled  80s                default-scheduler  Successfully assigned kube-system/dcgm-exporter-8rjvs to "xxx-myserver--xxx"
  Warning  Failed     34s (x2 over 64s)  kubelet            Failed to pull image "nvcr.io/nvidia/k8s/dcgm-exporter:2.1.4-2.2.0-ubuntu18.04": rpc error: code = Unknown desc = Error response from daemon: Get https://nvcr.io/v2/nvidia/k8s/dcgm-exporter/manifests/2.1.4-2.2.0-ubuntu18.04: Get https://nvcr.io/proxy_auth?scope=repository%3Anvidia%2Fk8s%2Fdcgm-exporter%3Apull: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
  Warning  Failed     34s (x2 over 64s)  kubelet            Error: ErrImagePull
  Normal   BackOff    20s (x2 over 64s)  kubelet            Back-off pulling image "nvcr.io/nvidia/k8s/dcgm-exporter:2.1.4-2.2.0-ubuntu18.04"
  Warning  Failed     20s (x2 over 64s)  kubelet            Error: ImagePullBackOff
  Normal   Pulling    9s (x3 over 79s)   kubelet            Pulling image "nvcr.io/nvidia/k8s/dcgm-exporter:2.1.4-2.2.0-ubuntu18.04"`
nvtkaszpir commented 3 years ago

@anaconda2196 this is an issue with the http://nvcr.io/ registry, and it is not really related to the original GitHub issue.

anaconda2196 commented 3 years ago

@anaconda2196 this is an issue with the http://nvcr.io/ registry, and it is not really related to the original GitHub issue.

Yeah, I know, but what would be a possible alternative solution for this?

nvtkaszpir commented 3 years ago

Using your own private registry as a mirror, but I guess it's a bit too late for that :D
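
For completeness, a values sketch pointing the chart at a mirrored copy of the image (registry.example.com/mirror is a placeholder, and the image.repository and image.tag keys are assumed to match your chart version):

image:
  repository: registry.example.com/mirror/dcgm-exporter
  tag: "2.1.4-2.2.0-ubuntu18.04"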

anaconda2196 commented 3 years ago

Using your own private registry as a mirror, but I guess it's a bit too late for that :D

I prefer to wait till the http://nvcr.io/ registry is up again! :D

Queetinliu commented 3 years ago

@anaconda2196 Changing initialDelaySeconds to 30 for the daemon set liveness and readiness probes fixes the issue. I'm not sure if the increased startup time is a bug or a byproduct of the updates. If it's expected, initialDelaySeconds should be updated in the daemon set template for the helm chart.

It's so confusing. I changed initialDelaySeconds in daemonset.yaml under the templates directory, and then the pod was running.