anaconda2196 opened this issue 3 years ago
@anaconda2196 I'm having the exact same issue. The last version that works with the helm chart is 2.1.2. I get:
Readiness probe failed: HTTP probe failed with statuscode: 503
@anaconda2196 Changing initialDelaySeconds to 30 for the daemon set liveness and readiness probes fixes the issue. I'm not sure if the increased startup time is a bug or a byproduct of the updates. If it's expected, initialDelaySeconds should be updated in the daemon set template for the helm chart.
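(For reference, a minimal sketch of what that probe change looks like in the daemon set pod spec. The /health endpoint on port 9400 matches the probe failures reported later in this thread; the exact field layout in the chart's templates/daemonset.yaml may differ.)

livenessProbe:
  httpGet:
    path: /health             # endpoint the kubelet probes
    port: 9400
  initialDelaySeconds: 30     # suggested value; the chart's default is lower (assumption)
readinessProbe:
  httpGet:
    path: /health
    port: 9400
  initialDelaySeconds: 30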
Hi @nick-oconnor,
It seems like the latest updates in https://github.com/NVIDIA/gpu-monitoring-tools/tree/master/deployment/dcgm-exporter have some issues.
The pod is crash-looping:
https://nvidia.github.io/gpu-monitoring-tools/helm-charts/dcgm-exporter-2.2.0.tgz (this one worked for me, but only after I changed initialDelaySeconds and periodSeconds to 59)
Also, I am following the https://docs.nvidia.com/datacenter/cloud-native/kubernetes/dcgme2e.html#setting-up-dcgm documentation; it needs to be updated.
Thank you.
@anaconda2196 I noticed that 2nd error too. Looks like v2.3.1 was never pushed to nvidia's repo.
Yep, noticed this as well. @RenaudWasTaken can you/someone push v2.3.1 to the repo? This is breaking Kubernetes builds that follow the NVIDIA instructions.
Same here: the image nvcr.io/nvidia/k8s/dcgm-exporter:2.1.4-2.3.1-ubuntu18.04 does not exist but is used in the helm chart. On Docker Hub the image is available as nvidia/dcgm-exporter:2.1.4-2.3.1-ubuntu18.04.
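(Until the tag lands on nvcr.io, one possible stopgap is pointing the chart at the Docker Hub image via values.yaml. A sketch, assuming the chart exposes the usual image.repository and image.tag fields:)

image:
  repository: nvidia/dcgm-exporter      # Docker Hub, instead of nvcr.io/nvidia/k8s/dcgm-exporter
  tag: 2.1.4-2.3.1-ubuntu18.04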
Hi, @Legion2 @anaconda2196 @nick-oconnor. We have pushed the images to nvcr.io. Please check if the problem has been resolved.
@elezar Perfect, it works now.
Great! @anaconda2196 please check if your original issue has been resolved and close the issue accordingly.
Tested today and my pod was crash-looping. After some debugging, it looks like dcgm-exporter collects metrics every 30s, which is much longer than the probe window (on some pretty ancient hardware, though).
I've passed this to the helm values file:

# need to set it low so that readiness/liveness probes succeed
extraEnv:
  - name: "DCGM_EXPORTER_INTERVAL"
    value: "5000"

and now it works.
@nvtkaszpir good find! I'll experiment. FYI, DCGM_EXPORTER_INTERVAL is in milliseconds; 10 seems very low.
Oh right, this is in milliseconds; yeah, 10 ms is overkill. I edited the initial comment to make it 10s.
@nvtkaszpir it seems happy on my setup at 5000 and not happy at 10000. That makes sense given the period of the check is 5s.
OK, well, for me it works with 10s, but I guess the probes run every 5s and you can sometimes get unlucky, so maybe DCGM_EXPORTER_INTERVAL 5000 is OK?
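(To make the timing explicit: a probe fails a pod after roughly periodSeconds x failureThreshold. With the 5s period mentioned above and the Kubernetes default failureThreshold of 3, that is about a 15s window; treating these as the chart's defaults is an assumption. A sketch of the values.yaml setting with that arithmetic spelled out:)

# DCGM_EXPORTER_INTERVAL is in milliseconds:
#   30000 ms (the exporter default)  -> can exceed a ~15s probe failure window on slow hardware
#    5000 ms                         -> comfortably inside the window
extraEnv:
  - name: "DCGM_EXPORTER_INTERVAL"
    value: "5000"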
Just a small issue that I also faced previously with the 2.2.0 release: I changed initialDelaySeconds and periodSeconds to 59, otherwise I get liveness and readiness probe failed errors.
Can we change this at the source in this repo? Can someone test whether 30s is enough and report back, or whether we need to go up to 59s, and maybe make it a helm value?
https://gitlab.com/nvidia/container-toolkit/gpu-monitoring-tools/-/merge_requests/58
I have a quick question: I have created a K8s cluster and added 1 GPU machine. The dcgm-exporter pod is running properly and collecting GPU metrics, but values.yaml has:

nodeSelector: {}
# node: gpu

so how does it know to deploy the pod only on the GPU node?
@anaconda2196 you need gpu-feature-discovery; it will add labels to the nodes: https://github.com/NVIDIA/gpu-feature-discovery
Then redeploy nvidia-device-plugin and dcgm-exporter with a node selector; this will deploy them only on the nodes where an NVIDIA card is detected (NVIDIA's PCI vendor ID is 10de):

nodeSelector:
  feature.node.kubernetes.io/pci-10de.present: 'true'
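(After gpu-feature-discovery has run, the node object should carry that label; an illustrative excerpt of the node metadata, with the exact label set depending on configuration:)

# Excerpt of a labeled GPU node (illustrative)
metadata:
  labels:
    feature.node.kubernetes.io/pci-10de.present: "true"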
Thanks for the reply, but my question is that in my K8s cluster the nvidia-device-plugin pod was already running before I deployed dcgm-exporter.
My K8s cluster has 2 nodes: 1 master (non-GPU machine) | 1 worker (4x Tesla GPU machine).
Now, after deploying kube-prometheus-stack and then dcgm-exporter, the dcgm-exporter pod is running properly and collecting GPU metrics (only one dcgm-exporter pod is deployed).
But values.yaml still has:

nodeSelector: {}
# node: gpu

So, without a node selector set, how does it currently know to deploy only on the GPU node?
If you have only two nodes, one of which is master-only, then the master node probably has a taint preventing ordinary pods from being scheduled on it. As a result, the only other node will run the pods.
[Note: the nvidia-device-plugin pod is deployed whenever I create a K8s cluster; deploying dcgm-exporter is my choice.]
So, what changes do I have to make in values.yaml if I create a K8s cluster like this?
Case 1: 1 master (GPU machine) | 1 worker (GPU machine)
Case 2: 1 master (non-GPU machine) | 2 workers (GPU machines)
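(A sketch of values.yaml covering both cases, assuming the chart exposes a tolerations field as most daemon set charts do. For Case 2 the nodeSelector alone is enough, since both GPU nodes are ordinary workers; for Case 1 the pod must also tolerate the master taint, whose key varies by Kubernetes version:)

nodeSelector:
  feature.node.kubernetes.io/pci-10de.present: 'true'
tolerations:
  # Case 1 only: allow scheduling onto the tainted GPU master node.
  # On newer clusters the taint key is node-role.kubernetes.io/control-plane.
  - key: "node-role.kubernetes.io/master"
    operator: "Exists"
    effect: "NoSchedule"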
Hi @elezar,
For the last few hours I have been getting an ImagePullBackOff issue. I am using dcgm-exporter-2.2.0:
Normal   Scheduled  80s                default-scheduler  Successfully assigned kube-system/dcgm-exporter-8rjvs to "xxx-myserver--xxx"
Warning  Failed     34s (x2 over 64s)  kubelet            Failed to pull image "nvcr.io/nvidia/k8s/dcgm-exporter:2.1.4-2.2.0-ubuntu18.04": rpc error: code = Unknown desc = Error response from daemon: Get https://nvcr.io/v2/nvidia/k8s/dcgm-exporter/manifests/2.1.4-2.2.0-ubuntu18.04: Get https://nvcr.io/proxy_auth?scope=repository%3Anvidia%2Fk8s%2Fdcgm-exporter%3Apull: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning  Failed     34s (x2 over 64s)  kubelet            Error: ErrImagePull
Normal   BackOff    20s (x2 over 64s)  kubelet            Back-off pulling image "nvcr.io/nvidia/k8s/dcgm-exporter:2.1.4-2.2.0-ubuntu18.04"
Warning  Failed     20s (x2 over 64s)  kubelet            Error: ImagePullBackOff
Normal   Pulling    9s (x3 over 79s)   kubelet            Pulling image "nvcr.io/nvidia/k8s/dcgm-exporter:2.1.4-2.2.0-ubuntu18.04"
@anaconda2196 this is an issue with http://nvcr.io/ registry, and this is not really related to the initial git issue.
Yeah, I know, but what would be a possible alternative solution for this?
Using your own private registry as a mirror, but I guess it's a bit too late for that :D
I prefer to wait till the http://nvcr.io/ registry is up again! :D
@anaconda2196 Changing initialDelaySeconds to 30 for the daemon set liveness and readiness probes fixes the issue. I'm not sure if the increased startup time is a bug or a byproduct of the updates. If it's expected, initialDelaySeconds should be updated in the daemon set template for the helm chart.
It's confusing, but after I changed initialDelaySeconds in daemonset.yaml under the templates directory, the pod runs.
I'm using helm version 3.5.2, Kubernetes cluster 1.19.5.
kubectl logs <pod>:
time="2021-03-05T07:35:54Z" level=info msg="Starting dcgm-exporter" time="2021-03-05T07:35:54Z" level=info msg="DCGM successfully initialized!" time="2021-03-05T07:35:54Z" level=info msg="Not collecting DCP metrics: Error getting supported metrics: Profiling is not supported for this group of GPUs or GPU" time="2021-03-05T07:35:54Z" level=warning msg="Skipping line 55 ('DCGM_FI_PROF_GR_ENGINE_ACTIVE'): DCP metrics not enabled" time="2021-03-05T07:35:54Z" level=warning msg="Skipping line 58 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): DCP metrics not enabled" time="2021-03-05T07:35:54Z" level=warning msg="Skipping line 59 ('DCGM_FI_PROF_DRAM_ACTIVE'): DCP metrics not enabled" time="2021-03-05T07:35:54Z" level=warning msg="Skipping line 63 ('DCGM_FI_PROF_PCIE_TX_BYTES'): DCP metrics not enabled" time="2021-03-05T07:35:54Z" level=warning msg="Skipping line 64 ('DCGM_FI_PROF_PCIE_RX_BYTES'): DCP metrics not enabled" time="2021-03-05T07:35:54Z" level=info msg="Kubernetes metrics collection enabled!" time="2021-03-05T07:35:54Z" level=info msg="Starting webserver" time="2021-03-05T07:35:54Z" level=info msg="Pipeline starting"
kubectl describe pod "xyz":
Type     Reason     Age                From               Message
Normal   Scheduled  83s                default-scheduler  Successfully assigned default/dcgm-exporter-xyz-xyz to "my SERVER"
Warning  Unhealthy  48s (x3 over 68s)  kubelet            Readiness probe failed: HTTP probe failed with statuscode: 503
Normal   Killing    41s (x2 over 61s)  kubelet            Container exporter failed liveness probe, will be restarted
Normal   Pulling    40s (x3 over 83s)  kubelet            Pulling image "nvcr.io/nvidia/k8s/dcgm-exporter:2.1.4-2.3.0-ubuntu18.04"
Normal   Pulled     38s (x3 over 81s)  kubelet            Successfully pulled image "nvcr.io/nvidia/k8s/dcgm-exporter:2.1.4-2.3.0-ubuntu18.04"
Normal   Created    38s (x3 over 81s)  kubelet            Created container exporter
Normal   Started    38s (x3 over 80s)  kubelet            Started container exporter
Warning  Unhealthy  38s                kubelet            Readiness probe failed: Get http://x.x.x.x:9400/health: dial tcp x.x.x.x:9400: connect: connection refused
Warning  Unhealthy  31s (x7 over 71s)  kubelet            Liveness probe failed: HTTP probe failed with statuscode: 503