DataDog / helm-charts

Helm charts for Datadog products
Apache License 2.0

Datadog installation on GKE #747

Open containerkid opened 2 years ago

containerkid commented 2 years ago

Describe what happened: I am trying to install Datadog on my GKE cluster. After updating values.yaml, I ran the helm install command. Some pods are running, but others are stuck at 2/3 ready. When I described the pod, I saw the error "Readiness probe failed: HTTP probe failed with statuscode: 500".

datadog-cluster-agent-5c8b5f5c5b-zdc95   1/1     Running   0          5m32s
datadog-jffxg                            2/3     Running   0          5m32s

When I describe the pod, I get the following events:

Events:
  Type     Reason     Age   From               Message
  ----     ------     ----  ----               -------
  Normal   Scheduled  36s   default-scheduler  Successfully assigned default/datadog-jffxg to gke-cluster-1-default-pool-c7770b23-66ms
  Normal   Pulled     35s   kubelet            Container image "gcr.io/datadoghq/agent:7.38.2" already present on machine
  Normal   Started    35s   kubelet            Started container init-volume
  Normal   Created    35s   kubelet            Created container init-volume
  Normal   Started    34s   kubelet            Started container init-config
  Normal   Pulled     34s   kubelet            Container image "gcr.io/datadoghq/agent:7.38.2" already present on machine
  Normal   Created    34s   kubelet            Created container init-config
  Normal   Created    33s   kubelet            Created container trace-agent
  Normal   Pulled     33s   kubelet            Container image "gcr.io/datadoghq/agent:7.38.2" already present on machine
  Normal   Started    33s   kubelet            Started container agent
  Normal   Pulled     33s   kubelet            Container image "gcr.io/datadoghq/agent:7.38.2" already present on machine
  Normal   Created    33s   kubelet            Created container agent
  Normal   Started    32s   kubelet            Started container trace-agent
  Normal   Pulled     32s   kubelet            Container image "gcr.io/datadoghq/agent:7.38.2" already present on machine
  Normal   Created    32s   kubelet            Created container process-agent
  Normal   Started    32s   kubelet            Started container process-agent
  Warning  Unhealthy  6s    kubelet            Readiness probe failed: HTTP probe failed with statuscode: 500

Describe what you expected:

I expect all the pods to be running and ready.

Steps to reproduce the issue:

helm repo add datadog https://helm.datadoghq.com
helm repo update
https://github.com/DataDog/helm-charts/blob/master/charts/datadog/values.yaml

helm install <RELEASE_NAME> -f values.yaml  --set datadog.apiKey=<DATADOG_API_KEY> datadog/datadog --set targetSystem=<TARGET_SYSTEM>

Additional environment details (Operating System, Cloud provider, etc): Google Cloud GKE, version 1.23.8-gke.1900

snorrid commented 2 years ago

This is also happening on my GKE Autopilot cluster, version 1.22.12-gke.2300.

ddaktn commented 1 year ago

I just ran into this EXACT same issue (the liveness and readiness HTTP probes were both returning status code 500). When I checked my 'agent' pod logs, I could see 403s while trying to hit 'agent-http-intake.logs.datadoghq.com'. This led me down the Google rabbit hole, where I found this article: https://docs.datadoghq.com/getting_started/site/ and that was my issue. Since my portal URL was pointing to us5.datadoghq.com, I had to change the datadog.site value in my values.yaml to match.
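For anyone retracing these steps, here is a rough shell sketch of the check and the fix. The container name `agent` comes from the pod events earlier in this thread. The helper names `agent_403s` and `site_from_url` are made up for illustration. Stripping a leading `app.` from the login hostname is an assumption, so verify the result against the site table in the article linked above.

```shell
# Hypothetical helper: show agent-container log lines that mention a 403.
# Usage: agent_403s <pod-name>   (for example, agent_403s datadog-jffxg)
agent_403s() {
  kubectl logs "$1" -c agent | grep '403'
}

# Hypothetical helper: derive a datadog.site value from the URL used to log in.
# Assumption: the site is the login hostname with any leading "app." removed.
site_from_url() {
  host="${1#*://}"     # drop the scheme, if present
  host="${host%%/*}"   # drop any path
  printf '%s\n' "${host#app.}"
}
```

For example, `site_from_url https://us5.datadoghq.com/` prints `us5.datadoghq.com`, which is the value to put under `datadog.site` in values.yaml.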

isbn390 commented 1 year ago

Set datadog.clusterName and datadog.site in values.yaml, then upgrade using helm.

Ex: helm upgrade <RELEASE_NAME> datadog/datadog --set datadog.apiKey=<DATADOG_API_KEY> --values values.yaml

Hope that resolves it.
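Here is a sketch of that upgrade as a small shell function, assuming the chart keys are `datadog.clusterName` and `datadog.site` (the latter is the key named earlier in this thread). The function name `upgrade_datadog` and the `DRY_RUN` switch are hypothetical conveniences. The release name, cluster name, and site are placeholders to replace with your own values.

```shell
# Hypothetical wrapper around `helm upgrade`.
# Usage: upgrade_datadog <release> <cluster-name> <site>
# Set DRY_RUN=1 to print the command instead of running it.
upgrade_datadog() {
  # --reuse-values keeps settings from the previous release (such as the API key);
  # values.yaml and the --set flags are then applied on top.
  ${DRY_RUN:+echo} helm upgrade "$1" datadog/datadog \
    --reuse-values \
    --values values.yaml \
    --set datadog.clusterName="$2" \
    --set datadog.site="$3"
}
```

Running it once with `DRY_RUN=1` is a cheap way to confirm the final command line before touching the cluster.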

gus-yoco commented 6 months ago

Hi ddaktn, did you solve the issue? The liveness probe always fails on my side with connection refused on ports 5555 and 8126.

maddymanu commented 5 months ago

Did anybody figure this out? Still happening on GKE Autopilot 😟

ddaktn commented 5 months ago

> Hi ddaktn, did you solve the issue? The liveness probe always fails on my side with connection refused on ports 5555 and 8126.

Yes, changing it to point to the right URL did resolve my issue.