DataDog / datadog-agent

Main repository for Datadog Agent
https://docs.datadoghq.com/
Apache License 2.0

Datadog cluster agent timed out while getting external metrics #3985

Open ankilosaurus opened 5 years ago

ankilosaurus commented 5 years ago

We have an HPA configured with Datadog metrics. It worked fine for a while, and then the HPA started failing with:

unable to fetch metrics from external metrics API: external metrics invalid

We captured the following errors in the Cluster Agent logs:

~$ stern datadog-cluster-agent --context=pod12-readonly | grep ERROR
+ datadog-cluster-agent-846db5687-zldfg › datadog-cluster-agent
+ datadog-cluster-agent-846db5687-k2rr8 › datadog-cluster-agent
datadog-cluster-agent-846db5687-zldfg datadog-cluster-agent 2019-08-07 22:04:10 UTC | CLUSTER | ERROR | (pkg/clusteragent/custommetrics/provider.go:92 in externalMetricsSetter) | Could not list the external metrics in the store: Get https://10.231.0.1:443/api/v1/namespaces/default/configmaps/datadog-custom-metrics?timeout=10s: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
datadog-cluster-agent-846db5687-zldfg datadog-cluster-agent 2019-08-07 22:04:40 UTC | CLUSTER | ERROR | (pkg/clusteragent/custommetrics/provider.go:92 in externalMetricsSetter) | Could not list the external metrics in the store: Get https://10.231.0.1:443/api/v1/namespaces/default/configmaps/datadog-custom-metrics?timeout=10s: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
datadog-cluster-agent-846db5687-zldfg datadog-cluster-agent 2019-08-07 22:05:10 UTC | CLUSTER | ERROR | (pkg/clusteragent/custommetrics/provider.go:92 in externalMetricsSetter) | Could not list the external metrics in the store: Get https://10.231.0.1:443/api/v1/namespaces/default/configmaps/datadog-custom-metrics?timeout=10s: context deadline exceeded
datadog-cluster-agent-846db5687-zldfg datadog-cluster-agent 2019-08-07 22:05:10 UTC | CLUSTER | ERROR | (pkg/clusteragent/custommetrics/provider.go:126 in externalMetricsSetter) | Timeout while processing the collection of external metrics
+ datadog-cluster-agent-846db5687-lt2pm › datadog-cluster-agent
datadog-cluster-agent-846db5687-k2rr8 datadog-cluster-agent 2019-08-07 22:04:03 UTC | CLUSTER | ERROR | (pkg/clusteragent/custommetrics/provider.go:92 in externalMetricsSetter) | Could not list the external metrics in the store: Get https://10.231.0.1:443/api/v1/namespaces/default/configmaps/datadog-custom-metrics?timeout=10s: context deadline exceeded
datadog-cluster-agent-846db5687-k2rr8 datadog-cluster-agent 2019-08-07 22:04:33 UTC | CLUSTER | ERROR | (pkg/clusteragent/custommetrics/provider.go:92 in externalMetricsSetter) | Could not list the external metrics in the store: Get https://10.231.0.1:443/api/v1/namespaces/default/configmaps/datadog-custom-metrics?timeout=10s: context deadline exceeded
datadog-cluster-agent-846db5687-k2rr8 datadog-cluster-agent 2019-08-07 22:05:03 UTC | CLUSTER | ERROR | (pkg/clusteragent/custommetrics/provider.go:92 in externalMetricsSetter) | Could not list the external metrics in the store: Get https://10.231.0.1:443/api/v1/namespaces/default/configmaps/datadog-custom-metrics?timeout=10s: context deadline exceeded
datadog-cluster-agent-846db5687-k2rr8 datadog-cluster-agent 2019-08-07 22:05:03 UTC | CLUSTER | ERROR | (pkg/clusteragent/custommetrics/provider.go:126 in externalMetricsSetter) | Timeout while processing the collection of external metrics
datadog-cluster-agent-846db5687-lt2pm datadog-cluster-agent 2019-08-07 22:03:44 UTC | CLUSTER | ERROR | (pkg/util/kubernetes/apiserver/hpa_controller.go:171 in updateExternalMetrics) | Error while retrieving external metrics from the store: Get https://10.231.0.1:443/api/v1/namespaces/default/configmaps/datadog-custom-metrics?timeout=10s: dial tcp 10.231.0.1:443: connect: connection refused
datadog-cluster-agent-846db5687-lt2pm datadog-cluster-agent 2019-08-07 22:03:44 UTC | CLUSTER | ERROR | (pkg/clusteragent/custommetrics/provider.go:92 in externalMetricsSetter) | Could not list the external metrics in the store: Get https://10.231.0.1:443/api/v1/namespaces/default/configmaps/datadog-custom-metrics?timeout=10s: dial tcp 10.231.0.1:443: connect: connection refused
datadog-cluster-agent-846db5687-lt2pm datadog-cluster-agent 2019-08-07 22:03:45 UTC | CLUSTER | ERROR | (pkg/collector/runner/runner.go:294 in work) | Error running check kubernetes_apiserver: Failed to watch events: Get https://10.231.0.1:443/api/v1/events?resourceVersion=245985654&timeout=10s&watch=true: dial tcp 10.231.0.1:443: connect: connection refused
datadog-cluster-agent-846db5687-lt2pm datadog-cluster-agent 2019-08-07 22:04:20 UTC | CLUSTER | ERROR | (pkg/collector/runner/runner.go:294 in work) | Error running check kubernetes_apiserver: Failed to watch events: Get https://10.231.0.1:443/api/v1/events?resourceVersion=245985654&timeout=10s&watch=true: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
datadog-cluster-agent-846db5687-lt2pm datadog-cluster-agent 2019-08-07 22:04:24 UTC | CLUSTER | ERROR | (pkg/util/kubernetes/apiserver/hpa_controller.go:171 in updateExternalMetrics) | Error while retrieving external metrics from the store: Get https://10.231.0.1:443/api/v1/namespaces/default/configmaps/datadog-custom-metrics?timeout=10s: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
datadog-cluster-agent-846db5687-lt2pm datadog-cluster-agent 2019-08-07 22:04:24 UTC | CLUSTER | ERROR | (pkg/clusteragent/custommetrics/provider.go:92 in externalMetricsSetter) | Could not list the external metrics in the store: Get https://10.231.0.1:443/api/v1/namespaces/default/configmaps/datadog-custom-metrics?timeout=10s: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
datadog-cluster-agent-846db5687-lt2pm datadog-cluster-agent 2019-08-07 22:04:54 UTC | CLUSTER | ERROR | (pkg/clusteragent/custommetrics/provider.go:92 in externalMetricsSetter) | Could not list the external metrics in the store: Get https://10.231.0.1:443/api/v1/namespaces/default/configmaps/datadog-custom-metrics?timeout=10s: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
datadog-cluster-agent-846db5687-lt2pm datadog-cluster-agent 2019-08-07 22:04:54 UTC | CLUSTER | ERROR | (pkg/clusteragent/custommetrics/provider.go:126 in externalMetricsSetter) | Timeout while processing the collection of external metrics

The agent status output was all green. I tried to collect a flare, but that failed as well:

Asking the Cluster Agent to build the flare archive.
/tmp/datadog-agent-2019-08-08-00-53-44.zip is going to be uploaded to Datadog
Are you sure you want to upload a flare? [Y/N]
Y
An unknown error has occurred - Please contact support by email.
Error: unexpected end of JSON input
Usage:
  datadog-cluster-agent flare [caseID] [flags]

Flags:
  -e, --email string   Your email
  -h, --help           help for flare
  -s, --send           Automatically send flare (don't prompt for confirmation)

Global Flags:
  -c, --cfgpath string   path to directory containing datadog.yaml
  -n, --no-color         disable color output

Error: unexpected end of JSON input

I had to restart datadog-cluster-agent to recover from this issue.

Simwar commented 5 years ago

Hi Ankit,

Thanks for reaching out! The issue occurs when the Cluster Agent tries to get the ConfigMap it creates to keep track of the deployed HPAs. We retrieve this ConfigMap from the APIServer, and the APIServer is timing out because it is under pressure.

This PR improves the retry mechanism for these timeouts: https://github.com/DataDog/datadog-agent/pull/3727 With it, the HPA should recover as soon as a connection to the APIServer can be made again, even if earlier attempts failed. For now, we stop trying after several retries, so if the APIServer starts accepting connections again after the retries are exhausted, the HPA still won't work because we no longer retry. The fix will ship in DCA v1.4.

Let me know if you need more details,

Regards,

Simon