Datadog Cluster Agent timed out while getting external metrics

We have an HPA configured with Datadog metrics. It worked fine for a while, and then the HPA started failing with:

unable to fetch metrics from external metrics API: external metrics invalid

We captured the following errors in the Cluster Agent logs:

~$ stern datadog-cluster-agent --context=pod12-readonly | grep ERROR
+ datadog-cluster-agent-846db5687-zldfg › datadog-cluster-agent
+ datadog-cluster-agent-846db5687-k2rr8 › datadog-cluster-agent
datadog-cluster-agent-846db5687-zldfg datadog-cluster-agent 2019-08-07 22:04:10 UTC | CLUSTER | ERROR | (pkg/clusteragent/custommetrics/provider.go:92 in externalMetricsSetter) | Could not list the external metrics in the store: Get https://10.231.0.1:443/api/v1/namespaces/default/configmaps/datadog-custom-metrics?timeout=10s: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
datadog-cluster-agent-846db5687-zldfg datadog-cluster-agent 2019-08-07 22:04:40 UTC | CLUSTER | ERROR | (pkg/clusteragent/custommetrics/provider.go:92 in externalMetricsSetter) | Could not list the external metrics in the store: Get https://10.231.0.1:443/api/v1/namespaces/default/configmaps/datadog-custom-metrics?timeout=10s: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
datadog-cluster-agent-846db5687-zldfg datadog-cluster-agent 2019-08-07 22:05:10 UTC | CLUSTER | ERROR | (pkg/clusteragent/custommetrics/provider.go:92 in externalMetricsSetter) | Could not list the external metrics in the store: Get https://10.231.0.1:443/api/v1/namespaces/default/configmaps/datadog-custom-metrics?timeout=10s: context deadline exceeded
datadog-cluster-agent-846db5687-zldfg datadog-cluster-agent 2019-08-07 22:05:10 UTC | CLUSTER | ERROR | (pkg/clusteragent/custommetrics/provider.go:126 in externalMetricsSetter) | Timeout while processing the collection of external metrics
+ datadog-cluster-agent-846db5687-lt2pm › datadog-cluster-agent
datadog-cluster-agent-846db5687-k2rr8 datadog-cluster-agent 2019-08-07 22:04:03 UTC | CLUSTER | ERROR | (pkg/clusteragent/custommetrics/provider.go:92 in externalMetricsSetter) | Could not list the external metrics in the store: Get https://10.231.0.1:443/api/v1/namespaces/default/configmaps/datadog-custom-metrics?timeout=10s: context deadline exceeded
datadog-cluster-agent-846db5687-k2rr8 datadog-cluster-agent 2019-08-07 22:04:33 UTC | CLUSTER | ERROR | (pkg/clusteragent/custommetrics/provider.go:92 in externalMetricsSetter) | Could not list the external metrics in the store: Get https://10.231.0.1:443/api/v1/namespaces/default/configmaps/datadog-custom-metrics?timeout=10s: context deadline exceeded
datadog-cluster-agent-846db5687-k2rr8 datadog-cluster-agent 2019-08-07 22:05:03 UTC | CLUSTER | ERROR | (pkg/clusteragent/custommetrics/provider.go:92 in externalMetricsSetter) | Could not list the external metrics in the store: Get https://10.231.0.1:443/api/v1/namespaces/default/configmaps/datadog-custom-metrics?timeout=10s: context deadline exceeded
datadog-cluster-agent-846db5687-k2rr8 datadog-cluster-agent 2019-08-07 22:05:03 UTC | CLUSTER | ERROR | (pkg/clusteragent/custommetrics/provider.go:126 in externalMetricsSetter) | Timeout while processing the collection of external metrics
datadog-cluster-agent-846db5687-lt2pm datadog-cluster-agent 2019-08-07 22:03:44 UTC | CLUSTER | ERROR | (pkg/util/kubernetes/apiserver/hpa_controller.go:171 in updateExternalMetrics) | Error while retrieving external metrics from the store: Get https://10.231.0.1:443/api/v1/namespaces/default/configmaps/datadog-custom-metrics?timeout=10s: dial tcp 10.231.0.1:443: connect: connection refused
datadog-cluster-agent-846db5687-lt2pm datadog-cluster-agent 2019-08-07 22:03:44 UTC | CLUSTER | ERROR | (pkg/clusteragent/custommetrics/provider.go:92 in externalMetricsSetter) | Could not list the external metrics in the store: Get https://10.231.0.1:443/api/v1/namespaces/default/configmaps/datadog-custom-metrics?timeout=10s: dial tcp 10.231.0.1:443: connect: connection refused
datadog-cluster-agent-846db5687-lt2pm datadog-cluster-agent 2019-08-07 22:03:45 UTC | CLUSTER | ERROR | (pkg/collector/runner/runner.go:294 in work) | Error running check kubernetes_apiserver: Failed to watch events: Get https://10.231.0.1:443/api/v1/events?resourceVersion=245985654&timeout=10s&watch=true: dial tcp 10.231.0.1:443: connect: connection refused
datadog-cluster-agent-846db5687-lt2pm datadog-cluster-agent 2019-08-07 22:04:20 UTC | CLUSTER | ERROR | (pkg/collector/runner/runner.go:294 in work) | Error running check kubernetes_apiserver: Failed to watch events: Get https://10.231.0.1:443/api/v1/events?resourceVersion=245985654&timeout=10s&watch=true: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
datadog-cluster-agent-846db5687-lt2pm datadog-cluster-agent 2019-08-07 22:04:24 UTC | CLUSTER | ERROR | (pkg/util/kubernetes/apiserver/hpa_controller.go:171 in updateExternalMetrics) | Error while retrieving external metrics from the store: Get https://10.231.0.1:443/api/v1/namespaces/default/configmaps/datadog-custom-metrics?timeout=10s: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
datadog-cluster-agent-846db5687-lt2pm datadog-cluster-agent 2019-08-07 22:04:24 UTC | CLUSTER | ERROR | (pkg/clusteragent/custommetrics/provider.go:92 in externalMetricsSetter) | Could not list the external metrics in the store: Get https://10.231.0.1:443/api/v1/namespaces/default/configmaps/datadog-custom-metrics?timeout=10s: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
datadog-cluster-agent-846db5687-lt2pm datadog-cluster-agent 2019-08-07 22:04:54 UTC | CLUSTER | ERROR | (pkg/clusteragent/custommetrics/provider.go:92 in externalMetricsSetter) | Could not list the external metrics in the store: Get https://10.231.0.1:443/api/v1/namespaces/default/configmaps/datadog-custom-metrics?timeout=10s: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
datadog-cluster-agent-846db5687-lt2pm datadog-cluster-agent 2019-08-07 22:04:54 UTC | CLUSTER | ERROR | (pkg/clusteragent/custommetrics/provider.go:126 in externalMetricsSetter) | Timeout while processing the collection of external metrics
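
All three replicas fail against the same address (10.231.0.1:443) in the same window: `lt2pm` first gets `connection refused` at 22:03:44, then every pod reports timeouts until 22:05:10. To check whether a recurrence follows the same pattern on every replica, a small triage filter over the `stern` output can help. This is a minimal sketch; `classify` and `podOf` are hypothetical helpers, and the cause buckets are only the substrings that appear in the excerpt above.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// classify maps one Cluster Agent ERROR line to a coarse cause, using
// substrings copied from the log excerpt in this issue.
func classify(line string) string {
	switch {
	case strings.Contains(line, "connect: connection refused"):
		return "refused" // TCP connect to the API server address was rejected
	case strings.Contains(line, "Client.Timeout exceeded while awaiting headers"):
		return "client-timeout" // no response within the client timeout
	case strings.Contains(line, "context deadline exceeded"):
		return "deadline" // request deadline expired before completion
	case strings.Contains(line, "Timeout while processing the collection of external metrics"):
		return "collection-timeout" // the setter gave up on the whole collection
	default:
		return "other"
	}
}

// podOf returns the first field of a stern line, which is the pod name.
func podOf(line string) string {
	if fields := strings.Fields(line); len(fields) > 0 {
		return fields[0]
	}
	return ""
}

func main() {
	counts := map[string]map[string]int{} // pod -> cause -> count
	sc := bufio.NewScanner(os.Stdin)
	sc.Buffer(make([]byte, 0, 64*1024), 1024*1024) // ERROR lines are long
	for sc.Scan() {
		line := sc.Text()
		if !strings.Contains(line, "| ERROR |") {
			continue // skip stern's "+ pod › container" banners and non-error lines
		}
		pod := podOf(line)
		if counts[pod] == nil {
			counts[pod] = map[string]int{}
		}
		counts[pod][classify(line)]++
	}
	if err := sc.Err(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for pod, byCause := range counts {
		fmt.Println(pod, byCause)
	}
}
```

Usage, assuming the file is saved as `triage.go` (hypothetical name): `stern datadog-cluster-agent --context=pod12-readonly | go run triage.go`.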

The agent status was all green. We tried to collect a flare, but it does not seem to work:

Asking the Cluster Agent to build the flare archive.
/tmp/datadog-agent-2019-08-08-00-53-44.zip is going to be uploaded to Datadog
Are you sure you want to upload a flare? [Y/N]
Y
An unknown error has occurred - Please contact support by email.
Error: unexpected end of JSON input
Usage:
  datadog-cluster-agent flare [caseID] [flags]

Flags:
  -e, --email string   Your email
  -h, --help           help for flare
  -s, --send           Automatically send flare (don't prompt for confirmation)

Global Flags:
  -c, --cfgpath string   path to directory containing datadog.yaml
  -n, --no-color         disable color output

Error: unexpected end of JSON input
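
`unexpected end of JSON input` is the message Go's `encoding/json` returns when it is asked to decode an empty or truncated payload. That suggests, as an assumption the output above does not confirm, that the `flare` command received an empty or cut-off reply from the Cluster Agent rather than a JSON error document. A minimal sketch reproducing the message; `decodeReply` and `decodeReplyStrict` are hypothetical helpers, not the datadog-cluster-agent source.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
)

// decodeReply unmarshals a JSON reply without checking it first,
// so an empty body surfaces as a generic syntax error. Hypothetical helper.
func decodeReply(body []byte) (map[string]string, error) {
	var out map[string]string
	err := json.Unmarshal(body, &out)
	return out, err
}

// decodeReplyStrict reports an empty body explicitly instead of
// returning encoding/json's "unexpected end of JSON input".
func decodeReplyStrict(body []byte) (map[string]string, error) {
	if len(bytes.TrimSpace(body)) == 0 {
		return nil, errors.New("empty reply body: the server returned no JSON")
	}
	return decodeReply(body)
}

func main() {
	// nil = empty body; `{"key":` = arbitrary truncated payload; last one is valid.
	for _, body := range [][]byte{nil, []byte(`{"key":`), []byte(`{"key":"v"}`)} {
		_, err := decodeReply(body)
		_, strictErr := decodeReplyStrict(body)
		fmt.Printf("%q -> %v | strict: %v\n", body, err, strictErr)
	}
}
```

Both the empty and the truncated payload yield the same message from `decodeReply`, which is why the CLI error alone cannot tell the two cases apart.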

I had to restart datadog-cluster-agent to recover from this issue.

DataDog / datadog-agent issue #3985