Open ankilosaurus opened 5 years ago
Hi Ankit,
Thanks for reaching out! The issue is happening when trying to get the configMap the cluster agent creates to keep track of the different HPAs deployed. We try to retrieve this configMap from the APIServer and the APIServer is basically timing out because it is under pressure.
This PR will enhance the retry mechanism when these timeouts happen: https://github.com/DataDog/datadog-agent/pull/3727 This should unblock the HPA when the connection can be made to the APIServer even though it failed before. For now, we stop trying after several retries so if the APIServer accept connections again, the HPA won't work as we won't retry (if retries are expired). This will be shipped in the DCA v1.4.
Let me know if you need more details,
Regards,
Simon
We have an hpa configured with datadog metrics. It was working fine for a while and then hpa started failing due to:
Captured following errors in cluster agent logs:
agent status was all green. Tried to collect flare but it seems to not work.
I had to restart datadog-cluster-agent to recover from this issue.