DataDog / datadog-agent

Main repository for Datadog Agent
https://docs.datadoghq.com/
Apache License 2.0

[cluster-agent] When Number of External Metrics gets to 55, all become invalid. #2996

Open george-miller opened 5 years ago

george-miller commented 5 years ago

We are using the cluster-agent on Kubernetes and we have a lot of HPAs that all read from Datadog metrics to scale. These worked great while we had fewer than 55 HPAs: they were all valid and scaled our apps well.

However, when we reached 55 External Metrics, all of them became invalid over the course of about 5 seconds. I was running a watch on agent status to see what was happening, and it literally went from

      Total: 50
      Valid: 45  

to

      Total: 55
      Valid: 0

I don't want to cause our cluster any more downtime, but the behavior is really weird and I couldn't find much related to it in the logs (even with DD_LOG_LEVEL: trace). Is this a known maximum?

If needed I could make it happen again and get logs etc., but I'd rather not. Maybe you could try a repro on your side with a cluster that has >55 HPAs?

CharlyF commented 5 years ago

Hey @george-miller, thanks for reaching out! I have never tried with this many HPAs, to be honest. I'm surprised that you are hitting an issue at 55, but we would need to investigate further.

Could you open a ticket with support@datadoghq.com so we can gather as much info as possible to reproduce?

Sorry for the headache!

Best, C.

george-miller commented 5 years ago

Of course, sounds good. I will open a ticket. Thanks!!

DylanLovesCoffee commented 5 years ago

Hey George! Just to close the loop here on what Charly discovered in the support ticket: the issue comes from the high number of queries that a high number of HPAs puts into the request from the cluster-agent to our API.

In the test that we ran, the cluster-agent makes a batch query and receives an HTTP/1.1 200 OK status, but the body contains the error "Query aborted; too many subqueries attempted". Because of the 200 status, the client used by the agent doesn't raise the error.
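
For illustration, here is a rough sketch of the kind of check that catches this (the types and field names are assumptions, not the actual client code in the agent): inspect the decoded body for an error even when the HTTP status is 200.

    package metricsclient

    import (
        "encoding/json"
        "fmt"
        "net/http"
    )

    // batchResponse loosely mirrors (as an assumption) the shape of a metrics
    // batch-query response: the HTTP status can be 200 OK while the body still
    // carries an error such as "Query aborted; too many subqueries attempted".
    type batchResponse struct {
        Status string `json:"status"`
        Error  string `json:"error,omitempty"`
        // ... series payload omitted ...
    }

    // checkBatchResponse fails if either the HTTP status or the body-level
    // status reports an error. A client that only looks at resp.StatusCode
    // would treat the "Query aborted" case as a success.
    func checkBatchResponse(resp *http.Response, body []byte) error {
        if resp.StatusCode != http.StatusOK {
            return fmt.Errorf("metrics API returned HTTP %d", resp.StatusCode)
        }
        var br batchResponse
        if err := json.Unmarshal(body, &br); err != nil {
            return fmt.Errorf("could not decode batch response: %w", err)
        }
        if br.Error != "" || br.Status == "error" {
            // The silently swallowed case: 200 OK with an error in the body.
            return fmt.Errorf("metrics API returned 200 but the body reports an error: %s", br.Error)
        }
        return nil
    }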

We have a rough limit of 50 HPAs or distinct queries, which we'll work on documenting, and we'll explore returning a 400 when the query is too large. On top of that, we have backlogged work to paginate the batch queries if possible, so that there is virtually no limit on the number of HPAs used; customers would just need to ask to raise the rate limit if necessary.
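
As a rough sketch of that pagination idea (the batch size here is an assumption, not a documented value): split the distinct external-metric queries into batches that each stay under the subquery limit and issue one request per batch, merging the results afterwards.

    package metricsclient

    // maxQueriesPerBatch is an assumed per-request cap chosen to stay below the
    // backend's "too many subqueries" limit (roughly 50 distinct queries).
    const maxQueriesPerBatch = 35

    // chunkQueries splits the full set of external-metric queries into batches
    // small enough that no single request trips the subquery limit.
    func chunkQueries(queries []string, size int) [][]string {
        if size <= 0 {
            size = maxQueriesPerBatch
        }
        var batches [][]string
        for start := 0; start < len(queries); start += size {
            end := start + size
            if end > len(queries) {
                end = len(queries)
            }
            batches = append(batches, queries[start:end])
        }
        return batches
    }

Each batch would then become its own call to the metrics API, so adding HPAs only adds requests, which is why the remaining constraint is the API rate limit rather than a hard cap.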

Lastly, we would like to add logic in the upstream client so that an error in the response body is surfaced even when we receive a 200.

cohenyair commented 5 years ago

Hi All -- we have added improved error handling and logging via #4285 so that HPAs are not invalidated should a query fail for some reason. This change will be included in the upcoming 1.4.0 release of the cluster agent.
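
The general idea behind that change, in a very simplified sketch (this is not the code from #4285, and the per-query fetcher and cache shapes are assumptions): when a refresh fails, log the error and keep the last known value instead of flipping every external metric to invalid.

    package metricsclient

    import "log"

    // externalMetric holds the last value retrieved for one HPA external metric.
    type externalMetric struct {
        Query string
        Value float64
        Valid bool
    }

    // refreshMetrics updates cached metric values. On a query failure it logs
    // the error and leaves the previous value (and its validity) untouched,
    // rather than marking every metric invalid at once.
    func refreshMetrics(cache map[string]*externalMetric, fetch func(query string) (float64, error)) {
        for query, m := range cache {
            v, err := fetch(query)
            if err != nil {
                log.Printf("query %q failed, keeping last known value %v: %v", query, m.Value, err)
                continue
            }
            m.Value = v
            m.Valid = true
        }
    }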

If you need additional assistance, please open a ticket at support@datadoghq.com. Our support team can help raise any query limits as well as offer advice on configuration.

Best, Yair