DataDog / datadog-agent

Main repository for Datadog Agent
https://docs.datadoghq.com/
Apache License 2.0
2.88k stars 1.21k forks

Cluster Agent does not report any status checks or up/down signals #8730

Open tonglil opened 3 years ago

tonglil commented 3 years ago

Describe what happened: The cluster agent doesn't report any metric or check that indicates its up/down status, the way the regular agents do (datadog.agent.up).


This makes it difficult or impossible to track and understand when cluster agents go down.

Describe what you expected:

A check or other metric to indicate if a cluster agent is up or down, so we can alert on it.

A check or other metric to indicate what checks the cluster agent is performing.

The (cluster) agents should provide enough telemetry to understand what they are actually doing, since Datadog itself is a "Cloud Monitoring Service".

As it stands now, there's just not enough information about the agents or the platform usage for meta-monitoring.

There should also be documentation that describes what these checks and metrics are reporting. If you search Google for "datadog.cluster_agent.prometheus.health" or the other checks and metrics the cluster agent reports, there are 0 results for documentation or any explanation of what they are.


Steps to reproduce the issue:

n/a

Additional environment details (Operating System, Cloud provider, etc):

n/a

Simwar commented 3 years ago

Hi @tonglil

You can use this metric to see how many cluster agents are running and where: datadog.cluster_agent.running

There are also other datadog.cluster_agent metrics that you can use, especially the go_memstats ones. As you pointed out, they are not documented.
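
As a rough illustration, an alert on that metric could be defined along these lines. This is only a sketch under assumptions: it assumes datadog.cluster_agent.running reports at least 1 per running cluster agent, that the metric carries a cluster_name tag, and that the payload would be submitted through the monitor API. The helper name build_cluster_agent_monitor, the 5-minute window, and the 10-minute no-data timeframe are hypothetical choices, not documented defaults.

```python
import json


def build_cluster_agent_monitor(cluster_name: str, window: str = "last_5m") -> dict:
    """Build a monitor definition that fires when no cluster agent reports as running.

    Hypothetical helper: it only assembles the payload and does not call any API.
    """
    # Assumption: the metric is tagged with cluster_name; adjust the scope to your tags.
    scope = "{cluster_name:" + cluster_name + "}"
    # Fires if the highest value seen in the window is below 1 (no agent running).
    query = f"max({window}):sum:datadog.cluster_agent.running{scope} < 1"
    return {
        "name": f"Cluster Agent down in {cluster_name}",
        "type": "metric alert",
        "query": query,
        "message": "No Cluster Agent is reporting datadog.cluster_agent.running.",
        "options": {
            # A cluster agent that is down stops sending the metric entirely,
            # so missing data has to be treated as an alert condition too.
            "notify_no_data": True,
            "no_data_timeframe": 10,  # minutes; assumed value
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_cluster_agent_monitor("prod"), indent=2))
```

The no-data option matters more than the threshold here: if every cluster agent is gone, nothing emits the metric, so a plain threshold would never evaluate.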

tonglil commented 3 years ago

Thanks @Simwar. I'd like to know when cluster agents are down, like the regular agent.