Cluster Agent does not report any status checks or up/down signals

tonglil commented 3 years ago

Describe what happened: The cluster agent doesn't report any information about its up/down status the way the regular agents do (datadog.agent.up).

This makes it hard or impossible to track and understand when cluster agents go down.

Describe what you expected:

A check or other metric to indicate if a cluster agent is up or down, so we can alert on it.

datadog.cluster_agent.up

A check or other metric to indicate what checks the cluster agent is performing

datadog.cluster_agent.check_status

The (cluster) agents should provide as much telemetry as possible so that users can understand what they are actually doing, since Datadog itself is a "Cloud Monitoring Service".

Generally as it stands now, there's just not enough information about the agents or the platform usage for meta-monitoring.

There should also be documentation that describes what these checks and metrics report. If you google "datadog.cluster_agent.prometheus.health" and other checks/metrics the cluster agent reports, there are 0 results for docs or anything else explaining what they are.

Steps to reproduce the issue:

n/a

Additional environment details (Operating System, Cloud provider, etc):

n/a

Simwar commented 3 years ago

Hi @tonglil

There is a metric you can use to see how many cluster agents are running and where: datadog.cluster_agent.running

There are also other datadog.cluster_agent metrics that you can use, especially the go_memstats ones. As you pointed out, they are not documented.
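As a rough sketch of how datadog.cluster_agent.running could back an alert in the meantime: the snippet below only builds a metric monitor definition as a plain dict and does not call any Datadog client. The function name, the default window, the threshold, and the message text are illustrative assumptions, not something taken from the agent or its docs. It also assumes the metric simply stops reporting when every cluster agent is down, which is why the sketch turns on no-data notifications.

```python
import json


def build_cluster_agent_down_monitor(scope="*", window="last_5m", no_data_minutes=10):
    """Build a metric monitor definition (hypothetical helper, not a Datadog API).

    scope: tag filter for the query, e.g. "*" or a tag such as
           "kube_cluster_name:prod" (that tag value is only an example).
    """
    # Sum the running gauge across all reporters and alert if it stays below 1.
    query = f"max({window}):sum:datadog.cluster_agent.running{{{scope}}} < 1"
    return {
        "name": "Cluster Agent is not running",
        "type": "metric alert",
        "query": query,
        "message": "No running cluster agent reported for this scope.",
        "options": {
            "thresholds": {"critical": 1},
            # Assumption: a fully down cluster agent reports nothing at all,
            # so missing data is treated as an alert condition too.
            "notify_no_data": True,
            "no_data_timeframe": no_data_minutes,
        },
    }


if __name__ == "__main__":
    # Print the definition so it can be reviewed before creating a monitor from it.
    print(json.dumps(build_cluster_agent_down_monitor(), indent=2))
```

This is a workaround rather than a real up/down check: it only tells you that the metric dropped below 1 or went missing for the chosen scope.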

tonglil commented 3 years ago

Thanks @Simwar. I'd like to know when cluster agents are down, like the regular agent.

DataDog / datadog-agent

Cluster Agent does not report any status checks or up/down signals #8730