Open tonglil opened 3 years ago
Hi @tonglil
There is a metric you can use to see how many cluster agents are running and where: datadog.cluster_agent.running
There are also other datadog.cluster_agent metrics that you can use, especially the go_memstats ones. As you pointed out, they are not documented.
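As a rough illustration of how datadog.cluster_agent.running could be used for up/down alerting, here is a minimal sketch that only builds a monitor definition as a plain dictionary. It does not call any API. The grouping tag cluster_name, the thresholds, and the helper name build_cluster_agent_monitor are assumptions for illustration, and whether the metric drops to 0 or simply stops reporting when the cluster agent dies is also not confirmed here, which is why the sketch enables a no-data notification as well.

```python
import json


def build_cluster_agent_monitor(cluster_tag="cluster_name",
                                window="last_5m",
                                no_data_minutes=10):
    """Build a hypothetical metric-monitor definition for the cluster agent.

    cluster_tag is an assumed tag name; replace it with whatever tag
    identifies a cluster in your environment.
    """
    # Alert when, over the whole window, the number of running cluster
    # agents per cluster never reached 1.
    query = (
        f"max({window}):sum:datadog.cluster_agent.running{{*}} "
        f"by {{{cluster_tag}}} < 1"
    )
    return {
        "name": "Cluster agent is not running",
        "type": "metric alert",
        "query": query,
        "message": "No cluster agent has reported as running.",
        "options": {
            # A dead cluster agent may stop sending the metric entirely,
            # so also notify when no data arrives.
            "notify_no_data": True,
            "no_data_timeframe": no_data_minutes,
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_cluster_agent_monitor(), indent=2))
```

The resulting dictionary could be pasted into a monitor definition or sent with whatever client you already use. The no-data timeframe is kept at twice the evaluation window so a single late data point does not trigger it.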
Thanks @Simwar. I'd like to know when cluster agents are down, like the regular agent.
Describe what happened: The cluster agent doesn't report any metric or check that indicates its up/down status, the way the regular agents do (datadog.agent.up). This makes tracking and understanding whether cluster agents have gone down hard or impossible.
Describe what you expected:
A check or other metric to indicate if a cluster agent is up or down, so we can alert on it.
datadog.cluster_agent.up
A check or other metric to indicate what checks the cluster agent is performing
datadog.cluster_agent.check_status
The (cluster) agents should provide as much telemetry as possible so users can understand what they are actually doing, since Datadog itself is a "Cloud Monitoring Service".
Generally, as it stands now, there's just not enough information about the agents or the platform usage for meta-monitoring.
There should also be documentation that describes what these checks and metrics are reporting. If you google for "datadog.cluster_agent.prometheus.health" and other checks/metrics the cluster agent reports, there are 0 results for docs or anything else about what they are.
Steps to reproduce the issue:
n/a
Additional environment details (Operating System, Cloud provider, etc):
n/a