Open PadmaB opened 5 years ago
I have tried to at least expose a few crucial metrics into the Gardener Prometheus for now. Refer - https://github.com/gardener/gardener/pull/948.
However, we will need to further enhance the metrics so that they always return values rather than blank results (for example mcm_cloud_api_requests_failed_total, mcm_cloud_api_requests_total, and mcm_machine_deployment_failed_machines). This applies to all the metrics and should be done before trying to create a dashboard and raise alerts. Refer - https://github.com/gardener/gardener/pull/948#issuecomment-485757761
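To illustrate the "always return values" point: a labelled counter typically exposes a series only after that label combination has been touched once, so a query for a failure counter returns nothing until the first failure happens. Below is a minimal, dependency-free Go sketch of zero-initialising the expected series up front. It is a model of the idea only, not the Prometheus client library or the MCM implementation; the `counterVec` type, the `service` label name, and the label values `virtual_machine` and `disks` are all hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// counterVec is a hypothetical stand-in for a counter with one label.
// It only models which series are visible at scrape time.
type counterVec struct {
	name   string
	label  string
	values map[string]float64
}

func newCounterVec(name, label string) *counterVec {
	return &counterVec{name: name, label: label, values: map[string]float64{}}
}

// Init creates the series for each given label value with a count of zero,
// so they are exposed before any event has been observed.
// Existing counts are left untouched.
func (c *counterVec) Init(labelValues ...string) {
	for _, v := range labelValues {
		if _, ok := c.values[v]; !ok {
			c.values[v] = 0
		}
	}
}

// Inc adds one to the series for labelValue, creating it if needed.
func (c *counterVec) Inc(labelValue string) {
	c.values[labelValue]++
}

// Expose renders the known series in a Prometheus-like text form,
// sorted by label value so the output is deterministic.
func (c *counterVec) Expose() string {
	keys := make([]string, 0, len(c.values))
	for k := range c.values {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	var b strings.Builder
	for _, k := range keys {
		fmt.Fprintf(&b, "%s{%s=%q} %g\n", c.name, c.label, k, c.values[k])
	}
	return b.String()
}

func main() {
	failed := newCounterVec("mcm_cloud_api_requests_failed_total", "service")

	// Without initialisation nothing is exposed: this is the "blank" case.
	fmt.Printf("before Init: %q\n", failed.Expose())

	// Assumed (hypothetical) set of label values known at startup.
	failed.Init("virtual_machine", "disks")
	failed.Inc("disks")

	fmt.Print(failed.Expose())
}
```

With a real metrics library the same effect is usually achieved by touching each expected label combination once at startup. Whether the full set of label values is known in advance for every MCM metric is an open question this sketch does not answer.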
/touch /priority critical
Changing the roadmap classification as this ticket speaks of "ops" and MCM metrics. This is more internal than end-user facing, although one can argue that MCM appears in our exposed monitoring. If you don't agree, please change it back, @hardikdr. It was just a gut feeling that this is maybe more relevant internally than externally.
Sure, sounds good. The major part of it is for internal usage, and only an aspect is for end-users where we want to offer better observability for the worker-machines.
Adding feedback from https://github.com/gardener/machine-controller-manager/issues/549, https://github.com/gardener/machine-controller-manager/issues/528
We need to introduce metrics for the following cases:
Failed
so the name needs to be changed from failed_machines to something like failed_last_operation_machines. We need an alternate metric for users.
stale_machines_total metric name to stale_machines_removed_total, https://github.com/gardener/machine-controller-manager/pull/808#discussion_r1218369226
requests_failed_total, requests_total in different mcm-provider, it's currently exposed without update
@elankath You have mentioned internal references in the public. Please check.
Story
As a provider I want timely alerts raised based on the metrics to take informed decisions
Motivation
Acceptance Criteria
Definition of Done
Possible metrics to add (Rough work)