gardener / machine-controller-manager

Declarative way of managing machines for Kubernetes cluster
Apache License 2.0

Improve Monitoring/Alerting/Metrics #211

Open PadmaB opened 5 years ago

PadmaB commented 5 years ago

Story

As a provider, I want timely alerts raised from the metrics so that I can make informed decisions.

Motivation

Acceptance Criteria

Definition of Done

Possible metrics to add (rough work)

prashanth26 commented 5 years ago

I have tried to expose at least a few crucial metrics to the Gardener Prometheus for now. See https://github.com/gardener/gardener/pull/948.

However, before we try to create a dashboard and raise alerts, we need to enhance all the metrics so that they always return a value instead of a blank (as currently happens with, for example, mcm_cloud_api_requests_failed_total, mcm_cloud_api_requests_total, and mcm_machine_deployment_failed_machines). See https://github.com/gardener/gardener/pull/948#issuecomment-485757761

prashanth26 commented 3 years ago

/touch /priority critical

vlerenc commented 3 years ago

Changing the roadmap classification, as this ticket speaks of "ops" and MCM metrics. This is more internal than end-user facing, although one can argue that MCM appears in our exposed monitoring. If you don't agree, please change it back, @hardikdr. It was just a gut feeling that this is maybe more relevant internally than externally.

hardikdr commented 3 years ago

Sure, sounds good. The major part of it is for internal usage, and only an aspect is for end-users where we want to offer better observability for the worker-machines.

prashanth26 commented 3 years ago

Adding feedback from https://github.com/gardener/machine-controller-manager/issues/549, https://github.com/gardener/machine-controller-manager/issues/528

elankath commented 1 year ago

We need to introduce metrics for the following cases:

gardener-robot commented 1 year ago

@elankath You have mentioned internal references in the public. Please check.