gardener / dependency-watchdog

This controller checks the status of etcd and restarts control plane components which are in a state of crashloop-backoff over an extensive period of time.
Apache License 2.0
4 stars 28 forks

Exclude terminating and Failed machines/nodes from DWD probe calculation of Failed Leases #123

Open ashwani2k opened 6 months ago

ashwani2k commented 6 months ago

How to categorize this issue?

/area disaster-recovery /area robustness /kind enhancement /priority 1

What would you like to be added: Currently, the DWD failed-lease calculation counts all nodes when arriving at the nodeLeaseFailureFraction. This can be misleading in cases such as:

  1. Nodes in the Terminating or Failed phase, especially when they take the entire machineDrainTimeout because of PDBs or other pod eviction issues.
  2. Machines in crashLoopBackOff, which may never get created; including them in the count might not be correct either.

Why is this needed: To prevent DWD from mistakenly initiating meltdown protection for clusters that quickly hit the nodeLeaseFailureFraction when nodes/machines remain in either of the two cases above for a prolonged time. Also observed in issue-live(4796).
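
To make the request concrete, here is a rough sketch of the calculation with the exclusion applied. It is illustrative only and is not DWD's actual code: the `ProbeTarget` type, the `lease_failure_fraction` function and the phase strings are hypothetical names, and it assumes each node can be mapped to the phase of its backing machine.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set

# Hypothetical phase names, assumed for illustration only.
EXCLUDED_PHASES = frozenset({"Terminating", "Failed", "CrashLoopBackOff"})


@dataclass
class ProbeTarget:
    """A node as seen by the probe, with the phase of its backing machine."""
    name: str
    lease_expired: bool
    machine_phase: Optional[str] = None  # None if no machine phase is known


def lease_failure_fraction(
    targets: Iterable[ProbeTarget],
    excluded_phases: Set[str] = EXCLUDED_PHASES,
) -> float:
    """Return the fraction of expired leases among the targets that count.

    Targets whose machine phase is in excluded_phases are dropped from both
    the numerator and the denominator.
    """
    counted = [t for t in targets if t.machine_phase not in excluded_phases]
    if not counted:
        # Nothing left to judge, so report no failures.
        return 0.0
    expired = sum(1 for t in counted if t.lease_expired)
    return expired / len(counted)


targets = [
    ProbeTarget("node-a", lease_expired=False, machine_phase="Running"),
    ProbeTarget("node-b", lease_expired=False, machine_phase="Running"),
    ProbeTarget("node-c", lease_expired=True, machine_phase="Terminating"),
    ProbeTarget("node-d", lease_expired=True, machine_phase="Failed"),
]

with_exclusion = lease_failure_fraction(targets)
without_exclusion = lease_failure_fraction(targets, excluded_phases=set())
```

In this example two of the four nodes have expired leases, but both belong to machines that are terminating or failed. Counting all nodes gives a fraction of 0.5, while excluding them gives 0.0, so only the first variant could push the cluster toward the configured nodeLeaseFailureFraction.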

gardener-ci-robot commented 2 months ago

The Gardener project currently lacks enough active contributors to adequately respond to all issues. This bot triages issues according to the following rules:

You can:

/lifecycle stale

unmarshall commented 2 months ago

/remove-lifecycle stale