gardener / machine-controller-manager

Declarative way of managing machines for Kubernetes cluster
Apache License 2.0

Improve MCM log messages/status for meltdown scenario #742

Closed himanshu-kun closed 10 months ago

himanshu-kun commented 2 years ago

How to categorize this issue?

/area ops-productivity /kind enhancement /priority 2

What would you like to be added: Currently, the meltdown scenario does not produce logs that are clear enough for an external user to figure out that meltdown control is taking place. At present, the MCM logs only contain lines like these:

I0802 09:58:48.435461       1 machine_util.go:1427] machineDeployName="shoot--prod-gcp--in30-prod-operator-z1" for machine="shoot--prod-gcp--in30-prod-operator-z1-76d9d-f67h6" , terminating=0 , failed=0 , pending=1 , noPhase=0 , crashLooping=0 , extraCountedProgress=0
.
.
.
I0802 10:01:03.444084       1 machine_util.go:1427] machineDeployName="shoot--prod-gcp--in30-prod-operator-z1" for machine="shoot--prod-gcp--in30-prod-operator-z1-76d9d-f67h6" , terminating=0 , failed=0 , pending=0 , noPhase=0 , crashLooping=0 , extraCountedProgress=0

This needs to be improved. A new status field could also be considered (this needs to be looked at as part of #724). The Playbook also needs to be updated to describe the scenarios in which meltdown can happen.

Why is this needed: Clearer, more understandable logs.

himanshu-kun commented 1 year ago

Post grooming Discussion

The problem arose because the dev-ops couldn't figure out why an Unknown machine was not turning Failed, or what these weird logs meant.

We need to solve it by enhancing the logs enough to help the dev-ops know that MCM is throttling the conversion of Unknown machines to Failed and is following the meltdown logic. We should also update the Playbook so that DoDs are informed about the meltdown scenario. It can be a link to our docs.
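
To make the request concrete, here is a minimal Go sketch of what a more explicit log message might look like. It is only an illustration, not the actual MCM code: `meltdownCounts`, `meltdownLogMessage` and the `limit` parameter are hypothetical names, and treating the sum of all logged counters as the "in progress" number is an assumption made for the example.

```go
package main

import "fmt"

// meltdownCounts holds the counters that appear in the existing log line.
// Hypothetical type, introduced only for this sketch.
type meltdownCounts struct {
	Terminating, Failed, Pending, NoPhase, CrashLooping, ExtraCountedProgress int
}

// inProgress sums all counters.
// ASSUMPTION: the real meltdown logic may weigh these counters differently.
func (c meltdownCounts) inProgress() int {
	return c.Terminating + c.Failed + c.Pending + c.NoPhase +
		c.CrashLooping + c.ExtraCountedProgress
}

// meltdownLogMessage is a hypothetical helper that says in plain words whether
// the Unknown -> Failed conversion is being throttled. limit is an assumed
// threshold for machines in progress per machine deployment.
func meltdownLogMessage(deploy, machine string, c meltdownCounts, limit int) string {
	n := c.inProgress()
	if n >= limit {
		return fmt.Sprintf("meltdown control active: not marking Unknown machine %q as Failed yet; "+
			"machine deployment %q already has %d machine(s) in progress (limit %d)",
			machine, deploy, n, limit)
	}
	return fmt.Sprintf("meltdown control inactive: Unknown machine %q can be marked Failed; "+
		"machine deployment %q has %d machine(s) in progress (limit %d)",
		machine, deploy, n, limit)
}

func main() {
	deploy := "shoot--prod-gcp--in30-prod-operator-z1"
	machine := "shoot--prod-gcp--in30-prod-operator-z1-76d9d-f67h6"

	// Counters from the first log line in the issue (pending=1).
	fmt.Println(meltdownLogMessage(deploy, machine, meltdownCounts{Pending: 1}, 1))
	// Counters from the second log line in the issue (all zero).
	fmt.Println(meltdownLogMessage(deploy, machine, meltdownCounts{}, 1))
}
```

A message of this shape names the throttling directly instead of leaving the reader to interpret raw counters, and it could sit alongside the existing counter line rather than replace it.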

himanshu-kun commented 10 months ago

/close