gardener / machine-controller-manager

A declarative way of managing machines for Kubernetes clusters
Apache License 2.0

MCM does not reset `.status.failedMachines` of MachineDeployment #456

Open rfranzke opened 4 years ago

rfranzke commented 4 years ago

What happened: MCM does not update the `.status.failedMachines` of the MachineDeployment after the `.status.lastOperation` of a Machine changes, e.g., from `Failed` to `Processing` once the credentials have been fixed:

  status:
    availableReplicas: 2
    conditions:
    - lastTransitionTime: "2020-04-29T06:51:38Z"
      lastUpdateTime: "2020-04-29T06:51:38Z"
      message: Deployment does not have minimum availability.
      reason: MinimumReplicasUnavailable
      status: "False"
      type: Available
    failedMachines:
    - lastOperation:
        description: 'Failed to list VMs while deleting the machine "shoot--foo--bar-cpu-worker-z1-5cdcb46f64-pxzp5"
          AuthFailure: AWS was not able to validate the provided access credentials
          status code: 401, request id: 6e99231c-654e-4b05-8801-310e3532b4e9'
        lastUpdateTime: "2020-04-29T06:53:33Z"
        state: Failed
        type: Delete
      name: shoot--foo--bar-cpu-worker-z1-5cdcb46f64-pxzp5
      ownerRef: shoot--foo--bar-cpu-worker-z1-5cdcb46f64
    observedGeneration: 2
  spec:
    class:
      kind: AWSMachineClass
      name: shoot--foo--bar-cpu-worker-z1-ff76e
    nodeTemplate:
      metadata:
        creationTimestamp: null
        labels:
          node.kubernetes.io/role: node
          worker.garden.sapcloud.io/group: cpu-worker
          worker.gardener.cloud/pool: cpu-worker
      spec: {}
    providerID: aws:///eu-west-1/i-05f4737c3ef646f89
  status:
    currentStatus:
      lastUpdateTime: "2020-04-29T07:41:44Z"
      phase: Pending
      timeoutActive: true
    lastOperation:
      description: Creating machine on cloud provider
      lastUpdateTime: "2020-04-29T07:41:44Z"
      state: Processing
      type: Create
    node: ip-10-250-9-55.eu-west-1.compute.internal

(compare the timestamps: the Machine's `lastOperation` moved on at 07:41:44, while `failedMachines` still shows the entry from 06:53:33)

What you expected to happen: `.status.failedMachines` is properly updated when the `.status.lastOperation` of the Machine objects changes.

hardikdr commented 4 years ago

Dupe of https://github.com/gardener/machine-controller-manager/issues/476 . Closing in favor of the other.

hardikdr commented 4 years ago

/close

ggaurav10 commented 4 years ago

476 is about the failed-machine metric counter, whereas this issue is about resetting the status field. They appear to be different issues, and hence are not duplicates.

Reopening this.

ggaurav10 commented 4 years ago

@rfranzke currently the `.status.failedMachines` field in the MachineDeployment is reset only when all the machines controlled by the MachineDeployment are healthy (running and joined to the cluster). Until then, it retains the latest set of failed machine operations. Refer:

https://github.com/gardener/machine-controller-manager/blob/e21b931ce25cc57b4267fce80e69839179854c20/pkg/controller/deployment_machineset_util.go#L128-L135

Otherwise, in some failure scenarios it can be difficult to catch the failed operations and their reasons, because the create and delete operations happen very quickly. As a quick experiment, return a random failure here: https://github.com/gardener/machine-controller-manager/blob/e21b931ce25cc57b4267fce80e69839179854c20/pkg/controller/machine_bootstrap_token.go#L56

In the observation reported in this issue, once the machine joins and the MachineDeployment becomes healthy (all machines have joined the cluster), the `failedMachines` field would have been cleared.

Please let me know if the current behaviour causes any concrete problem; we can take corrective measures accordingly. Thanks.
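
The behaviour described above can be sketched roughly as follows. This is an illustrative Go snippet, not the actual MCM code (the type and function names here are made up; the real logic lives in `deployment_machineset_util.go`, linked above): the `failedMachines` list is cleared only once *all* machines are healthy, otherwise the previously recorded failures are kept even if the individual Machine has since moved on.

```go
package main

import "fmt"

// MachineSummary loosely mirrors the failedMachines entries shown in the
// issue. Field and type names are illustrative, not the real MCM types.
type MachineSummary struct {
	Name  string
	State string // e.g. "Failed", "Processing", "Running"
}

// syncFailedMachines sketches the reset behaviour described above:
// the failedMachines list is reset (returned as nil) only when every
// machine is Running; otherwise the stale entries are kept as-is.
func syncFailedMachines(failed, machines []MachineSummary) []MachineSummary {
	for _, m := range machines {
		if m.State != "Running" {
			// At least one machine is unhealthy or still joining:
			// keep the previously recorded failures.
			return failed
		}
	}
	// All machines healthy: clear .status.failedMachines.
	return nil
}

func main() {
	failed := []MachineSummary{{Name: "worker-pxzp5", State: "Failed"}}

	// The machine is now Processing (credentials fixed) but not yet
	// Running, so the stale Failed entry is retained — the situation
	// reported in this issue.
	fmt.Println(len(syncFailedMachines(failed, []MachineSummary{{Name: "worker-pxzp5", State: "Processing"}}))) // prints 1

	// Once all machines are Running, the list is reset.
	fmt.Println(len(syncFailedMachines(failed, []MachineSummary{{Name: "worker-pxzp5", State: "Running"}}))) // prints 0
}
```

In this sketch, rfranzke's request amounts to also dropping (or refreshing) an entry as soon as its corresponding Machine's `lastOperation` changes, rather than waiting for the whole deployment to become healthy.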

rfranzke commented 4 years ago

I still think this should be improved, i.e., the status should reflect the actual state of the world (which it currently does not - the machine is in the processing of joining the cluster and no longer failed). It's not too critical/important in the sense that we need it tomorrow, but it would improve user and ops experience/productivity to not show false negative status information.