cloudfoundry-incubator / kubecf

Cloud Foundry on Kubernetes
Apache License 2.0

When shutting down a node and bringing it back up, KubeCF will not restart apps when using Eirini. #1548

Open gaktive opened 3 years ago

gaktive commented 3 years ago

Describe the bug

Placeholder until we get @satadruroy to confirm what @troytop spotted -- when a Kubernetes node goes down with KubeCF using Eirini, once the node comes back up, all apps associated with that node need to be manually started. This should be an automatic process as we see in Diego.

To Reproduce

Expected behavior

When a node goes down and comes back up, apps come back automatically.

satadruroy commented 3 years ago

I was able to repro this - sort of. My setup - AKS with 3 nodes, Eirini HA deployment.

  1. Deploy app with 3 instances.

cf apps

#0   running   2020-11-17T00:23:11Z   0.3%   264.6M of 1G   160K of 1G
#1   running   2020-11-17T00:23:11Z   0.2%   274.9M of 1G   160K of 1G
#2   running   2020-11-17T00:23:12Z   0.2%   267.2M of 1G   160K of 1G

Check to make sure the pods with the apps are scheduled on 3 different nodes.

  2. Stop one of the nodes. (AKS does not detect the node shutdown and automatically boot up a replacement.)

cf apps output snippet:

     state     since                  cpu    memory         disk         details
#0   running   2020-11-17T00:23:11Z   0.0%   0 of 1G        0 of 1G
#1   running   2020-11-17T00:23:11Z   0.2%   275M of 1G     160K of 1G
#2   running   2020-11-17T00:23:12Z   0.2%   267.4M of 1G   160K of 1G

Instance #0 still shows as running, whereas in reality it is stopped because its node is stopped.

  3. Restart the node - wait for it to complete...
     state     since                  cpu    memory         disk         details
#0   crashed   2020-11-17T00:23:12Z   0.0%   0 of 1G        0 of 1G
#1   running   2020-11-17T00:23:12Z   0.2%   275.2M of 1G   160K of 1G
#2   running   2020-11-17T00:23:13Z   0.2%   267.6M of 1G   160K of 1G

... even though the pods are back up and running.

[Screenshot: Screen Shot 2020-11-16 at 5 46 07 PM]

The events-reporter in eirini-events namespace has the following error:

[Screenshot: Screen Shot 2020-11-16 at 5 51 30 PM]

But eventually it recovers on its own.

     state     since                  cpu    memory         disk         details
#0   running   2020-11-17T01:36:41Z   0.3%   319.9M of 1G   164K of 1G
#1   running   2020-11-17T00:23:12Z   0.2%   275.4M of 1G   160K of 1G
#2   running   2020-11-17T00:23:13Z   0.2%   267.6M of 1G   160K of 1G

So this could just be Eirini being slow to catch up with the node status.
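The check in step 1 (that the three app pods are scheduled on three different nodes) can be scripted. The following is only a minimal sketch, assuming you have saved or piped the output of `kubectl get pods -o json` for the namespace that holds the app pods (the namespace is not stated in this thread, so it is left to the reader). The function names are hypothetical; the only Kubernetes fields used are `metadata.name` and `spec.nodeName`.

```python
from collections import defaultdict


def pods_per_node(pod_list):
    """Group pod names by the node they are scheduled on.

    `pod_list` is the parsed JSON of `kubectl get pods -o json`,
    i.e. a Kubernetes List object with an `items` array.
    """
    by_node = defaultdict(list)
    for pod in pod_list.get("items", []):
        # spec.nodeName is absent while a pod has not been scheduled yet,
        # so such pods are grouped under the key None.
        node = pod.get("spec", {}).get("nodeName")
        by_node[node].append(pod["metadata"]["name"])
    return dict(by_node)


def spread_across_distinct_nodes(pod_list):
    """Return True when every pod is scheduled and no node hosts two of them."""
    by_node = pods_per_node(pod_list)
    if None in by_node:
        return False
    return all(len(names) == 1 for names in by_node.values())
```

To use it, load the saved JSON with `json.load` and pass the resulting dictionary to `spread_across_distinct_nodes`. A `False` result means the node shutdown in step 2 may take down more than one instance, or none at all.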

However, @troytop mentioned this doesn't recover on CaaSP without a manual app restart. @viccuad or @svollath do you mind trying a repro of this on CaaSP?

viovanov commented 3 years ago

ping @jimmykarily

jimmykarily commented 3 years ago

Other than the event reporter error, everything else seems to be fine? I mean, the app instance appears as crashed at some point and eventually recovers. Regarding the event reporter error, I remember seeing that before even when nothing else seemed wrong. I think the first instance of the app is missing the app index at the end of the pod name (-0 etc). Should it always be there @cloudfoundry-incubator/eirini ?

jimmykarily commented 3 years ago

Created a story to investigate the event reporter errors here:

The rest of this issue may be irrelevant though.

kieron-dev commented 3 years ago

Hi. Here are the Eirini team's findings from the above story:

The event-reporter was changed in eirini-1.9 and it now only listens to updates on pods labelled with APP. The error you saw was probably to do with a staging pod. The event reporter will ignore these now.

We've experimented with deleting a k8s node. Eirini actually isn't involved in recreation of any apps. It's purely a k8s concern. We see k8s successfully rescheduling lost pods on remaining nodes as soon as it's aware of the deleted node disappearing.

So it looks like everything is behaving correctly, and there's nothing for us to do in eirini for this.

jandubois commented 3 years ago

The event-reporter was changed in eirini-1.9 and it now only listens to updates on [...]

Just pointing out that kubecf is still using Eirini-1.8, in case that is relevant.

viccuad commented 3 years ago

Reproduced on CaaSP 4.5.1 (tf4 machine, 3 worker nodes). Everything seems fine from the CAP and Eirini side.

Hard-powering a node off means that CaaSP marks the node as NotReady and marks the pods as Terminating, yet it never finishes terminating them. Hence, they don't get moved. Depending on your luck, a CAP HA deployment may survive. Cordoning the node after losing it makes no difference. Cordoning the node instead of a hard power-off moves the pods to other nodes correctly.
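The stuck state described above (pods left in Terminating on a NotReady node) can be detected from the API objects. The following is a rough sketch rather than anything KubeCF or Eirini ships: it assumes the parsed JSON of `kubectl get pods -o json` and `kubectl get nodes -o json` as input, and the function names are hypothetical. It relies on two standard Kubernetes fields: a pod is shown as Terminating once `metadata.deletionTimestamp` is set, and a node is reported NotReady when its `Ready` condition has status `False` or `Unknown`.

```python
def not_ready_nodes(node_list):
    """Return the names of nodes whose Ready condition is not "True"."""
    names = set()
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        ready = next((c for c in conditions if c.get("type") == "Ready"), None)
        # kubectl prints NotReady for status "False" and "Unknown";
        # a missing Ready condition is treated the same way here.
        if ready is None or ready.get("status") != "True":
            names.add(node["metadata"]["name"])
    return names


def stuck_terminating_pods(pod_list, node_list):
    """Return names of pods marked for deletion that sit on a NotReady node."""
    bad_nodes = not_ready_nodes(node_list)
    stuck = []
    for pod in pod_list.get("items", []):
        marked_for_deletion = "deletionTimestamp" in pod["metadata"]
        on_bad_node = pod.get("spec", {}).get("nodeName") in bad_nodes
        if marked_for_deletion and on_bad_node:
            stuck.append(pod["metadata"]["name"])
    return stuck
```

A non-empty result points at the pods that will not be rescheduled until the node returns or is removed from the cluster.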

troytop commented 3 years ago

Looks like a Kubernetes config/distro issue rather than a KubeCF one. We can provide docs advice on how to recover from a hard shutdown (i.e. disaster recovery). Need to tell people in CAP Docs (probably beyond the scope of kubecf docs) how to replace or remove missing kubernetes nodes so that kubecf recovers properly. If this is something that can be automated (e.g. with the CCM or external Kubernetes monitoring) that should be mentioned with a reference or link to supporting information.

satadruroy commented 3 years ago

If we kubectl delete the node the workload should move over to a healthy node.

@viccuad once you brought the node back up did the apps recover?

viccuad commented 3 years ago

Yes, replicated it and they do.