Open gaktive opened 3 years ago
I was able to repro this - sort of. My setup - AKS with 3 nodes, Eirini HA deployment.
cf apps
#0   running   2020-11-17T00:23:11Z   0.3%   264.6M of 1G   160K of 1G
#1   running   2020-11-17T00:23:11Z   0.2%   274.9M of 1G   160K of 1G
#2   running   2020-11-17T00:23:12Z   0.2%   267.2M of 1G   160K of 1G
Check to make sure the pods with the apps are scheduled on 3 different nodes.
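One way to script that check is sketched below. It assumes the pod list has been exported with something like `kubectl get pods -n eirini -o json` (the `eirini` namespace is an assumption about the deployment) and loaded with `json.load`; the helper names are hypothetical. It reads only the standard `metadata.name` and `spec.nodeName` fields.

```python
from collections import defaultdict


def pods_by_node(pod_list: dict) -> dict:
    """Group pod names by the node they are scheduled on.

    `pod_list` is the parsed JSON of a Kubernetes PodList. Pods that are
    not scheduled yet have no spec.nodeName and end up under the key None.
    """
    grouped = defaultdict(list)
    for pod in pod_list.get("items", []):
        node = pod.get("spec", {}).get("nodeName")
        grouped[node].append(pod["metadata"]["name"])
    return dict(grouped)


def on_distinct_nodes(pod_list: dict) -> bool:
    """True when every pod is scheduled and no two pods share a node."""
    grouped = pods_by_node(pod_list)
    if not grouped or None in grouped:
        return False
    return all(len(names) == 1 for names in grouped.values())
```

If `on_distinct_nodes` returns False, `pods_by_node` shows which node holds more than one instance.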
cf apps
output snippet:
     state     since                  cpu    memory         disk         details
#0   running   2020-11-17T00:23:11Z   0.0%   0 of 1G        0 of 1G
#1   running   2020-11-17T00:23:11Z   0.2%   275M of 1G     160K of 1G
#2   running   2020-11-17T00:23:12Z   0.2%   267.4M of 1G   160K of 1G
The output of #0 shows running whereas in reality it is stopped as the node is stopped.
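A rough way to spot this situation from the CLI output is sketched below. It assumes the instance rows look exactly like the snippets in this thread, and it treats "running with 0 memory and 0 disk" as suspicious. That rule is only a heuristic drawn from the output above, not a documented cf CLI signal. The names `Instance`, `parse_instances` and `suspicious` are hypothetical.

```python
import re
from typing import List, NamedTuple


class Instance(NamedTuple):
    index: int
    state: str
    since: str
    cpu: float
    memory: str
    disk: str


# Matches rows such as: "#0 running 2020-11-17T00:23:11Z 0.0% 0 of 1G 0 of 1G"
ROW = re.compile(
    r"^#(?P<index>\d+)\s+(?P<state>\S+)\s+(?P<since>\S+)\s+(?P<cpu>[\d.]+)%\s+"
    r"(?P<memory>\S+) of \S+\s+(?P<disk>\S+) of \S+"
)


def parse_instances(text: str) -> List[Instance]:
    """Extract instance rows from the text output; other lines are skipped."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m:
            rows.append(Instance(int(m["index"]), m["state"], m["since"],
                                 float(m["cpu"]), m["memory"], m["disk"]))
    return rows


def suspicious(instances: List[Instance]) -> List[Instance]:
    """Instances reported as running that show no memory or disk usage."""
    return [i for i in instances
            if i.state == "running" and i.memory == "0" and i.disk == "0"]
```

Run against the snippet above, this flags only instance #0.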
     state     since                  cpu    memory         disk         details
#0   crashed   2020-11-17T00:23:12Z   0.0%   0 of 1G        0 of 1G
#1   running   2020-11-17T00:23:12Z   0.2%   275.2M of 1G   160K of 1G
#2   running   2020-11-17T00:23:13Z   0.2%   267.6M of 1G   160K of 1G
... even though the pods are back up and running.
The events-reporter in the eirini-events namespace has the following error:
But eventually it recovers on its own.
     state     since                  cpu    memory         disk         details
#0   running   2020-11-17T01:36:41Z   0.3%   319.9M of 1G   164K of 1G
#1   running   2020-11-17T00:23:12Z   0.2%   275.4M of 1G   160K of 1G
#2   running   2020-11-17T00:23:13Z   0.2%   267.6M of 1G   160K of 1G
So this could just be Eirini being slow to catch up with the node status.
However, @troytop mentioned this doesn't recover on CaaSP without a manual app restart. @viccuad or @svollath do you mind trying a repro of this on CaaSP?
ping @jimmykarily
Other than the event reporter error, everything else seems to be fine? I mean, the app instance appears as crashed at some point and eventually recovers.
Regarding the event reporter error, I remember seeing that before, even when nothing else seemed wrong. I think the first instance of the app is missing the app index at the end of the pod name (-0 etc). Should it always be there @cloudfoundry-incubator/eirini ?
Created a story to investigate the event reporter errors here: https://www.pivotaltracker.com/story/show/175814747
The rest of this issue may be irrelevant though.
Hi. Here's the eirini team's findings from the above story:
The event-reporter was changed in eirini-1.9 and it now only listens to updates on pods labelled with cloudfoundry.org/source_type: APP. The error you saw was probably to do with a staging pod. The event reporter will ignore these now.
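To illustrate the filtering described here, a minimal sketch of selecting only app pods by that label follows. This is not Eirini's actual code (Eirini is written in Go). The helper names are hypothetical, and pods are represented as parsed Kubernetes JSON objects.

```python
from typing import Iterable, List

SOURCE_TYPE_LABEL = "cloudfoundry.org/source_type"


def is_app_pod(pod: dict) -> bool:
    """True when the pod carries the label cloudfoundry.org/source_type: APP."""
    labels = pod.get("metadata", {}).get("labels") or {}
    return labels.get(SOURCE_TYPE_LABEL) == "APP"


def app_pods(pods: Iterable[dict]) -> List[dict]:
    """Keep only app pods; staging pods and unlabelled pods are dropped."""
    return [pod for pod in pods if is_app_pod(pod)]
```

With a filter like this, an update for a staging pod never reaches the reporting logic, which matches the explanation that such pods are now ignored.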
We've experimented with deleting a k8s node. Eirini actually isn't involved in recreation of any apps. It's purely a k8s concern. We see k8s successfully rescheduling lost pods on remaining nodes as soon as it's aware of the deleted node disappearing.
So it looks like everything is behaving correctly, and there's nothing for us to do in eirini for this.
The event-reporter was changed in eirini-1.9 and it now only listens to updates on [...]
Just pointing out that kubecf is still using Eirini-1.8, in case that is relevant.
Reproduced on CaaSP 4.5.1 (tf4 machine, 3 worker nodes). Everything seems fine from the CAP and Eirini side.
Hard-powering a node off means that CaaSP marks the node as NotReady and marks its pods as Terminating, yet it never finishes terminating them. Hence, they don't get moved. Depending on your luck, a CAP HA deployment may survive. Cordoning the node after losing it makes no difference. Cordoning the node instead of a hard poweroff moves the pods to other nodes correctly.
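The stuck state described above can be detected from exported cluster state. The sketch below assumes the output of `kubectl get nodes -o json` and `kubectl get pods --all-namespaces -o json` has been parsed into dicts; the function names are hypothetical. It relies on two standard fields: a node's `Ready` condition, and a pod's `metadata.deletionTimestamp`, which is set while kubectl shows the pod as Terminating.

```python
from typing import List, Set


def not_ready_nodes(node_list: dict) -> Set[str]:
    """Names of nodes whose Ready condition is missing or not "True"."""
    names = set()
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        ready = next((c for c in conditions if c.get("type") == "Ready"), None)
        if ready is None or ready.get("status") != "True":
            names.add(node["metadata"]["name"])
    return names


def stuck_terminating(pod_list: dict, node_list: dict) -> List[str]:
    """Pods marked for deletion that are still bound to a NotReady node."""
    bad_nodes = not_ready_nodes(node_list)
    return [
        pod["metadata"]["name"]
        for pod in pod_list.get("items", [])
        if pod["metadata"].get("deletionTimestamp")
        and pod.get("spec", {}).get("nodeName") in bad_nodes
    ]
```

A non-empty result after a hard poweroff would correspond to the pods that never finish terminating and therefore never get rescheduled.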
Looks like a Kubernetes config/distro issue rather than a KubeCF one. We can provide docs advice on how to recover from a hard shutdown (i.e. disaster recovery). We need to tell people in the CAP docs (probably beyond the scope of the kubecf docs) how to replace or remove missing Kubernetes nodes so that kubecf recovers properly. If this is something that can be automated (e.g. with the CCM or external Kubernetes monitoring), that should be mentioned with a reference or link to supporting information.
If we kubectl delete the node, the workload should move over to a healthy node.
@viccuad once you brought the node back up did the apps recover?
Yes, replicated it and they do.
Describe the bug
Placeholder until we get @satadruroy to confirm what @troytop spotted -- when a Kubernetes node goes down with KubeCF using Eirini, once the node comes back up, all apps associated with that node need to be manually started. This should be an automatic process as we see in Diego.
To Reproduce
Expected behavior
When a node goes down and comes back up, apps come back automatically.