Open camille-rodriguez opened 1 year ago

When a Kubernetes node goes down, it would be great if the Kubeflow pods would evacuate and respawn on another node. At the moment, pods get stuck at "Terminating" and the failover never completes. This hinders the natural HA capability of Kubernetes.

How to test: shut down a node hosting Kubeflow pods.

What currently works: if you reboot the node, the pods all eventually come back once the node is back online.
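A minimal sketch (assuming the official `kubernetes` Python client, a local kubeconfig, and the default `kubeflow` namespace) of how the stuck pods can be spotted after shutting the node down:

```python
# Sketch: list Kubeflow pods that are stuck terminating after a node goes down.
# Assumes the official `kubernetes` Python client and a kubeconfig on this machine;
# the "kubeflow" namespace is an assumption based on a default install.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

for pod in v1.list_namespaced_pod(namespace="kubeflow").items:
    # A pod being deleted has metadata.deletion_timestamp set; if it never
    # disappears, that is the "stuck at Terminating" state described above.
    if pod.metadata.deletion_timestamp is not None:
        print(f"{pod.metadata.name} on node {pod.spec.node_name} "
              f"terminating since {pod.metadata.deletion_timestamp}")
```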
@camille-rodriguez I agree with you, although I think the problem is mainly the high availability of the Juju controller. In my cluster with 3 MicroK8s nodes, the controller only runs on the first node; if that node goes down, there is no Juju controller left on the other two! Thanks.
Hi @moula, agreed, this is also a problem. The Juju bug is reported here: https://bugs.launchpad.net/juju/+bug/1849030
Hey @camille-rodriguez, I was taking a look at this issue.

The behaviour you describe is rooted in the fact that the Charms use a K8s StatefulSet under the hood. I'll assume that when you mention "a node goes down" you mean something happened to it, e.g. it crashed, and not that an admin explicitly removed the Node object from the K8s API Server.
Adding some references from the K8s docs/issues
So what's happening here is the expected behaviour as far as K8s is concerned. Per K8s best practices, if a node is unexpectedly down and we expect it to stay down and not come back, then an admin should delete the Node object from the API server. Otherwise, K8s expects the node to come back up and the stateful workloads to "re-appear" on it.
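To make the "admin deletes the Node object" step concrete, a rough sketch with the `kubernetes` Python client (the node name is a placeholder); it has the same effect as running `kubectl delete node <node-name>`:

```python
# Sketch of the "admin confirms the node is gone" step described above.
# Removing the Node object tells K8s the node will not come back, which is what
# lets the StatefulSet controller recreate its pods on another node.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

DEAD_NODE = "worker-2"  # hypothetical name of the node that went down

v1.delete_node(name=DEAD_NODE)  # equivalent to `kubectl delete node worker-2`
```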
Now there's the question: are all the Charms stateful, so that they need to run as a StatefulSet rather than as a Deployment (in which case a Pod would get recreated elsewhere)? I expect not, and maybe it makes sense to have an option in the Charms to configure this, but I'd confirm with the Juju team.
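For reference, a small sketch (the namespace is an assumption, adjust to the Juju model in use) that walks each pod's ownerReferences to show whether it is currently managed by a StatefulSet or a Deployment/ReplicaSet:

```python
# Sketch: print the controller owning each pod. StatefulSet pods are owned directly
# by the StatefulSet; Deployment pods are owned by an intermediate ReplicaSet.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

for pod in v1.list_namespaced_pod(namespace="kubeflow").items:
    owners = pod.metadata.owner_references or []
    kinds = ", ".join(f"{o.kind}/{o.name}" for o in owners)
    print(f"{pod.metadata.name}: owned by {kinds or 'nothing'}")
```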