Feature request : high-availability

camille-rodriguez commented 1 year ago

When a Kubernetes node goes down, it would be great if the Kubeflow pods were evicted and respawned on another node. At the moment, pods get stuck at "Terminating" and the failover never completes. This hinders the natural HA capability of Kubernetes. How to test: shut down a node hosting Kubeflow pods. What currently works: if you reboot the node, the pods eventually all come back once the node is back online.

moula commented 1 year ago

@camille-rodriguez I agree with you, although I think the main problem is the high availability of the Juju controller. In my cluster with 3 MicroK8s nodes, the controller runs only on the first node; if that node goes down, there is no Juju controller on the other two. Thanks.

camille-rodriguez commented 1 year ago

Hi @moula, agreed, this is also a problem. The Juju bug is reported here: https://bugs.launchpad.net/juju/+bug/1849030

kimwnasptd commented 10 months ago

Hey @camille-rodriguez, I was taking a look at this issue.

The behaviour you describe is rooted in the fact that the Charms are using a K8s StatefulSet under the hood. I'll assume that when you say "a node goes down" you mean something unexpected happened, e.g. it crashed, and not that an admin explicitly removed the Node object from the K8s API Server.

Adding some references from the K8s docs/issues

So in this case what's happening is the expected behaviour in terms of K8s. By K8s best practices, if a node is unexpectedly down and we expect it to stay down and not come up again, then an admin should delete the Node object from the API server. Otherwise, we expect the node to come back up and the stateful workloads to "re-appear" to K8s.
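To see which pods are in this state, here is a minimal Python sketch that works on the parsed JSON from `kubectl get nodes -o json` and `kubectl get pods -A -o json`. It relies only on standard fields of the K8s API objects (a node's `Ready` condition, and `metadata.deletionTimestamp` on pods marked for deletion). The function names and the sample node and pod names are hypothetical, made up for illustration.

```python
def not_ready_nodes(node_list):
    """Return the names of nodes whose Ready condition is not "True"."""
    names = set()
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        ready = next((c for c in conditions if c.get("type") == "Ready"), None)
        # An unreachable node usually reports Ready with status "Unknown".
        if ready is None or ready.get("status") != "True":
            names.add(node["metadata"]["name"])
    return names


def stuck_terminating_pods(pod_list, node_list):
    """Return (namespace, pod, node) for pods marked for deletion on not-ready nodes."""
    down = not_ready_nodes(node_list)
    stuck = []
    for pod in pod_list.get("items", []):
        meta = pod["metadata"]
        node_name = pod.get("spec", {}).get("nodeName")
        # A pod shown as "Terminating" has metadata.deletionTimestamp set.
        if "deletionTimestamp" in meta and node_name in down:
            stuck.append((meta.get("namespace", "default"), meta["name"], node_name))
    return stuck


# Hypothetical sample data, shaped like the kubectl JSON output.
SAMPLE_NODES = {"items": [
    {"metadata": {"name": "node-1"},
     "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
    {"metadata": {"name": "node-2"},
     "status": {"conditions": [{"type": "Ready", "status": "Unknown"}]}},
]}
SAMPLE_PODS = {"items": [
    {"metadata": {"name": "pod-a", "namespace": "kubeflow"},
     "spec": {"nodeName": "node-1"}},
    {"metadata": {"name": "pod-b-0", "namespace": "kubeflow",
                  "deletionTimestamp": "2021-01-01T00:00:00Z"},
     "spec": {"nodeName": "node-2"}},
]}

if __name__ == "__main__":
    for namespace, name, node in stuck_terminating_pods(SAMPLE_PODS, SAMPLE_NODES):
        print(f"{namespace}/{name} is stuck terminating on {node}")
```

With real data, the two dictionaries would come from `json.load` on the kubectl output instead of the inline samples.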

Now then, there's the question: are all the Charms stateful, so that they need to run as a StatefulSet rather than as a Deployment (in which case a Pod would get created elsewhere)?
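To illustrate why the workload type matters here, the following is a toy Python model, not real controller code. It assumes a simplified view of the two controllers: a Deployment's ReplicaSet ignores pods that are already marked for deletion when counting replicas, while a StatefulSet will not recreate a pod identity until the old pod object is gone from the API. All class, function, and pod names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Pod:
    name: str
    node: str
    terminating: bool = False  # marked for deletion, but the node never confirmed it


def reconcile_deployment(pods: List[Pod], replicas: int, healthy_node: str) -> List[Pod]:
    """ReplicaSet-style: terminating pods do not count, so replacements are created."""
    active = [p for p in pods if not p.terminating]
    missing = max(0, replicas - len(active))
    return [Pod(name=f"web-replacement-{i}", node=healthy_node) for i in range(missing)]


def reconcile_statefulset(pods: List[Pod], replicas: int, healthy_node: str,
                          base: str = "db") -> List[Pod]:
    """StatefulSet-style: one pod per ordinal; recreate only if the pod object is gone."""
    existing = {p.name for p in pods}  # terminating pods still exist in the API
    return [Pod(name=f"{base}-{i}", node=healthy_node)
            for i in range(replicas) if f"{base}-{i}" not in existing]


if __name__ == "__main__":
    # The failed node hosted one pod of each workload; both are stuck terminating.
    web = [Pod("web-abc", "failed-node", terminating=True)]
    db = [Pod("db-0", "failed-node", terminating=True)]

    print("Deployment creates:", reconcile_deployment(web, 1, "healthy-node"))
    print("StatefulSet creates:", reconcile_statefulset(db, 1, "healthy-node"))

    # Once the Node object is deleted, the stuck pod object is removed as well.
    print("StatefulSet after cleanup:", reconcile_statefulset([], 1, "healthy-node"))
```

In this model the Deployment gets a replacement pod right away, while the StatefulSet does nothing until the stuck pod object disappears, which matches the "Terminating" behaviour described in this issue.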

I expect not, and maybe it makes sense to have an option in the Charms to configure this. But I'd confirm with the Juju team first.

canonical / bundle-kubeflow

Feature request : high-availability #561