flux-framework / flux-operator

Deploy a Flux MiniCluster to Kubernetes with the operator
https://flux-framework.org/flux-operator/
MIT License
30 stars 8 forks

Containers should be resilient to starting in any order #35

Closed vsoch closed 1 year ago

vsoch commented 1 year ago

This was feedback from the batch mailing list:

I think the general stance is that containers should be resilient to starting in a different order.

https://groups.google.com/a/kubernetes.io/g/wg-batch/c/u3eIlyo4F3g/m/0JQz2FU8BAAJ

This means that we likely need something in Flux that is more resilient than what is currently supported. What currently happens is:

I haven't yet tested more rigorously with the RESTful server - the above is based on launching a job command and then having the main broker start with only a subset of the workers ready. Another thing I saw was that sometimes the workers would come online, then go offline for some reason, come online again, and not be able to re-register. I'm going to need to test this many more times with the RESTful API - it could be that just starting that server (via flux start) gives more flexibility, because we launch the job afterward. Even if we get something working with, say, 4 pods, I'm more worried about the case where we scale and have one broker for hundreds (or more) of worker pods.

But notably - the above is not resilient to containers starting in any order. We would need the flux broker (and workers) to be flexible about different start orders and timing. What I think I'd like to be able to do is start the flux broker, and then have it dynamically add existing (or TBA) workers as they come up.

garlick commented 1 year ago

On a cluster, when flux runs as the native resource manager, we build the entire configuration including network and resource topology in advance and then

It's always been planned to allow Flux instances and Flux jobs to grow, meaning add resources previously unknown to them and configure the overlay dynamically, but it's a fairly big chunk of new design, so it will likely take a back seat to other issues we are facing with rolling Flux out in production on progressively larger commodity clusters, and then on coral2 (IOW not for a while).

So it's advantageous in the near term if we can make the cloud fit the current Flux design rather than the other way around. Maybe some kind of fixed virtual network config could be layered on top of the dynamic physical one and Flux could pretend it's on a big statically configured cluster with many of the nodes offline (even if they may never be online)?
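A rough sketch of what such a fixed config might look like, using flux-core's TOML bootstrap format (the hostnames, port, and file paths here are illustrative, not the operator's actual config): every eventual rank is listed up front, so Flux treats a pod that hasn't started yet the same as a node that is offline.

```toml
# Hypothetical bootstrap config for a 64-"node" virtual cluster.
# Only pods that actually exist will ever bind these addresses;
# Flux simply sees the remaining ranks as offline.
[bootstrap]
curve_cert = "/etc/flux/system/curve.cert"
default_port = 8050
default_bind = "tcp://eth0:%p"
default_connect = "tcp://%h:%p"

# Predetermined hostnames for every eventual rank, online or not.
hosts = [
    { host = "flux-sample-[0-63]" },
]
```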

vsoch commented 1 year ago

Ah gotcha! So I think we are following this topology - we are preparing the config to register all the eventual nodes, even if they aren't online yet. And then the behavior I'm seeing (and was worried about) with "launch the job immediately" is likely this point:

Flux starts running as soon as rank 0 is up (jobs could be submitted for example)

In simpler terms, rank 0 is up and starts the job, but the other nodes aren't ready yet, so they aren't used. This is potentially a drawback of what I'm calling the ephemeral launch approach, because we might miss using some nodes (details of ephemeral vs persistent here under command).

But this also means that if we start the server and there is some delay before the other nodes come up (the persistent cluster with the RESTful API), then by the time the user submits jobs, maybe we will use all of the available nodes. I will do more testing this week - I finished the docs today, so now I'm set up to try to containerize more HPC workloads and see what kind of stuff I can do!

Thanks for the feedback @garlick, and good to know this dynamism is in the queue for the future!

garlick commented 1 year ago

Just to be sure we're on the same page, and at the risk of repeating things already known:

Also: I thought the problem was that the rank 0 broker couldn't be started until all the pods were online, because only then would you have all the IP addresses to put in the static config? If that is correct, then could this work somehow?

Maybe some kind of fixed virtual network config could be layered on top of the dynamic physical one and Flux could pretend it's on a big statically configured cluster with many of the nodes offline (even if they may never be online)?

What I meant was maybe virtual IPs and interfaces could be hardwired in the config so that rank 0 could start before anything else, and once other pods joined the virtual network, their predetermined IPs could just start working? (apologies if I'm doubling down on an ignorant comment :-)
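For what it's worth, Kubernetes can get most of the way to "predetermined addresses" without virtual IPs: a headless Service in front of a StatefulSet gives every pod a predictable DNS name (`pod-name.service-name.namespace.svc`) that can be written into the broker config before the pod exists. A minimal sketch, with illustrative names and image:

```yaml
# Headless Service: no cluster IP, just stable per-pod DNS records.
apiVersion: v1
kind: Service
metadata:
  name: flux-sample
spec:
  clusterIP: None        # headless: DNS resolves directly to pod IPs
  selector:
    app: flux-sample
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: flux-sample
spec:
  serviceName: flux-sample   # pods get flux-sample-N.flux-sample DNS names
  replicas: 4
  selector:
    matchLabels:
      app: flux-sample
  template:
    metadata:
      labels:
        app: flux-sample
    spec:
      containers:
        - name: flux
          image: fluxrm/flux-sched:latest   # image name is illustrative
```

The DNS names are deterministic from the Service and StatefulSet names, so a static Flux config can reference `flux-sample-0` through `flux-sample-3` before any pod is scheduled.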

vsoch commented 1 year ago

This is fixed - we've now run experiments with 64 nodes, with the pods coming up in random order with respect to the broker. The strategy we use is to put the worker's start command in a loop (with some sleep), so that once the broker is up the worker will eventually connect, and it can exit when the broker finishes running the job.
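The worker loop can be sketched roughly like this (the function name, sleep interval, and config path are illustrative, not the operator's actual generated entrypoint):

```shell
#!/bin/sh
# Sketch of the worker retry pattern: run a command until it succeeds,
# sleeping between attempts. A worker pod that starts before rank 0
# keeps retrying until the lead broker is reachable, then exits once
# the broker session (and thus the job) is done.
RETRY_DELAY="${RETRY_DELAY:-5}"

retry_until() {
    until "$@"; do
        echo "broker not reachable yet; retrying in ${RETRY_DELAY}s" >&2
        sleep "$RETRY_DELAY"
    done
}

# In the worker pod, the call would be something like (path illustrative):
#   retry_until flux broker --config-path=/etc/flux/config
```

The key property is that a transient failure (broker not up yet) is indistinguishable from a pod starting "too early", so the same loop handles any start order.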