Containers should be resilient to starting in any order

vsoch commented 1 year ago

This was feedback from the batch mailing list:

I think the general stance is the containers should be resilient to start in a different order.

https://groups.google.com/a/kubernetes.io/g/wg-batch/c/u3eIlyo4F3g/m/0JQz2FU8BAAJ

This means that we likely need something in Flux that is more resilient than what is currently supported. What currently happens is:

All pods come online and don't see one another in /etc/hosts, so they wait with sleeps. What is happening on the operator is that as the pods are coming online, we are able to get their ip-addresss
When we see all pods, we update them with a shared /etc/hosts (and this requires a restart)
The pods restart, see the /etc/hosts, and then the worker pods (non broker / index 0) wait to see the shared munge key
All pods come online and wait until they see N= in their /etc/hosts. This is to ensure networking.
The pods then either identify as the main pod (which either installs and starts the restful server OR runs a user command), or if not, sleeps for 15 seconds to (hopefully) wait for the main one to start.
Main server (index 0, broker) is online by the time the others start, and they register
The cluster is up and we run the commands

I haven't yet tested more rigorously with the RESTFul server - the above is based on launching a job command and then having the main broker start with only a subset of the workers ready. Another thing I saw was that sometimes the workers would come online, and then for some reason go offline, come online again, and not be able to re-register. I'm going to need to test this many times again with the RESTful API - it could be just starting that server (via flux start) gives more flexibility because we launch the job after. Even if we get something to work with, say 4 pods, I'm more worried about the case when we scale and we have one broker for hundreds? More? worker pods.

But notably - the above is not resilient to containers starting in any order. We would need to be able for the flux broker (and workers) to be flexible to different orders and timing. What I think I'd like to be able to do is to start the flux broker, and then have it be able to dynamically add existing (or TBA workers) as they come up.

garlick commented 1 year ago

On a cluster, when flux runs as the native resource manager, we build the entire configuration including network and resource topology in advance and then

nodes/brokers can start in any order
Flux starts running as soon as rank 0 is up (jobs could be submitted for example)
The scheduler is kept informed of what resources are offline/online and starts jobs when it can
On a larger cluster we can configure the overlay topology to include "routers" (interior tree nodes) to alleviate the 1:N scaling somewhat (should be good to thousands of nodes)

It's always been planned to allow Flux instances and Flux jobs to grow, meaning add resources previously unknown to it and configure the overlay dynamically, but it's a fairly big chunk of new design, so it will likely take a back seat to other issues we are facing with rolling Flux out in production on progressively larger commodity clusters, and then on coral2 (IOW not for a while).

So it's advantageous in the near term if we can make the cloud fit the current Flux design rather than the other way around. Maybe some kind of fixed virtual network config could be layered on top of the dynamic physical one and Flux could pretend it's on a big statically configured cluster with many of the nodes offline (even if they may never be online)?

vsoch commented 1 year ago

Ah gotcha! So I think we are following this topology - we are preparing the config to register all the eventual nodes, even if they aren't online yet. And then the behavior I'm seeing (and was worried about) with "launch the job immediately' is likely this point:

Flux starts running as soon as rank 0 is up (jobs could be submitted for example)

In simpler terms, rank 0 is up, starts the job, and other nodes aren't ready yet and they aren't used. So this is potentially a drawback to what I'm calling the ephemeral launch approach, because we might miss using some nodes (details of ephemeral vs persistent here under command).

But this also means if we start the server and there is some delay for the other nodes to come up (the persistent cluster with the RESTful API), by the time the user submits the jobs, maybe we will use all of the nodes available. I will do more testing this week - I finished the docs today, so now I'm setup to try to containerize more HPC workloads and see what kind of stuff i can do!

Thanks for the feedback @garlick and good to know this dynamic-ism is on the queue for the future!

garlick commented 1 year ago

Just to be sure we're on the same page, and at the risk of repeating things already known:

if the broker is running with the -Sbroker.quorum=0 option then the rank 0 broker starts by itself, with only itself available to run jobs. Jobs requiring more nodes can still be submitted and will remain pending until enough nodes come online for the scheduler to fulfill the jobs' resource requirements. This is how we run the clusters (we have to be able to function with nodes offline)
You can avoid potential races between restful server startup and broker startup by running the restful server as the initial program of the broker. Since that's how batch jobs work, you can be sure that the initial program isn't started until Flux is fully operational.
On a cluster, systemd waits for the network to be fully configured before starting the broker, and then it automatically restarts brokers that fail to start up for some reason. Maybe that masks start up problems that you're dealing with here, and something needs to provide that capability in the pod if systemd is unavailable?

Also: I thought the problem was that the rank 0 broker couldn't be started until all the pods were online, because only then would you have all the IP addresses to put in the static config? If that is correct, then could this work somehow?

Maybe some kind of fixed virtual network config could be layered on top of the dynamic physical one and Flux could pretend it's on a big statically configured cluster with many of the nodes offline (even if they may never be online)?

What I meant was maybe virtual IPs and interfaces could be hardwired in the config so that rank 0 could start before anything else, and once other pods joined the vitual network, their predetermined IPs could just start working? (apologies if I'm doubling down on an ignorant comment :-)

vsoch commented 1 year ago

This is fixed - we've now run experiments with 64 nodes and having the pods come up in random order with respect to the broker. The strategy we use is to put the command for the worker to start in a loop (with some sleep) so when the broker is up it will eventually connect, and then it can exit when the broker finishes running the job.

flux-framework / flux-operator

Containers should be resilient to starting in any order #35