flux-framework / flux-core

core services for the Flux resource management framework
GNU Lesser General Public License v3.0

Interested in elastic deployments on Slurm #6213

Open JBlaschke opened 1 month ago

JBlaschke commented 1 month ago

I'm interested in dynamically adding and removing nodes in a Flux deployment running under Slurm. I'm aware that similar functionality exists for K8s (https://flux-framework.org/flux-operator/). Since Slurm is the primary scheduler for NERSC's current and next HPC resources, we're interested in running Flux within an elastic Slurm allocation.

Happy to continue the discussion that I started with @milroy on Slack (https://llnl-performance.slack.com/archives/CBNV6RG8Y/p1723574723197099); and happy to contribute time/test things on Perlmutter.

Tagging @namehta4 who's interested in this also.

grondo commented 1 month ago

Currently, a Flux instance will continue running if it loses a broker rank (node) that is not deemed critical. Critical ranks include rank 0 (the first node of the instance) and also any interior nodes (they route messages). To make a Flux instance as resilient as possible to removed/lost nodes, you can configure the overlay as a flat tree by setting the tbon.topo broker attribute to a large kary value, e.g. kary:255 if your instance has 256 nodes or fewer (the default is kary:32). (Broker attributes are set with the flux start -S, --setattr option.)
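
For example, a rough sketch of launching a 256-node instance with a flat overlay inside a Slurm allocation (the srun arguments and the workflow script name are placeholders):

```sh
# Start one Flux broker per node of the Slurm allocation, flattening the
# tree-based overlay so no compute node becomes a critical interior router.
# ./my_workflow.sh is a hypothetical script run on rank 0 once the instance
# is up; the instance exits when it completes.
srun -N 256 --ntasks-per-node=1 flux start -Stbon.topo=kary:255 ./my_workflow.sh
```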

Dynamically adding nodes, however, is an area of ongoing work which has unfortunately taken a backseat to other important development recently.

One potential way to accomplish this would be to set up a bootstrap configuration listing all possible nodes when first launching the Flux instance. Any node not currently running a Flux broker would simply appear as down in the instance. When a node is added later, its Flux broker can use this same configuration to join the instance.
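
A rough sketch of what such a configuration might look like (the certificate path, interface, port, and hostlist below are placeholders; flux-config-bootstrap(5) describes the authoritative format):

```sh
# Hypothetical bootstrap config listing every node that could ever join;
# brokers that are not running yet simply show up as down in the instance.
cat >conf.d/bootstrap.toml <<'EOF'
[bootstrap]
curve_cert = "/path/to/curve.cert"     # shared CURVE certificate
default_port = 8050
default_bind = "tcp://hsn0:%p"         # %p expands to the port
default_connect = "tcp://%h:%p"        # %h expands to the host
hosts = [
    { host = "nid[000001-001536]" },   # all candidate nodes
]
EOF
```

A broker started against this config takes the rank whose entry in the hosts array matches its hostname.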

There's also a solution proposed in #5184, but it was never merged. This method allows starting an instance with a larger size than the number of nodes actually present, and adding arbitrary nodes up to that size. The missing nodes initially have placeholder hostnames, and new brokers join by using ssh to connect to the existing rank 0 broker.

If there is interest in this second method, feel free to test on that branch, though we should rebase the PR first to ensure you're working with the most recent version of Flux.

JBlaschke commented 1 month ago

Hi @grondo thanks for the info!

> Dynamically adding nodes, however, is an area of ongoing work which has unfortunately taken a backseat to other important development recently.

I understand. We've all been there 😄

Here are my thoughts so far:

  1. My instinct is to go with #5184 -- or something similar -- in the long run as Perlmutter has a lot of nodes, so it might be a significant burden to maintain a bootstrap configuration for the whole system (and update it whenever the system changes).
  2. Creating a bootstrap configuration from a reservation might be an alternate approach. One application I'm thinking of is a large reservation that is allowed to fill with preemptible jobs. When an unpredictable urgent/realtime job comes in, the preemptible jobs are kicked out of the reservation and those nodes are given to Flux. From Flux's perspective those nodes would be down, then up, then down again (after the urgent job is done and we allow the reservation to fill with preemptible jobs again). From my reading of the docs and what @grondo mentioned, this should be possible with the latest Flux release, right?

> Critical ranks include rank 0 (the first node of the instance) and also any interior nodes (they route messages).

My reading of this is that regardless of whether we create a bootstrap configuration or use the solution in #5184, we need to make sure that rank 0 keeps running. I.e., the whole cluster is limited by the wallclock of the original job that started the Flux cluster, right? I.e., we cannot "hand off" the broker to another Slurm job?

I'll have to think about how #5184 works -- and then I would be happy to do some testing. Based on your description, something gives me pause:

> new brokers join by using ssh to connect to the existing rank 0 broker

We discourage using ssh at scale -- is there a way we can make the new brokers communicate with the rank 0 broker via tcp?

Cheers, Johannes

grondo commented 1 month ago

> My instinct is to go with https://github.com/flux-framework/flux-core/pull/5184 -- or something similar -- in the long run as Perlmutter has a lot of nodes, so it might be a significant burden to maintain a bootstrap configuration for the whole system (and update it whenever the system changes).

Yes, that or something like it is definitely the long-term plan. Ideally, resources could be added dynamically without needing placeholders, but that requires some engineering that hasn't been planned yet, unfortunately.

> From Flux's perspective those nodes would be down, then up, then down again (after the urgent job is done, and we allow the reservation to fill with preemptible jobs again). From my reading of the docs and what @grondo mentioned, this should be possible with the latest Flux release, right?

Yes, that is something that could work now, with a bit of work to start/stop brokers with the correct arguments and config, etc.
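
A rough sketch of that per-node start/stop, assuming the bootstrap-configuration approach described above (paths are placeholders):

```sh
# On a node newly handed over to Flux: start a broker against the same
# config directory the instance was bootstrapped with; its rank
# transitions from down to up once it joins.
flux broker --config-path=/path/to/conf.d &
BROKER_PID=$!

# When the node is reclaimed for an urgent job: terminate the local
# broker; the instance marks that rank down and keeps running.
kill -TERM "$BROKER_PID"
```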

> My reading of this is that regardless of whether we try creating a bootstrap configuration, or the solution in https://github.com/flux-framework/flux-core/pull/5184, we need to make sure that rank 0 keeps running. I.e the whole cluster is limited by the wallclock of the original job that started the Flux cluster right? I.e. we cannot "hand off" the broker to another slurm job?

It is possible to shut down a Flux instance and restart it, but I think there would currently be trouble moving the rank 0 broker to a different node. One short-term solution might be to keep rank 0 on a login node or similar, where the broker could keep running. That rank could be excluded from running jobs, much as is done for a system instance of Flux, where rank 0 runs on a management node.
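
For instance, a sketch of that exclusion borrowing the system-instance style of configuration (the config path is a placeholder):

```sh
# Keep the rank 0 (login) node out of the schedulable resource set so
# jobs only run on the compute ranks, much as a system instance
# withholds its management node.
cat >>conf.d/resource.toml <<'EOF'
[resource]
exclude = "0"    # exclude broker rank 0 from scheduling
EOF
```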

> We discourage using ssh at scale -- is there a way we can make the new brokers communicate with the rank 0 broker via tcp?

Well, all options are open at this point; I think #5184 is a kind of proof of concept right now. Part of the reason brokers in this approach have to ssh back to the existing instance is so that keys and configuration can be shared. It is difficult to imagine how to do a secure key exchange over TCP without already having keys. SSH is nice and simple (if passwordless ssh is enabled for the user), and the hope is that only a few nodes at a time would be joining with this solution.

But other solutions are certainly possible! @garlick is off this week, but I'm sure he'll have some input here when he returns.

garlick commented 1 month ago

Just curious, what kind of scale might we be talking about for #5184? Adding 1K nodes all at once? More?

When adding a broker there are actually two connections involved in the bootstrap:

  1. to rank 0, to provision a new rank and obtain the URI of the overlay network parent
  2. to the overlay network parent, to exchange keys

It is connection 1) where adding N nodes creates N connections to a single sshd, and I don't think that step really needs to be secured, so maybe there is a way to do it without ssh? Connection 2) will respect the overlay fanout, although a caveat is that adding a layer to the overlay tree does promote a set of nodes to a critical router role, even as it improves scalability.

garlick commented 1 month ago

Thinking about it a bit more, that first handshake really does need to be authenticated to the instance owner, or it becomes a trivial DoS target. The handshake itself is lightweight, so I would expect performance to be dominated by ssh connection establishment. I would think it would sort itself out pretty quickly if, say, 1K brokers were added at once. Maybe we could continue as designed and see if we hit a real-life problem there.

grondo commented 1 month ago

Just checking, there are no keys exchanged in that initial handshake? Sorry if I was misinformed and said there were when that wasn't true.

garlick commented 1 month ago

> there are no keys exchanged in that initial handshake?

Oh wait, I forgot something important about the current/proposed design! The two handshakes use the same rank 0 broker connection, even though the first one is an RPC to rank 0 and the second one is an RPC to the parent rank. Thank you for calling me out on that one -- I missed that fact in my initial review of the code.

So as far as ssh connections go, there is just the one to rank 0.

Sorry about that!