clusterinthecloud / support

If you need help with Cluster in the Cloud, this is the right place

Support for Placement Groups #32

Open ghost opened 3 years ago

ghost commented 3 years ago

As part of Terraform/#63 (AWS EFA support), support for AWS placement groups is required. I've been contemplating this a bit recently, as placement groups (AWS, Azure) and GCP Group Placement Policies are somewhat important for good performance with certain HPC jobs.

Placement groups are a great match for a single HPC job, or for a static set of nodes. They're not really suited to very elastic environments, or to environments where you mix and match instance types. While they can work there, you're just more likely to hit capacity issues and instances failing to launch.

There are also some restrictions that are challenging to support.

Thus, placement groups need to be a somewhat optional feature, and it would be nice to treat both AWS and GCP similarly, even though they have different restrictions.

I don't believe that we can create the placement groups as part of the Terraform process, as at that point, limits.yaml doesn't exist, and we don't know how big the cluster could be (affects GCP).

I don't believe that we can create the placement groups as part of the SLURM ResumeProgram call to startnode.py either, as that call isn't directly linked to a single job. Creating a group for every startnode call will get messy, as the nodes don't all terminate at a set time, so cleanup becomes a challenge. That said, I do believe that startnode ought to change so that all of the nodes SLURM wishes to start at once are launched in a single API call: the cloud scheduler is more likely to find space for the set of nodes, placed compactly (in the placement group), if they are all requested in a single call.
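
To illustrate the single-call idea, here is a minimal sketch using boto3 (the launch template name and the shape of the function are assumptions for illustration, not how startnode.py is currently structured):

```python
import boto3

ec2 = boto3.client("ec2")

def launch_batch(node_count, launch_template_name, placement_group=None):
    """Request all nodes in one API call so EC2 can place them together."""
    params = {
        "LaunchTemplate": {"LaunchTemplateName": launch_template_name},
        "MinCount": node_count,  # all-or-nothing: fail if the full set can't be placed
        "MaxCount": node_count,
    }
    if placement_group:
        params["Placement"] = {"GroupName": placement_group}
    return ec2.run_instances(**params)
```

With MinCount equal to MaxCount the request either places the whole set compactly or fails cleanly, rather than leaving a partial set of nodes scattered across the placement group.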

Suggested course

I'm currently thinking that update_config.py is our best spot for creating placement groups. Each call to update_config could clean up/delete existing placement groups that belong to our ${cluster_id}, and create new placement group(s).
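
A rough sketch of the cleanup half on AWS with boto3 (the "cluster" tag used to find a cluster's groups is an assumption, not something the current code sets; deletion also requires that no instances are still running in the group):

```python
import boto3

ec2 = boto3.client("ec2")

def clean_up_placement_groups(cluster_id):
    """Delete placement groups left over from a previous configuration."""
    groups = ec2.describe_placement_groups(
        Filters=[{"Name": "tag:cluster", "Values": [cluster_id]}]
    )["PlacementGroups"]
    for group in groups:
        ec2.delete_placement_group(GroupName=group["GroupName"])
```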

I feel like creating a placement group per shape defined in limits.yaml would make the most sense. This way, we would, for example, group C5n instances together, and group C6gn instances together, without trying to get AWS to find a way to compactly mix ARM and x86 instances.
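
The creation half might then look something like this (the structure of limits.yaml and the group naming scheme are illustrative assumptions):

```python
import boto3
import yaml

ec2 = boto3.client("ec2")

def create_placement_groups(cluster_id, limits_path="limits.yaml"):
    """Create one 'cluster' strategy placement group per shape in limits.yaml."""
    with open(limits_path) as f:
        limits = yaml.safe_load(f)  # assumed: a mapping of shape name -> node limit

    for shape in limits:
        ec2.create_placement_group(
            GroupName=f"{cluster_id}-{shape}",
            Strategy="cluster",
            TagSpecifications=[{
                "ResourceType": "placement-group",
                "Tags": [{"Key": "cluster", "Value": cluster_id}],
            }],
        )
```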

We would also want to update startnode to add the placement policy to the instance starts in the cases where we have created a placement group (i.e. we wouldn't create them for AWS t3a instances, as they're burstable, or for n1 instances on GCP).
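
A small helper along these lines could make that decision in startnode; the set of excluded shape prefixes below is an example of the policy, not an exhaustive list:

```python
# Illustrative only: which shapes skip placement groups is a policy decision.
NO_PLACEMENT_PREFIXES = ("t3", "t3a", "n1")

def placement_for(instance_type, cluster_id):
    """Return extra launch parameters for this shape, or {} when no group applies."""
    if instance_type.startswith(NO_PLACEMENT_PREFIXES):
        return {}
    return {"Placement": {"GroupName": f"{cluster_id}-{instance_type}"}}
```

startnode could then merge the returned dict into the launch arguments (as in the single-call sketch above).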

Is there already work in progress to support Placement Groups? If not, does my suggested course of action seem reasonable? I can work on this, and offer patches, but I wanted to make sure that the plan seems reasonable to the core team first.

milliams commented 3 years ago

I agree that placement groups will need to be considered alongside high-performance networking.

You have identified the main problem with implementing them, which is the difference in operating model between how Slurm is usually configured (a fixed list of nodes with names and properties) and creating nodes for a particular job (we fake it by using CLOUD nodes, but they must all be defined in advance). Until now this has not been a problem, since all nodes are independent and can work just as well for one job as another.

Are you imagining creating a set of nodes within a placement group for every "HPC" job that is submitted? Would they then be available for reuse by another job or would a new set be created?

If we submit two jobs (each wanting 10 nodes) using the same instance type, would that go to two different placement groups or would it put the new nodes into the existing placement group?

I wonder if Slurm's job-submit plugins could help here? I've played with them in the past and have written a plugin, slurm-job-submit-python, which allows you to write them in Python (rather than C or Lua). These plugins allow you to add any information you want to a job's definition, so they could be used to dynamically add reservations, node lists, constraints, etc. to a job.
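
For illustration, a hook along these lines could tag jobs at submit time; the function signature and job_desc fields here are guesses for the sake of the example and may not match the actual slurm-job-submit-python interface:

```python
# Hypothetical job-submit hook: the signature and field names are illustrative,
# not the real slurm-job-submit-python API.
def job_submit(job_desc, submit_uid):
    """Route jobs flagged as 'hpc' onto nodes of a placed (grouped) shape."""
    if job_desc.get("comment") == "hpc":
        # Constrain the job to a node feature that only placed shapes advertise
        job_desc["features"] = "placement"
    return 0  # zero signals "accept the job" back to Slurm
```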

I haven't started any work on this so I welcome you to start looking into it. I imagine that answers to some of the questions I have raised above will only become clear as the work progresses.

ghost commented 3 years ago

In the "Cloud-Ideal" world, for each "HPC" job, we would create a new placement group, and start nodes in that group. This does require then that an instantiation of an instance is not then re-used between jobs. If you've lots of smaller jobs, this overhead could become problematic and expensive.

The tradeoff option I am suggesting is that we create a placement group per instance type. All nodes of type 'X' will be added to the placement group for type 'X'. As jobs start and stop, Slurm creates and destroys nodes as it normally does.

To specifically answer your questions: