aws / aws-parallelcluster

AWS ParallelCluster is an AWS supported Open Source cluster management tool to deploy and manage HPC clusters in the AWS cloud.
https://github.com/aws/aws-parallelcluster
Apache License 2.0

Slurm memory specification - Main Thread #2198

Closed rexcsn closed 2 years ago

rexcsn commented 4 years ago

Opening this issue as the main thread to collect slurm memory related info/concern/workarounds.

Issue

As mentioned in previously opened issues such as https://github.com/aws/aws-parallelcluster/issues/1517 and https://github.com/aws/aws-parallelcluster/issues/1714, due to changes in slurm, nodes for pcluster>=v2.5.0 are not configured with `RealMemory` information. As a result, ParallelCluster currently does not support memory-based scheduling options in slurm.

Workarounds

For pcluster>=v2.5.0<v2.9.0, the workaround outlined here can be used to configure memory for a cluster containing only 1 compute instance type.

For pcluster>=v2.9.0, multiple queue mode is introduced, and a cluster can now have multiple compute instance types. The old workaround can still be used for a cluster with only 1 compute instance type. Here are the updated instructions on how to configure memory for multiple instance types in pcluster>=v2.9.0:

NodeName=queue1-dy-m54xlarge-[1-10] CPUs=16 State=CLOUD Feature=dynamic,m5.4xlarge RealMemory=60000 ...


* Note that ideally we would just use the `RealMemory` value reported by `/opt/slurm/sbin/slurmd -C`, but `RealMemory` may differ between machines. If the configured `RealMemory` is larger than the actual value seen by `/opt/slurm/sbin/slurmd -C` when a new node launches, slurm will automatically place the node into the `DRAIN` state. To be safe, round the value down.
* In `/opt/slurm/etc/slurm.conf`, change `SelectTypeParameters` from `CR_CPU` to `CR_CPU_Memory`.
* [Optional] pcluster's `clustermgtd` process replaces/terminates `DRAINED` nodes automatically. To disable this behavior and avoid nodes being terminated while you are setting up memory, add `terminate_drain_nodes = False` to the `clustermgtd` configuration file at `/etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf`. Once setup is finished, remove the line or set `terminate_drain_nodes = True` to restore full `clustermgtd` functionality.
* Restart `slurmd` on the compute nodes and `slurmctld` on the head node; memory should then appear as configured in `scontrol show nodes`.
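After the steps above, the relevant files on the head node would contain lines like the following (values illustrative; the `RealMemory` value is rounded down from `slurmd -C` output, and the partition file name follows the pattern mentioned later in this thread):

```
# /opt/slurm/etc/pcluster/slurm_parallelcluster_queue1_partition.conf (illustrative)
NodeName=queue1-dy-m54xlarge-[1-10] CPUs=16 State=CLOUD Feature=dynamic,m5.4xlarge RealMemory=60000

# /opt/slurm/etc/slurm.conf
SelectTypeParameters=CR_CPU_Memory

# /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf (optional, during setup only)
terminate_drain_nodes = False
```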

Further discussion

We understand that the workarounds for this feature may be difficult to set up manually.
Official support for this feature is not currently planned because there is no good way to retrieve `RealMemory` from nodes and configure this information prior to launching the cluster. In addition, there is currently no way for slurm to configure this information for nodes automatically.
We will continue to evaluate ways to add support for this feature.

Thank you!
rkarnik-kymeratx commented 3 years ago

Suggestion for a possible implementation of a scalable workaround: add a per-queue parameter so users can specify `RealMemory` for each queue at configuration time, and use this parameter to set up

/opt/slurm/etc/pcluster/slurm_parallelcluster_<PARTITION_NAME>_partition.conf

accordingly. Am I missing something, or would this not work for some reason? Trying to automate by figuring out RealMemory for each node type seems impossible.
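For illustration, such a per-queue setting might look like the snippet below in the pcluster v2 INI config. Note that `real_memory` is a hypothetical parameter proposed here, not one that ParallelCluster actually supports:

```ini
[queue queue1]
compute_resource_settings = cr1
# Hypothetical parameter: memory (MiB) that ParallelCluster would write into
# slurm_parallelcluster_queue1_partition.conf as RealMemory for this queue
real_memory = 60000
```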

bollig commented 3 years ago

related to https://github.com/dask/dask-jobqueue/issues/497

gearlbrace commented 3 years ago

How does pcluster determine number of CPUs= and Gres= when building its /opt/slurm/etc/pcluster/*partition.conf files?

Can't that same process be applied to determine RealMemory with a conservative value of say... 90% of physical memory for a particular instance type?
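A conservative default could indeed be derived from the instance type's advertised memory, reserving a fixed fraction for the OS and hypervisor. A minimal sketch of that heuristic (the instance-memory table and the 90% factor are illustrative assumptions, not values ParallelCluster uses; real figures would come from the EC2 `DescribeInstanceTypes` API):

```python
# Sketch: derive a conservative RealMemory (MiB) for slurm.conf from the
# advertised memory of an instance type, keeping a safety margin so a node
# never reports less memory than configured (which would DRAIN it).
ADVERTISED_MIB = {
    "m5.4xlarge": 64 * 1024,  # 64 GiB, illustrative
    "c5.xlarge": 8 * 1024,    # 8 GiB, illustrative
}

def conservative_real_memory(instance_type: str, fraction: float = 0.90) -> int:
    """Return `fraction` of the advertised memory, truncated to a whole MiB."""
    return int(ADVERTISED_MIB[instance_type] * fraction)

print(conservative_real_memory("m5.4xlarge"))  # -> 58982
```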

lukeseawalker commented 2 years ago

ParallelCluster 3.2.0 has been released with support for memory-based job scheduling in Slurm. See the release notes: https://github.com/aws/aws-parallelcluster/releases/tag/v3.2.0
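For reference, in the v3 YAML configuration this is enabled per cluster; a minimal sketch (check the 3.2 release notes and docs for the exact key names and semantics, and note the instance type and memory value below are illustrative):

```yaml
Scheduling:
  Scheduler: slurm
  SlurmSettings:
    EnableMemoryBasedScheduling: true
  SlurmQueues:
    - Name: queue1
      ComputeResources:
        - Name: m5-4xlarge
          InstanceType: m5.4xlarge
          # Optional override of the memory (MiB) Slurm may schedule on this
          # compute resource; otherwise a default based on the instance type is used.
          SchedulableMemory: 60000
```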