aws / aws-parallelcluster

AWS ParallelCluster is an AWS supported Open Source cluster management tool to deploy and manage HPC clusters in the AWS cloud.
https://github.com/aws/aws-parallelcluster
Apache License 2.0

Slurm memory specification - Main Thread #2198

Closed rexcsn closed 2 years ago

rexcsn commented 4 years ago

Opening this issue as the main thread to collect slurm memory related info/concern/workarounds.

Issue

As mentioned in previously opened issues such as https://github.com/aws/aws-parallelcluster/issues/1517 and https://github.com/aws/aws-parallelcluster/issues/1714, due to changes in slurm, nodes for pcluster>=v2.5.0 are not configured with `RealMemory` information. As a result, ParallelCluster currently does not support memory-based scheduling options in slurm.

Workarounds

For pcluster>=v2.5.0<v2.9.0, the workaround outlined here can be used to configure memory for a cluster containing only 1 compute instance type.

For pcluster>=v2.9.0, multiple queue mode is introduced, and a cluster can now have multiple compute instance types. The old workaround can still be used for a cluster with only 1 compute instance type. Here are the updated instructions on how to configure memory for multiple instance types in pcluster>=v2.9.0:

NodeName=queue1-dy-m54xlarge-[1-10] CPUs=16 State=CLOUD Feature=dynamic,m5.4xlarge RealMemory=60000 ...


* Note that ideally we would just use the `RealMemory` value reported by `/opt/slurm/sbin/slurmd -C`, but `RealMemory` may differ between machines. If the configured `RealMemory` is larger than the actual value seen by `/opt/slurm/sbin/slurmd -C` when a new node launches, slurm will automatically place the node into the `DRAIN` state. To be safe, round the value down.
* In `/opt/slurm/etc/slurm.conf`, change `SelectTypeParameters` from `CR_CPU` to `CR_CPU_Memory`.
* [Optional] pcluster's `clustermgtd` process replaces/terminates `DRAINED` nodes automatically. To disable this behavior and avoid nodes being terminated while you are setting up memory, add `terminate_drain_nodes = False` to the `clustermgtd` configuration file at `/etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf`. Once setup is finished, remove the line or set `terminate_drain_nodes = True` to restore full `clustermgtd` functionality.
* Restart `slurmd` on the compute nodes and `slurmctld` on the head node; memory should then appear as configured in `scontrol show nodes`.
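After the steps above, the relevant files on the head node would contain lines like the following (values illustrative; the `RealMemory` value is rounded down from `slurmd -C` output, and the partition file name follows the pattern mentioned later in this thread):

```
# /opt/slurm/etc/pcluster/slurm_parallelcluster_queue1_partition.conf (illustrative)
NodeName=queue1-dy-m54xlarge-[1-10] CPUs=16 State=CLOUD Feature=dynamic,m5.4xlarge RealMemory=60000

# /opt/slurm/etc/slurm.conf
SelectTypeParameters=CR_CPU_Memory

# /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf (optional, during setup only)
terminate_drain_nodes = False
```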

Further discussion

We understand that the workarounds for this feature may be difficult to set up manually.
Official support for this feature is not currently planned because there is no good way to retrieve `RealMemory` from nodes and configure this information prior to launching the cluster. In addition, there is currently no way for slurm to configure this information for nodes automatically.
We will continue to evaluate ways to add support for this feature.

Thank you!
rkarnik-kymeratx commented 3 years ago

Suggestion for a possible implementation of a scalable workaround: add a per-queue parameter so users can specify `RealMemory` for each queue at configuration time, and use this parameter to set up

/opt/slurm/etc/pcluster/slurm_parallelcluster_<PARTITION_NAME>_partition.conf

accordingly. Am I missing something, or would this not work for some reason? Trying to automate by figuring out RealMemory for each node type seems impossible.
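For illustration, such a per-queue setting might look like the snippet below in the pcluster v2 INI config. Note that `real_memory` is a hypothetical parameter proposed here, not one that ParallelCluster actually supports:

```ini
[queue queue1]
compute_resource_settings = cr1
# Hypothetical parameter: memory (MiB) that ParallelCluster would write into
# slurm_parallelcluster_queue1_partition.conf as RealMemory for this queue
real_memory = 60000
```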

bollig commented 3 years ago

related to https://github.com/dask/dask-jobqueue/issues/497

gearlbrace commented 3 years ago

How does pcluster determine number of CPUs= and Gres= when building its /opt/slurm/etc/pcluster/*partition.conf files?

Can't that same process be applied to determine RealMemory with a conservative value of say... 90% of physical memory for a particular instance type?
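A conservative default could indeed be derived from the instance type's advertised memory, reserving a fixed fraction for the OS and hypervisor. A minimal sketch of that heuristic (the instance-memory table and the 90% factor are illustrative assumptions, not values ParallelCluster uses; real figures would come from the EC2 `DescribeInstanceTypes` API):

```python
# Sketch: derive a conservative RealMemory (MiB) for slurm.conf from the
# advertised memory of an instance type, keeping a safety margin so a node
# never reports less memory than configured (which would DRAIN it).
ADVERTISED_MIB = {
    "m5.4xlarge": 64 * 1024,  # 64 GiB, illustrative
    "c5.xlarge": 8 * 1024,    # 8 GiB, illustrative
}

def conservative_real_memory(instance_type: str, fraction: float = 0.90) -> int:
    """Return `fraction` of the advertised memory, truncated to a whole MiB."""
    return int(ADVERTISED_MIB[instance_type] * fraction)

print(conservative_real_memory("m5.4xlarge"))  # -> 58982
```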

lukeseawalker commented 2 years ago

ParallelCluster 3.2.0 has been released with support for memory-based job scheduling in Slurm. See the release notes: https://github.com/aws/aws-parallelcluster/releases/tag/v3.2.0
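For reference, in the v3 YAML configuration this is enabled per cluster; a minimal sketch (check the 3.2 release notes and docs for the exact key names and semantics, and note the instance type and memory value below are illustrative):

```yaml
Scheduling:
  Scheduler: slurm
  SlurmSettings:
    EnableMemoryBasedScheduling: true
  SlurmQueues:
    - Name: queue1
      ComputeResources:
        - Name: m5-4xlarge
          InstanceType: m5.4xlarge
          # Optional override of the memory (MiB) Slurm may schedule on this
          # compute resource; otherwise a default based on the instance type is used.
          SchedulableMemory: 60000
```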