aws / aws-parallelcluster

AWS ParallelCluster is an AWS supported Open Source cluster management tool to deploy and manage HPC clusters in the AWS cloud.
https://github.com/aws/aws-parallelcluster
Apache License 2.0

Cluster with an external Slurmdbd accounting #6362

Open mclouds2020 opened 1 month ago

mclouds2020 commented 1 month ago

By following the link below, we were able to set up the external Slurm accounting database and a cluster: https://docs.aws.amazon.com/parallelcluster/latest/ug/external-slurmdb-accounting.html

With the external Slurm database, many Slurm clusters can talk to a single database, and there is no need to run the slurmdbd service on every head node. We would like to know whether there are any other advantages to using this functionality.

We would also like to know whether the following use case can be achieved with the external Slurm database.

Assume a cluster named Slurmdemo existed and was deleted for some reason, and a new cluster is then created with the same name and pointed at the same external database. In this case, we want the job IDs of the new cluster to continue from where the previous cluster's job IDs left off.

We tested this use case and it does not work: the new cluster's job IDs always start from 1.

hanwen-pcluste commented 1 month ago

Sorry for the late reply!

Slurm recommends using a single Slurmdbd daemon to communicate with the database. The external Slurmdbd feature therefore moves the Slurmdbd daemon off the head nodes and onto a single instance, so the database communicates with only one Slurmdbd daemon, which in turn serves multiple clusters. This improvement is aimed at keeping the database consistent; it provides no other functional advantages.
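To make the division of labor concrete, here is a toy Python model. It is not Slurm code, and the class names ToySlurmdbd and ToySlurmctld are hypothetical. It assumes a simplified picture in which each cluster's controller assigns job IDs from its own local counter and reports each job to the one shared accounting daemon, which keeps records separated by cluster name. Under that assumption, a recreated cluster with fresh controller state starts counting from its first job ID again, matching the behavior observed above.

```python
from dataclasses import dataclass, field


@dataclass
class ToySlurmdbd:
    """Toy stand-in for the single shared accounting daemon.

    Records are namespaced by cluster name; the daemon never assigns job IDs.
    """
    records: dict[str, list[int]] = field(default_factory=dict)

    def record_job(self, cluster: str, job_id: int) -> None:
        self.records.setdefault(cluster, []).append(job_id)


@dataclass
class ToySlurmctld:
    """Toy stand-in for one cluster's controller (hypothetical simplification).

    Job IDs come from a local counter that starts at first_job_id; a brand-new
    controller has no memory of IDs used by an earlier cluster of the same name.
    """
    cluster: str
    dbd: ToySlurmdbd
    first_job_id: int = 1
    next_job_id: int = field(init=False)

    def __post_init__(self) -> None:
        self.next_job_id = self.first_job_id

    def submit(self) -> int:
        job_id = self.next_job_id
        self.next_job_id += 1
        self.dbd.record_job(self.cluster, job_id)  # report to the shared daemon
        return job_id


dbd = ToySlurmdbd()
old = ToySlurmctld("slurmdemo", dbd)
old.submit(); old.submit()                # old cluster uses IDs 1 and 2
recreated = ToySlurmctld("slurmdemo", dbd)
print(recreated.submit())                 # fresh counter, prints 1 again
print(dbd.records["slurmdemo"])           # prints [1, 2, 1]
```

In this model, passing first_job_id=3 to the recreated controller is what makes its IDs continue after the old cluster's, which is the role FirstJobId plays in slurm.conf.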

We are not aware of a way to continue job IDs automatically across clusters. However, you can set FirstJobId in slurm.conf to customize the first job ID, and ParallelCluster supports slurm.conf customization.
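As a rough sketch of that workaround, the helper below computes a FirstJobId that continues after the highest job ID already recorded for the old cluster, then renders a cluster-config fragment. Several parts are assumptions: the sacct query mentioned in the docstring (its flags exist, but check them against your Slurm version and pick a suitable start date), the YAML layout using CustomSlurmSettings (verify against your ParallelCluster version, which may also restrict some parameters), and the MaxJobId default of 67043328 (adjust if your slurm.conf sets it). The function names are hypothetical.

```python
def next_first_job_id(job_id_lines, max_job_id=67043328):
    """Return a FirstJobId that continues after the highest recorded job ID.

    job_id_lines: iterable of strings with one job ID per line, for example
    the output of a query such as
        sacct -M slurmdemo -a -X -n -P -o JobIDRaw -S <start-date>
    (assumed query; adjust it to your accounting setup).
    """
    highest = 0
    for line in job_id_lines:
        # Drop step suffixes such as ".batch" and ignore blank or odd lines.
        token = line.strip().split(".", 1)[0]
        if token.isdigit():
            highest = max(highest, int(token))
    first = highest + 1
    if first >= max_job_id:
        raise ValueError(f"FirstJobId {first} would reach MaxJobId {max_job_id}")
    return first


def custom_slurm_settings_fragment(first_job_id):
    """Render a cluster-config fragment (assumed layout; verify for your version)."""
    return (
        "Scheduling:\n"
        "  Scheduler: slurm\n"
        "  SlurmSettings:\n"
        "    CustomSlurmSettings:\n"
        f"      - FirstJobId: {first_job_id}\n"
    )


sample = ["1201", "1202", "1203.batch", ""]
first = next_first_job_id(sample)
print(first)  # → 1204
print(custom_slurm_settings_fragment(first))
```

Because the controller only reads FirstJobId when it has no saved job-ID state, the setting would need to be in place when the new cluster is created, before its first job is submitted.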