Open JosephDVL opened 11 months ago
I've found a reference to an old Slurm mailing list post (https://groups.google.com/g/slurm-users/c/kshbXbqpEIY/m/nJcTyVQiIAAJ) which seems to address the issue. The fix in the e-mail does work for this case:
```
scontrol update jobid=240 NumNodes=1-1
```
Per the e-mail, if the job script includes `-N 1`, everything works correctly.
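A minimal sketch of a submission script with the node count pinned this way (the job name and payload below are placeholders, not our actual workload):
```
#!/bin/bash
#SBATCH --job-name=example-task   # placeholder name
#SBATCH -N 1                      # pin the job to exactly one node, per the mailing-list suggestion
#SBATCH -n 1                      # one task per job in our embarrassingly parallel setup

srun ./my_task                    # placeholder for the actual work item
```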
Required Info:
config.yaml
pcluster describe-cluster
``` { "creationTime": "2023-09-29T16:15:44.319Z", "headNode": { "launchTime": "2023-09-29T16:20:20.000Z", "instanceId": "i-0f3fde39...", "publicIpAddress": "52.23....", "instanceType": "t3.large", "state": "running", "privateIpAddress": "172.31...." }, "version": "3.7.1", "clusterConfiguration": { "url": "https://parallelcluster-...-v1-do-not-delete.s3.amazonaws.com/parallelcluster/3.7.1/clusters/..." }, "tags": [ { "value": "3.7.1", "key": "parallelcluster:version" }, { "value": "multi-01", "key": "parallelcluster:cluster-name" } ], "cloudFormationStackStatus": "CREATE_COMPLETE", "clusterName": "multi-01", "computeFleetStatus": "RUNNING", "cloudformationStackArn": "arn:aws:cloudformation:us-east-1:...:stack/multi-01/", "lastUpdatedTime": "2023-09-29T16:15:44.319Z", "region": "us-east-1", "clusterStatus": "CREATE_COMPLETE", "scheduler": { "type": "slurm" } } ```
Bug description and how to reproduce: When running a cluster with SPOT instances, Slurm jobs can enter a BadConstraints state after a node preemption. This behavior has been observed in both PCluster 2.11.x and 3.7.1.
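For reference, jobs stuck this way stay pending with BadConstraints as the scheduling reason, so they can be listed with something like:
```
# Pending jobs with their pending reason; affected jobs show "BadConstraints"
squeue --states=PENDING --format="%.10i %.9P %.20j %.8T %r"
```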
We are submitting embarrassingly parallel jobs in an automated fashion, such that every job has the same submission process and requirements. Occasionally, a Slurm job will enter a BadConstraints state after a node is preempted. For example:
scontrol doesn't show much difference:
Looking at /var/log/slurmctld.log:
The last message repeats each time the scheduler is run such that the job never runs.
Corresponding messages in /var/log/parallelcluster/clustermgtd:
The job was configured in a way that allowed it to start once. However, after a preemption, I can't find a way to clear the BadConstraints state from the job so that the scheduler will run it again. The only fix we've been able to get to work is to scancel the job and resubmit a similar one. Meanwhile, newly submitted jobs are able to get nodes provisioned and run successfully.
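As a stopgap, something like the sketch below could apply the same NumNodes workaround to every pending job currently held with the BadConstraints reason (untested, and it assumes all affected jobs are single-node jobs like ours):
```
#!/bin/bash
# Find pending jobs whose scheduling reason is BadConstraints and reset their
# node count to exactly one, mirroring the manual "scontrol update" fix above.
for jobid in $(squeue -h --states=PENDING --format="%i %r" | awk '$2 == "BadConstraints" {print $1}'); do
    echo "Resetting node count for job ${jobid}"
    scontrol update jobid="${jobid}" NumNodes=1-1
done
```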