aws / aws-parallelcluster

AWS ParallelCluster is an AWS supported Open Source cluster management tool to deploy and manage HPC clusters in the AWS cloud.
https://github.com/aws/aws-parallelcluster
Apache License 2.0

Slurm job enters BadConstraints after spot node is preempted #5731

Open JosephDVL opened 11 months ago

JosephDVL commented 11 months ago

Required Info:

```
Region: us-east-1
Image:
  Os: alinux2
SharedStorage:
  - Name: custom1
    StorageType: Ebs
    MountDir: shared
    EbsSettings:
      Size: $ebs_volume_size
HeadNode:
  InstanceType: t3.large
  Networking:
    SubnetId: $master_subnet_id
    ElasticIp: false
  Ssh:
    KeyName:
    AllowedIps: 
  LocalStorage:
    RootVolume:
      Size: 40
  CustomActions:
    OnNodeConfigured:
      Script: s3://.../cluster_init3.sh
Scheduling:
  Scheduler: slurm
  SlurmSettings:
    ScaledownIdletime: 20
  SlurmQueues:
  - Name: queue1
    ComputeResources:
    - Name: default-resource
      SpotPrice: $spot_price
      MaxCount: 200
      Instances:
        - InstanceType: c7i.large
        - InstanceType: c6i.large
        - InstanceType: c5.large
    AllocationStrategy: capacity-optimized
    ComputeSettings:
      LocalStorage:
        RootVolume:
          Size: 40
    CapacityType: SPOT
    CustomActions:
      OnNodeConfigured:
        Script: s3://.../cluster_init3.sh
    Networking:
      SubnetIds:
        - $compute_subnet_id
Monitoring:
  Dashboards:
    CloudWatch:
      Enabled: False
```
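
For reference, a configuration like the one above would typically be deployed with the ParallelCluster 3.x CLI, roughly as sketched below; the configuration file name is hypothetical, while the cluster name and region match the describe-cluster output that follows.

```
# Hedged sketch: deploying the configuration above with the v3 CLI.
# "multi-01-config.yaml" is a hypothetical file name; the cluster name and
# region are taken from the describe-cluster output below.
pcluster create-cluster \
  --cluster-name multi-01 \
  --cluster-configuration multi-01-config.yaml \
  --region us-east-1
```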

describe-cluster

```
{
  "creationTime": "2023-09-29T16:15:44.319Z",
  "headNode": {
    "launchTime": "2023-09-29T16:20:20.000Z",
    "instanceId": "i-0f3fde39...",
    "publicIpAddress": "52.23....",
    "instanceType": "t3.large",
    "state": "running",
    "privateIpAddress": "172.31...."
  },
  "version": "3.7.1",
  "clusterConfiguration": {
    "url": "https://parallelcluster-...-v1-do-not-delete.s3.amazonaws.com/parallelcluster/3.7.1/clusters/..."
  },
  "tags": [
    {
      "value": "3.7.1",
      "key": "parallelcluster:version"
    },
    {
      "value": "multi-01",
      "key": "parallelcluster:cluster-name"
    }
  ],
  "cloudFormationStackStatus": "CREATE_COMPLETE",
  "clusterName": "multi-01",
  "computeFleetStatus": "RUNNING",
  "cloudformationStackArn": "arn:aws:cloudformation:us-east-1:...:stack/multi-01/",
  "lastUpdatedTime": "2023-09-29T16:15:44.319Z",
  "region": "us-east-1",
  "clusterStatus": "CREATE_COMPLETE",
  "scheduler": {
    "type": "slurm"
  }
}
```

Bug description and how to reproduce: When running a cluster with Spot instances (CapacityType: SPOT), Slurm jobs can enter a BadConstraints state after a node preemption. This behavior has been observed in both ParallelCluster 2.11.x and 3.7.1.

We are submitting embarrassingly parallel jobs in an automated fashion such that every job has the same submission process and requirements (a rough sketch of the submission pattern follows the squeue output below). Occasionally, a Slurm job will enter a BadConstraints state after a node is preempted. For example:

```
# squeue | egrep 24[04]
               240    queue1 both_inn ec2-user PD       0:00      1 (BadConstraints)
               244    queue1 both_inn ec2-user  R 1-23:48:56      1 queue1-dy-default-resource-73
```
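
As a point of reference, the automated submissions look roughly like the sketch below; the script path, argument, and loop bound are hypothetical and only illustrate that every job is submitted with identical options.

```
# Hedged sketch of the submission pattern (script path, argument, and loop
# bound are hypothetical); every job uses the same partition and resources.
for i in $(seq 1 200); do
  sbatch --partition=queue1 --ntasks=1 --cpus-per-task=1 \
         /shared/both_inner_loop.sh "$i"
done
```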

scontrol doesn't show much difference:

```
# scontrol show jobid=240
JobId=240 JobName=both_inner_loop
   UserId=ec2-user(1000) GroupId=ec2-user(1000) MCS_label=N/A
   Priority=0 Nice=0 Account=(null) QOS=(null)
   JobState=PENDING Reason=BadConstraints FailedNode=queue1-dy-default-resource-142 Dependency=(null)
   Requeue=1 Restarts=1 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2023-09-29T21:26:18 EligibleTime=2023-09-29T21:26:18
   AccrueTime=2023-09-29T21:26:18
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-10-02T19:38:53 Scheduler=Main
   Partition=queue1 AllocNode:Sid=ip-172-31-17-218:14496
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=queue1-dy-default-resource-142
   BatchHost=queue1-dy-default-resource-142
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=1,mem=3891M,node=1,billing=1
   AllocTRES=(null)
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=/shared/
   WorkDir=/shared/
   StdErr=/shared/
   StdIn=/dev/null
   StdOut=/shared/
   Power=
```

```
# scontrol show jobid=244
JobId=244 JobName=both_inner_loop
   UserId=ec2-user(1000) GroupId=ec2-user(1000) MCS_label=N/A
   Priority=4294901516 Nice=0 Account=(null) QOS=(null)
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=1-23:49:15 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2023-09-29T21:26:33 EligibleTime=2023-09-29T21:26:33
   AccrueTime=2023-09-29T21:26:33
   StartTime=2023-09-30T21:40:16 EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-09-30T21:40:16 Scheduler=Main
   Partition=queue1 AllocNode:Sid=ip-172-31-17-218:14496
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=queue1-dy-default-resource-73
   BatchHost=queue1-dy-default-resource-73
   NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=1,mem=3891M,node=1,billing=1
   AllocTRES=cpu=2,node=1,billing=2
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=/shared/
   WorkDir=/shared/
   StdErr=/shared/
   StdIn=/dev/null
   StdOut=/shared/
   Power=
```

Looking at /var/log/slurmctld.log:

```
# egrep 24[04] slurmctld.log
[2023-09-29T21:26:18.710] _slurm_rpc_submit_batch_job: JobId=240 InitPrio=4294901520 usec=698
[2023-09-29T21:26:33.724] _slurm_rpc_submit_batch_job: JobId=244 InitPrio=4294901516 usec=581
[2023-09-30T20:50:10.734] sched: Allocate JobId=240 NodeList=queue1-dy-default-resource-142 #CPUs=2 Partition=queue1
[2023-09-30T21:40:16.250] sched: Allocate JobId=244 NodeList=queue1-dy-default-resource-73 #CPUs=2 Partition=queue1
[2023-10-01T01:36:13.873] requeue job JobId=240 due to failure of node queue1-dy-default-resource-142
[2023-10-01T01:40:06.870] cleanup_completing: JobId=240 completion process took 218 seconds
[2023-10-01T01:40:51.948] _pick_best_nodes: JobId=240 never runnable in partition queue1
[2023-10-01T01:40:51.948] sched: schedule: JobId=240 non-runnable: Requested node configuration is not available
[2023-10-01T07:23:53.189] _pick_best_nodes: JobId=240 never runnable in partition queue1
[2023-10-01T07:23:53.189] sched: schedule: JobId=240 non-runnable: Requested node configuration is not available
```

This last message repeats every time the scheduler runs, so the job never starts again.
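
One way to see what the scheduler is objecting to (a hedged diagnostic sketch, using the job id and partition from the output above) is to compare the requeued job's node/CPU request against what the nodes in the partition currently offer:

```
# Hedged diagnostic sketch: compare the requeued job's request (JobId 240,
# from the output above) with the nodes in partition queue1.
scontrol show job 240 | grep -E 'NumNodes|NumCPUs|ReqTRES|AllocTRES'
sinfo -p queue1 -o '%N %c %m %t'
```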

Corresponding messages in /var/log/parallelcluster/clustermgtd:

```
2023-10-01 01:36:13,811 - [slurm_plugin.slurm_resources:is_backing_instance_valid] - WARNING - Node state check: no corresponding instance in EC2 for node queue1-dy-default-resource-142(172.31.36.119), node state: ALLOCATED+CLOUD
2023-10-01 01:36:13,814 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Found the following unhealthy dynamic nodes: (x1) ['queue1-dy-default-resource-142(172.31.36.119)']
```

The job was configured in a way that allowed it to start once. However, after a preemption, I can't find a way to clear the BadConstraints state so that the scheduler will run the job again. The only fix we've been able to get to work is to scancel the job and resubmit a similar one (see the sketch below). Meanwhile, newly submitted jobs get nodes provisioned and run successfully.
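
For anyone hitting the same thing, here is a hedged sketch of how the stuck jobs can be found before cancelling and resubmitting them; the resubmission step depends on the original batch script, which is omitted here.

```
# Hedged sketch: list pending jobs stuck with reason BadConstraints so they
# can be scancel'ed and resubmitted; the resubmit step is site-specific.
squeue --states=PD --format='%i %j %r' | awk '$3 == "BadConstraints" {print $1}'
# For each job id printed: scancel <jobid>, then resubmit the original script.
```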

JosephDVL commented 11 months ago

I've found a reference to an old Slurm mailing list post (https://groups.google.com/g/slurm-users/c/kshbXbqpEIY/m/nJcTyVQiIAAJ) which seems to address the issue. The fix in the e-mail does work for this case:

```
scontrol update jobid=240 NumNodes=1-1
```

Per the e-mail, if the job script includes `-N 1`, everything works correctly.
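
For completeness, a hedged sketch of what pinning the node count in the batch script looks like; only the `#SBATCH -N 1` line is the point, and the job name, other flags, and command are hypothetical placeholders.

```
#!/bin/bash
# Hedged sketch: pin the node count so a requeued job keeps a node range
# the partition can still satisfy. Only the -N/--nodes line is the point;
# the job name, other flags, and command are hypothetical.
#SBATCH --job-name=both_inner_loop
#SBATCH -N 1                 # same as --nodes=1
#SBATCH -n 1                 # one task
#SBATCH --cpus-per-task=1
srun ./inner_loop_step       # hypothetical command
```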