aws / aws-parallelcluster

AWS ParallelCluster is an AWS supported Open Source cluster management tool to deploy and manage HPC clusters in the AWS cloud.
https://github.com/aws/aws-parallelcluster
Apache License 2.0

p4d instance not able to run job with pcluster 3.11.1 #6549

Open QuintenSchrevens opened 2 weeks ago

QuintenSchrevens commented 2 weeks ago

Issue: Job Stuck on p4d Compute Node

Required Information

Cluster Configuration (Sensitive information omitted)

HeadNode:
  InstanceType: c5.large
  Networking:
    SubnetId: subnet-xxxxxxxxxx
    AdditionalSecurityGroups:
      - sg-xxxxxxxxxxxxxx
  LocalStorage:
    RootVolume:
      VolumeType: gp3
      Size: 200
  Iam:
    AdditionalIamPolicies:
      - Policy: arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
      - Policy: arn:aws:iam::aws:policy/AmazonS3FullAccess
      - Policy: arn:aws:iam::aws:policy/AWSCloudFormationReadOnlyAccess
      - Policy: arn:aws:iam::xxxxxxxxxxxxxxxxxxx

Scheduling:
  Scheduler: slurm
  SlurmSettings:
    MungeKeySecretArn: xxxxxxxxxxxxx
  SlurmQueues:
    - Name: a100
      ComputeResources:
        - Name: p4d
          Instances:
            - InstanceType: p4d.24xlarge
          MinCount: 0
          MaxCount: 5
      Networking:
        SubnetIds:
          - subnet-xxxxxxxxxxxx
        AdditionalSecurityGroups:
          - sg-xxxxxxxxxxxxx
      ComputeSettings:
        LocalStorage:
          RootVolume:
            VolumeType: gp3
            Size: 200
      Iam:
        AdditionalIamPolicies:
          - Policy: arn:aws:iam::aws:policy/AWSCloudFormationReadOnlyAccess

Image:
  Os: alinux2023
  CustomAmi: ami-xxxxxxxxxxxxx

Bug Description

When attempting to run a job on any p4d compute node, the job becomes stuck in the Slurm queue, remaining in a pending/configuring state until it times out and retries. This happens even when no special boot scripts are configured. I also did not see anything unusual in the CloudWatch dashboard logs or on the node itself.
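
For reference, the stuck state can be observed from the head node with standard Slurm commands (the node name below is illustrative; ParallelCluster names dynamic nodes as <queue>-dy-<compute-resource>-<n>):

# Job stays in the CF (configuring) state instead of moving to R:
squeue
# Node states for the a100 partition; a trailing # means the node is powering up:
sinfo -p a100
# Per-node details, including the reason for any failure:
scontrol show node a100-dy-p4d-1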

Potential Cause

This issue may be connected to the increased node configuration times introduced in version 3.11.1, as reported in GitHub issue #6479. The longer setup duration may be affecting the readiness or responsiveness of p4d compute nodes in Slurm, causing jobs to remain stuck in the CF state.

Temporary Fix

Downgrading to AWS ParallelCluster version 3.10.1 resolves the issue, allowing jobs to run on p4d instances without getting stuck. This suggests that the issue may be related to changes introduced in version 3.11.1.
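
For completeness, a minimal sketch of the downgrade, assuming the pcluster CLI was installed with pip (a clean virtual environment is recommended):

pip install "aws-parallelcluster==3.10.1"
pcluster version

Note that the affected cluster then needs to be recreated with the downgraded CLI, since the CLI version in use at creation time determines the cluster's internal components.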

Steps to Reproduce

  1. Configure a cluster with a p4d compute node, as shown in the provided YAML configuration.
  2. Attempt to submit a job to the p4d node (for example, with the minimal submission shown below).
  3. Observe that the job remains in the Slurm queue, ultimately timing out and retrying without successfully executing.
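
A minimal submission that reproduces the hang (the partition name matches the queue name from the configuration above; the job itself is arbitrary):

# Request one p4d node in the a100 partition and run a trivial GPU check:
sbatch -p a100 -N 1 --wrap "nvidia-smi"
# The job is allocated a node but stays in CF while the node bootstraps:
squeue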

Expected Behavior

The job should execute on the p4d compute node without getting stuck in the CF state.

Request

Are there any known issues with p4d instances on AWS ParallelCluster version 3.11.1, or are there specific configurations required to support job execution on p4d nodes?

gmarciani commented 1 week ago

Hi @QuintenSchrevens, thank you for reporting this problem.

We are taking a look at it and will post an update here soon. We observed that downgrading the NVIDIA drivers to version 535.183.01 solves the problem. Would this be a viable solution for you?

Thank you.
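
For anyone needing to apply the suggested downgrade in the meantime, a minimal sketch of a script that could be run on a compute node, e.g. via an OnNodeConfigured custom action (the runfile URL pattern is an assumption; verify it against NVIDIA's Tesla driver downloads):

#!/bin/bash
# Sketch: install NVIDIA driver 535.183.01 in place of the bundled driver.
# The download URL pattern is an assumption; confirm it before use.
set -euo pipefail
DRIVER_VERSION="535.183.01"
RUNFILE="NVIDIA-Linux-x86_64-${DRIVER_VERSION}.run"
curl -fsSL -O "https://us.download.nvidia.com/tesla/${DRIVER_VERSION}/${RUNFILE}"
chmod +x "${RUNFILE}"
# Non-interactive install; replaces the existing kernel module and user-space libs.
sudo "./${RUNFILE}" --silent
# Confirm the driver version that GPU jobs will see:
nvidia-smi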

gmarciani commented 1 day ago

Issue and mitigation: https://github.com/aws/aws-parallelcluster/issues/6571