Hi @QuintenSchrevens, thank you for reporting this problem.
We are taking a look at it and will post an update here soon. We observed that downgrading the NVIDIA drivers to version 535.183.01 solves the problem. Would this be a viable solution for you?
Thank you.
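For anyone wanting to confirm whether a node is on the suggested driver version before rebuilding, a minimal diagnostic sketch along these lines could be run on a compute node. It shells out to nvidia-smi (present wherever the NVIDIA driver is installed); the "known good" version is the one named in the comment above, and treating anything else as suspect is an assumption, not an official compatibility statement.

```python
#!/usr/bin/env python3
"""Check whether this node runs the suggested NVIDIA driver version."""
import subprocess

KNOWN_GOOD = "535.183.01"  # version suggested in the maintainer comment above

def driver_version() -> str:
    # nvidia-smi can report just the driver version in headerless CSV form.
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    # All GPUs on a node share one driver; the first line is enough.
    return out.stdout.strip().splitlines()[0]

if __name__ == "__main__":
    version = driver_version()
    status = "matches" if version == KNOWN_GOOD else "differs from"
    print(f"NVIDIA driver {version} ({status} known-good {KNOWN_GOOD})")
```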
Issue and mitigation: https://github.com/aws/aws-parallelcluster/issues/6571
Issue: Job Stuck on p4d Compute Node
Required Information
Cluster name: test-cluster
Region: eu-west-1
Cluster Configuration (Sensitive information omitted)
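Since the actual configuration is omitted, a small sketch like the following can help narrow a report like this to the affected queues. It is hypothetical: the file name cluster-config.yaml is a placeholder, and the keys follow the ParallelCluster 3.x schema (Scheduling -> SlurmQueues -> ComputeResources), not the reporter's real config.

```python
#!/usr/bin/env python3
"""List Slurm queues and instance types from a ParallelCluster 3.x config."""
import yaml  # pip install pyyaml

with open("cluster-config.yaml") as f:  # placeholder path
    config = yaml.safe_load(f)

for queue in config.get("Scheduling", {}).get("SlurmQueues", []):
    for resource in queue.get("ComputeResources", []):
        # The p4d queues are the ones exhibiting the stuck-job behavior.
        print(f"queue={queue['Name']} resource={resource['Name']} "
              f"type={resource.get('InstanceType')}")
```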
Bug Description
When attempting to run a job on any p4d compute node, the job becomes stuck in the Slurm CF (configuring) status, remaining in a pending state until it times out and retries. This happens even when no special boot scripts are configured, and I did not see anything unusual in the CloudWatch dashboard logs or on the machine itself.
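For triage, something along these lines can be run on the head node to list jobs sitting in the CF state and how long each has been there. It is a minimal sketch using standard Slurm commands; the 30-second poll interval is an arbitrary placeholder.

```python
#!/usr/bin/env python3
"""Periodically list Slurm jobs stuck in the CF (configuring) state."""
import subprocess
import time

while True:
    # %i = job id, %j = job name, %M = time used, %N = allocated nodes
    out = subprocess.run(
        ["squeue", "-h", "--states=CF", "-o", "%i %j %M %N"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    if out:
        print(f"--- jobs in CF state at {time.strftime('%H:%M:%S')} ---")
        print(out)
    time.sleep(30)  # arbitrary poll interval
```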
Replacing the p4d.24xlarge instance type with g5.24xlarge resolves the issue, indicating the problem may be specific to p4d instances.
Potential Cause
This issue may be connected to the increased configuration times introduced in version 3.11.1, as reported in GitHub issue #6479. The longer setup duration might be impacting the readiness or responsiveness of p4d compute nodes in Slurm, causing jobs to remain in a CF state.
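If slow configuration is indeed the cause, the time-out-and-retry behavior would be consistent with node bring-up exceeding Slurm's ResumeTimeout, after which nodes are marked down and jobs requeued. A sketch like the following, run on the head node, prints the timeouts relevant to that hypothesis; which values actually govern the failure here is an assumption.

```python
#!/usr/bin/env python3
"""Print Slurm timeout settings relevant to cloud node bring-up."""
import subprocess

RELEVANT = ("ResumeTimeout", "SuspendTimeout", "SlurmdTimeout")

config = subprocess.run(
    ["scontrol", "show", "config"],
    capture_output=True, text=True, check=True,
).stdout

for line in config.splitlines():
    # Lines look like "ResumeTimeout           = 1800 sec"
    key = line.split("=")[0].strip()
    if key in RELEVANT:
        print(line.strip())
```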
Temporary Fix
Downgrading to AWS ParallelCluster version 3.10.1 resolves the issue, allowing jobs to run on p4d instances without getting stuck. This suggests that the issue may be related to changes introduced in version 3.11.1.
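After downgrading, a quick sanity check that the CLI is really on the working release could look like this. It assumes the CLI was installed via pip (package aws-parallelcluster) and that `pcluster version` emits JSON with a "version" field, as it does in the 3.x series.

```python
#!/usr/bin/env python3
"""Confirm the installed ParallelCluster CLI version after a downgrade."""
import json
import subprocess

out = subprocess.run(
    ["pcluster", "version"], capture_output=True, text=True, check=True,
)
version = json.loads(out.stdout)["version"]
print(f"pcluster CLI version: {version}")
if version != "3.10.1":
    print("Note: not the 3.10.1 release used as the temporary fix.")
```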
Steps to Reproduce
Expected Behavior
The job should execute on the p4d compute node without getting stuck in CF status.
Request
Are there known issues with p4d instances on AWS ParallelCluster version 3.11.1, or are specific configurations required to support job execution on p4d nodes?