Open MustaphaU opened 5 months ago
I was able to get to this point where the head node and compute nodes were created by setting ElasticIp and AssignPublicIp to true in the configuration file. However, the cluster creation still fails.
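For context, here is roughly where those two settings live in a ParallelCluster v3 config (the subnet IDs, key name, and instance types below are placeholders, not my actual values):

HeadNode:
  InstanceType: t3.medium            # placeholder
  Networking:
    SubnetId: subnet-aaaa1111        # placeholder subnet for the head node
    ElasticIp: true                  # attach an Elastic IP to the head node
  Ssh:
    KeyName: my-key-pair             # placeholder key pair
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: gpu-queue
      Networking:
        SubnetIds:
          - subnet-bbbb2222          # placeholder subnet for the compute nodes
        AssignPublicIp: true         # give compute nodes public IPs
      ComputeResources:
        - Name: g5
          InstanceType: g5.2xlarge
          MinCount: 0
          MaxCount: 4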
Not resolved yet. @hoai
@MustaphaU Dear Mustapha,
I found a detailed error: "WaitCondition received failed message: 'Failed to mount FSX. Please check /var/log/chef-client.log in the head node, or check the chef-client.log in CloudWatch logs. Please refer to https://docs.aws.amazon.com/parallelcluster/latest/ug/troubleshooting-v3.html for more details.' for uniqueId: i-0c3db5db03b82b28c". I will try to fix it today and let you know if it works.
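For reference while I debug this, the FSx mount is normally driven by the SharedStorage section of the cluster config; a minimal sketch for mounting an existing file system (the FileSystemId and MountDir below are placeholders) looks roughly like this:

SharedStorage:
  - Name: fsx-shared                          # arbitrary label
    MountDir: /fsx                            # placeholder mount point on the nodes
    StorageType: FsxLustre
    FsxLustreSettings:
      FileSystemId: fs-0123456789abcdef0      # placeholder: an existing FSx for Lustre file system

When this mount fails, the usual suspects are the FSx security group not allowing Lustre traffic (port 988) from the cluster subnets, or the file system sitting in a subnet/AZ the nodes cannot reach.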
Thank you. Regards.
Thanks @hoai . Looking forward to it.
I have added this to most of my pcluster YAML configs, and it seems to have resolved most timeout problems.
DevSettings:
  Timeouts:
    HeadNodeBootstrapTimeout: 3600
    ComputeNodeBootstrapTimeout: 3600
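For context, DevSettings is a top-level section of the cluster config and the timeout values are in seconds; if I remember right the defaults are 1800, so this doubles them.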
@iamh2o Thanks for your suggestion --- increasing the timeout was one of the hacks I tried, but it never resolved the issue.
Based on this notice that went out in June, maybe it's got to do with the pcluster version?
"Our records indicate you have used AWS ParallelCluster within the last six months since the receipt of this notice. This message is pertinent to AWS ParallelCluster versions 3.9.0 and higher.
ParallelCluster v3.9.0 and higher were shipped with Slurm version 23.11.x. With this version of Slurm, we have identified an issue where the IP addresses of new nodes are not communicated properly to other nodes, causing the node to be marked as unhealthy which is then terminated and replaced by ParallelCluster. While the replacement of unhealthy nodes works as expected, in this case a wrong IP address is assigned every time a new node starts up, resulting in an endless bootup-terminating loop that may affect running jobs.
We fixed this issue after consulting SchedMD by removing the cloud_dns configuration from SlurmctldParameters in the Slurm configuration ParallelCluster applies. To mitigate or avoid the issue, we recommend using ParallelCluster v3.9.3 or version 3.10.0 to recreate your cluster. For more information on how to recreate your cluster please refer the ParallelCluster documentation page [1].
Alternately, you can update the Slurm version on existing clusters to version 23.11.8 to mitigate or avoid this issue. Refer to this ParallelCluster GitHub wiki [2] for directions on upgrading the Slurm version on existing clusters."
Hi, thanks for creating and sharing this incredible project!
For some reason, I am unable to create the ParallelCluster.
To troubleshoot, I reviewed the event logs in CloudFormation, and it looks like the failure is due to the HeadNodeWaitCondition.
Here is my custom command to create the config file (some details are redacted):
...and to create the cluster:
I also checked and noticed that I do not currently have any quota for the g5.2xlarge instance type for compute nodes as specified in the cluster-config-template.yaml, so I have requested an increase in the meantime. Any idea how I can resolve this?
Edit:
I have added some relevant parts of the logs from the HeadNode:
Edit 2: The quota increase for the compute node resources (i.e., g5.2xlarge) was approved, but the error persists.