aws-samples / genai-video-super-resolution

MIT No Attribution

ParallelCluster creation fails: WaitCondition timed out #3

Open MustaphaU opened 5 months ago

MustaphaU commented 5 months ago

Hi, thanks for creating and sharing this incredible project!

For some reason, I am unable to create the ParallelCluster.

To troubleshoot, I reviewed the event logs in CloudFormation, and the failure appears to come from the HeadNodeWaitCondition:

The following resource(s) failed to create: [HeadNodeWaitCondition20240602214453]. 

Here is the command I used to create the config file (some details are redacted):

cd pcluster

./install.sh -s video-upscaling-scripts -k mykeypair -v subnet-************ -u subnet-0***********0 -b s3://bucket/bootstrap/compute-node-configured.sh -d s3://bucket/bootstrap/compute-node-cpu-configured.sh -n s3://bucket/bootstrap/head-node-configured.sh -g ami-054a2******299537 -r us-west-2

...and to create the cluster:

pcluster create-cluster --cluster-name superres-cluster --cluster-configuration /tmp/cluster-config.yaml

I also noticed that I currently have no quota for the g5.2xlarge instance type that cluster-config-template.yaml specifies for the compute nodes, so I have requested an increase in the meantime.

Any idea how I can resolve this?

Edit:

I have added the relevant part of the head node logs:

Amazon Linux 2
Kernel 5.10.216-204.855.amzn2.x86_64 on an x86_64

ip-10-0-9-101 login: [  366.371712] cloud-init[2943]: Unknown error retrieving HeadNodeLaunchTemplate
[  366.404797] cloud-init[2943]: ++ cat /tmp/wait_condition_handle.txt
[  366.406185] cloud-init[2943]: cat: /tmp/wait_condition_handle.txt: No such file or directory
[  366.407569] cloud-init[2943]: + wait_condition_handle_presigned_url=
[  366.408608] cloud-init[2943]: + custom_cookbook=NONE
[  366.409340] cloud-init[2943]: + export _region=us-west-2
[  366.413355] cloud-init[2943]: + _region=us-west-2
[  366.415351] cloud-init[2943]: + s3_url=amazonaws.com
[  366.416085] cloud-init[2943]: + '[' NONE '!=' NONE ']'
[  366.416882] cloud-init[2943]: + export parallelcluster_version=aws-parallelcluster-3.9.2
[  366.418113] cloud-init[2943]: + parallelcluster_version=aws-parallelcluster-3.9.2
[  366.419246] cloud-init[2943]: + export cookbook_version=aws-parallelcluster-cookbook-3.9.2
[  366.420991] cloud-init[2943]: + cookbook_version=aws-parallelcluster-cookbook-3.9.2
[  366.422565] cloud-init[2943]: + export chef_version=18.2.7
[  366.423504] cloud-init[2943]: + chef_version=18.2.7
[  366.424279] cloud-init[2943]: + export berkshelf_version=8.0.7
[  366.425086] cloud-init[2943]: + berkshelf_version=8.0.7
[  366.425799] cloud-init[2943]: + '[' -f /opt/parallelcluster/.bootstrapped ']'
[  366.511291] cloud-init[2943]: ++ cat /opt/parallelcluster/.bootstrapped
[  366.690072] cloud-init[2943]: + installed_version=aws-parallelcluster-cookbook-3.9.2
[  366.690525] cloud-init[2943]: + '[' aws-parallelcluster-cookbook-3.9.2 '!=' aws-parallelcluster-cookbook-3.9.2 ']'
[  366.690709] cloud-init[2943]: + '[' NONE '!=' NONE ']'
[  366.690856] cloud-init[2943]: + cfn-init -s superres-cluster -v -c default -r HeadNodeLaunchTemplate --region us-west-2 --url https://cloudformation.us-west-2.amazonaws.com
[  677.561128] cloud-init[2943]: Unknown error retrieving HeadNodeLaunchTemplate
[  677.591035] cloud-init[2943]: + error_exit 'Failed to bootstrap the head node. Please check /var/log/cfn-init.log and /var/log/chef-client.log in the head node or in CloudWatch logs. Please refer to https://docs.aws.amazon.com/parallelcluster/latest/ug/troubleshooting-v3.html#troubleshooting-v3-get-logs for more details on ParallelCluster logs.'
[  677.591328] cloud-init[2943]: + sleep 10
[  687.596094] cloud-init[2943]: +++ stat --printf=%s /tmp/wait_condition_handle.txt
[  687.599307] cloud-init[2943]: stat: cannot stat ‘/tmp/wait_condition_handle.txt’: No such file or directory
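
One detail worth pulling out of the console output above is how long the cfn-init call ran before it reported the error. The following is a minimal Python sketch, not part of the project, that reads the seconds-since-boot stamps from two lines copied from the log. The helper names are hypothetical. A gap of several minutes would be consistent with cfn-init waiting on a connection to the CloudFormation endpoint rather than failing immediately, but that is an interpretation; the log itself only says "Unknown error".

```python
import re

# Two lines copied verbatim from the console output above. The bracketed
# number at the start of each line is the time in seconds since boot.
CONSOLE_LINES = [
    "[  366.690856] cloud-init[2943]: + cfn-init -s superres-cluster -v -c default "
    "-r HeadNodeLaunchTemplate --region us-west-2 "
    "--url https://cloudformation.us-west-2.amazonaws.com",
    "[  677.561128] cloud-init[2943]: Unknown error retrieving HeadNodeLaunchTemplate",
]

# Matches the leading "[  366.690856]" stamp and captures the number.
STAMP = re.compile(r"^\[\s*(\d+\.\d+)\]")


def seconds_since_boot(line: str) -> float:
    """Return the console timestamp at the start of a log line."""
    match = STAMP.match(line)
    if match is None:
        raise ValueError(f"no console timestamp in: {line!r}")
    return float(match.group(1))


def elapsed(start_line: str, end_line: str) -> float:
    """Seconds between two console log lines."""
    return seconds_since_boot(end_line) - seconds_since_boot(start_line)


blocked_for = elapsed(CONSOLE_LINES[0], CONSOLE_LINES[1])
print(f"cfn-init ran for about {blocked_for:.0f} s before the error")
# prints: cfn-init ran for about 311 s before the error
```

The same two helpers can be pointed at any pair of lines from the console output to see where the bootstrap spends its time.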

Edit 2: The quota increase for the compute node instance type (g5.2xlarge) was approved, but the error persists.

MustaphaU commented 5 months ago

I got to the point where the head node and compute nodes were created by setting ElasticIp and AssignPublicIp to true in the configuration file. However, cluster creation still fails.
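
For readers wondering where those two settings live, here is a minimal Python sketch, assuming the standard ParallelCluster 3 configuration layout in which ElasticIp sits under HeadNode / Networking and AssignPublicIp sits under each entry of Scheduling / SlurmQueues / Networking. The function name and the trimmed example config are hypothetical, and the thread does not show the exact file that was edited. Loading and saving the YAML file is left out so the sketch needs only the standard library.

```python
import copy


def enable_public_ips(config: dict) -> dict:
    """Return a copy of a cluster config with public IP settings turned on.

    Assumes the ParallelCluster 3 layout: HeadNode.Networking.ElasticIp and
    Scheduling.SlurmQueues[].Networking.AssignPublicIp.
    """
    patched = copy.deepcopy(config)  # leave the caller's dict untouched

    head_net = patched.setdefault("HeadNode", {}).setdefault("Networking", {})
    head_net["ElasticIp"] = True

    for queue in patched.get("Scheduling", {}).get("SlurmQueues", []):
        queue.setdefault("Networking", {})["AssignPublicIp"] = True

    return patched


# Hypothetical, heavily trimmed config; subnet IDs and queue name are
# placeholders, not values from this issue.
example = {
    "HeadNode": {"Networking": {"SubnetId": "subnet-EXAMPLE"}},
    "Scheduling": {
        "SlurmQueues": [
            {"Name": "gpu", "Networking": {"SubnetIds": ["subnet-EXAMPLE"]}},
        ]
    },
}

patched = enable_public_ips(example)
print(patched["HeadNode"]["Networking"])
print(patched["Scheduling"]["SlurmQueues"][0]["Networking"])
```

In the YAML file itself, these correspond to an `ElasticIp: true` line in the head node's Networking section and an `AssignPublicIp: true` line in each queue's Networking section.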

MustaphaU commented 4 months ago

Not resolved yet. @hoai

hoai commented 4 months ago

@MustaphaU Dear Mustapha,

I found a more detailed error: "WaitCondition received failed message: 'Failed to mount FSX. Please check /var/log/chef-client.log in the head node, or check the chef-client.log in CloudWatch logs. Please refer to https://docs.aws.amazon.com/parallelcluster/latest/ug/troubleshooting-v3.html for more details.' for uniqueId: i-0c3db5db03b82b28c". I will try to fix it today and let you know whether it works.

Thank you. Regards.

MustaphaU commented 4 months ago

Thanks, @hoai. Looking forward to it.

iamh2o commented 2 weeks ago

I have added this to most of my pcluster YAML configs, and it seems to have resolved most timeout problems.


DevSettings:
  Timeouts:
    HeadNodeBootstrapTimeout: 3600
    ComputeNodeBootstrapTimeout: 3600

MustaphaU commented 2 weeks ago

@iamh2o Thanks for your suggestion. Increasing the timeout was one of the hacks I tried, but it never resolved the issue.

Based on this notice that went out in June, maybe it has to do with the pcluster version?

"Our records indicate you have used AWS ParallelCluster within the last six months since the receipt of this notice. This message is pertinent to AWS ParallelCluster versions 3.9.0 and higher.

ParallelCluster v3.9.0 and higher were shipped with Slurm version 23.11.x. With this version of Slurm, we have identified an issue where the IP addresses of new nodes are not communicated properly to other nodes, causing the node to be marked as unhealthy which is then terminated and replaced by ParallelCluster. While the replacement of unhealthy nodes works as expected, in this case a wrong IP address is assigned every time a new node starts up, resulting in an endless bootup-terminating loop that may affect running jobs.

We fixed this issue after consulting SchedMD by removing the  cloud_dns configuration from SlurmctldParameters in the Slurm configuration ParallelCluster applies. To mitigate or avoid the issue, we recommend using ParallelCluster v3.9.3 or version 3.10.0 to recreate your cluster.  For more information on how to recreate your cluster please refer the ParallelCluster documentation page [1].

Alternately, you can update the Slurm version on existing clusters to version 23.11.8 to mitigate or avoid this issue. Refer to this ParallelCluster GitHub wiki [2] for directions on upgrading the Slurm version on existing clusters."
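
To relate the notice to this cluster: the head node log earlier in the thread reports aws-parallelcluster-3.9.2. The Python sketch below checks whether a version string falls in the range the notice seems to describe, assuming that means 3.9.0 up to but not including 3.9.3 (with 3.10.0 and later also fixed). That range is a reading of the quoted notice, not an official statement, and the function names are hypothetical. Note also that the notice describes compute node replacement loops, so a match does not by itself explain the head node failure reported here.

```python
import re

# Assumed bounds, taken from the wording of the quoted notice:
# "versions 3.9.0 and higher" are affected; 3.9.3 and 3.10.0 carry the fix.
FIRST_AFFECTED = (3, 9, 0)
FIRST_FIXED = (3, 9, 3)


def parse_version(text: str) -> tuple:
    """Extract the first MAJOR.MINOR.PATCH triple from a string."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number in: {text!r}")
    return tuple(int(part) for part in match.groups())


def in_notice_range(text: str) -> bool:
    """True if the version lies in the range the notice appears to cover."""
    return FIRST_AFFECTED <= parse_version(text) < FIRST_FIXED


# Version string as it appears in the head node log in this thread.
print(in_notice_range("aws-parallelcluster-3.9.2"))  # prints: True
```

Comparing integer tuples rather than strings matters here: as text, "3.10.0" sorts before "3.9.3", which would wrongly flag the fixed release.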