aws-samples / genai-video-super-resolution

MIT No Attribution

ParallelCluster creation fails: WaitCondition timed out #3

Open MustaphaU opened 1 month ago

MustaphaU commented 1 month ago

Hi, thanks for creating and sharing this incredible project!

For some reason, I am unable to create the ParallelCluster.

To troubleshoot, I reviewed the event logs in CloudFormation, and it looks like the failure is due to the HeadNodeWaitCondition:

The following resource(s) failed to create: [HeadNodeWaitCondition20240602214453]. 

Here is the command I used to create the config file (some details are redacted):

cd pcluster

./install.sh -s video-upscaling-scripts -k mykeypair -v subnet-************ -u subnet-0***********0 -b s3://bucket/bootstrap/compute-node-configured.sh -d s3://bucket/bootstrap/compute-node-cpu-configured.sh -n s3://bucket/bootstrap/head-node-configured.sh -g ami-054a2******299537 -r us-west-2

..and to create the cluster:

pcluster create-cluster --cluster-name superres-cluster --cluster-configuration /tmp/cluster-config.yaml

I also noticed that I currently have no quota for the g5.2xlarge instance type that cluster-config-template.yaml specifies for the compute nodes, so I have requested an increase in the meantime.

Any idea how I can resolve this?

Edit:

I have added the relevant part of the HeadNode logs:

Amazon Linux 2
Kernel 5.10.216-204.855.amzn2.x86_64 on an x86_64

ip-10-0-9-101 login: [  366.371712] cloud-init[2943]: Unknown error retrieving HeadNodeLaunchTemplate
[  366.404797] cloud-init[2943]: ++ cat /tmp/wait_condition_handle.txt
[  366.406185] cloud-init[2943]: cat: /tmp/wait_condition_handle.txt: No such file or directory
[  366.407569] cloud-init[2943]: + wait_condition_handle_presigned_url=
[  366.408608] cloud-init[2943]: + custom_cookbook=NONE
[  366.409340] cloud-init[2943]: + export _region=us-west-2
[  366.413355] cloud-init[2943]: + _region=us-west-2
[  366.415351] cloud-init[2943]: + s3_url=amazonaws.com
[  366.416085] cloud-init[2943]: + '[' NONE '!=' NONE ']'
[  366.416882] cloud-init[2943]: + export parallelcluster_version=aws-parallelcluster-3.9.2
[  366.418113] cloud-init[2943]: + parallelcluster_version=aws-parallelcluster-3.9.2
[  366.419246] cloud-init[2943]: + export cookbook_version=aws-parallelcluster-cookbook-3.9.2
[  366.420991] cloud-init[2943]: + cookbook_version=aws-parallelcluster-cookbook-3.9.2
[  366.422565] cloud-init[2943]: + export chef_version=18.2.7
[  366.423504] cloud-init[2943]: + chef_version=18.2.7
[  366.424279] cloud-init[2943]: + export berkshelf_version=8.0.7
[  366.425086] cloud-init[2943]: + berkshelf_version=8.0.7
[  366.425799] cloud-init[2943]: + '[' -f /opt/parallelcluster/.bootstrapped ']'
[  366.511291] cloud-init[2943]: ++ cat /opt/parallelcluster/.bootstrapped
[  366.690072] cloud-init[2943]: + installed_version=aws-parallelcluster-cookbook-3.9.2
[  366.690525] cloud-init[2943]: + '[' aws-parallelcluster-cookbook-3.9.2 '!=' aws-parallelcluster-cookbook-3.9.2 ']'
[  366.690709] cloud-init[2943]: + '[' NONE '!=' NONE ']'
[  366.690856] cloud-init[2943]: + cfn-init -s superres-cluster -v -c default -r HeadNodeLaunchTemplate --region us-west-2 --url https://cloudformation.us-west-2.amazonaws.com
[  677.561128] cloud-init[2943]: Unknown error retrieving HeadNodeLaunchTemplate
[  677.591035] cloud-init[2943]: + error_exit 'Failed to bootstrap the head node. Please check /var/log/cfn-init.log and /var/log/chef-client.log in the head node or in CloudWatch logs. Please refer to https://docs.aws.amazon.com/parallelcluster/latest/ug/troubleshooting-v3.html#troubleshooting-v3-get-logs for more details on ParallelCluster logs.'
[  677.591328] cloud-init[2943]: + sleep 10
[  687.596094] cloud-init[2943]: +++ stat --printf=%s /tmp/wait_condition_handle.txt
[  687.599307] cloud-init[2943]: stat: cannot stat ‘/tmp/wait_condition_handle.txt’: No such file or directory
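A console log like the one above is easier to read once the shell trace lines are filtered out. The following is a minimal sketch, not part of the project: the function name `find_errors` and the list of error patterns are assumptions, picked from the messages that appear in this log.

```python
import re

# Illustrative patterns only (an assumption), taken from messages in the log above.
ERROR_PATTERNS = (
    r"Unknown error",
    r"No such file or directory",
    r"error_exit",
)

# Matches a console prefix such as "[  366.371712] cloud-init[2943]: ",
# including any text (e.g. a login prompt) that precedes it on the line.
PREFIX = re.compile(r"^.*?\[\s*(?P<ts>\d+\.\d+)\]\s+cloud-init\[\d+\]:\s*")


def find_errors(console_output):
    """Return (timestamp_seconds, message) for each cloud-init line
    whose message matches one of ERROR_PATTERNS."""
    hits = []
    for line in console_output.splitlines():
        match = PREFIX.match(line)
        if not match:
            continue  # not a cloud-init console line
        message = line[match.end():]
        if any(re.search(pattern, message) for pattern in ERROR_PATTERNS):
            hits.append((float(match.group("ts")), message))
    return hits
```

Applied to the log above, the timestamps show roughly 311 seconds between the `cfn-init` call and the following "Unknown error retrieving HeadNodeLaunchTemplate" line. That may point to a timeout rather than an immediate failure, but this reading is an inference and the log itself does not state the cause.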

Edit 2: The quota increase for the compute node instances (g5.2xlarge) was approved, but the error persists.

MustaphaU commented 1 month ago

I got to the point where the head node and compute nodes were created by setting ElasticIp and AssignPublicIp to true in the configuration file. However, cluster creation still fails.
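For readers who want to reproduce that change, here is a rough sketch of the two settings applied to a cluster configuration that has already been loaded into a Python dict. The thread does not show where the settings were placed; the layout below assumes the ParallelCluster 3 schema, with `ElasticIp` under `HeadNode` / `Networking` and `AssignPublicIp` under each Slurm queue's `Networking`. The helper name `enable_public_ips` is hypothetical.

```python
import copy


def enable_public_ips(config):
    """Return a copy of a cluster config dict with public addressing enabled.

    Assumes the ParallelCluster 3 layout (an assumption, not shown in the thread):
    HeadNode.Networking.ElasticIp and Scheduling.SlurmQueues[].Networking.AssignPublicIp.
    """
    patched = copy.deepcopy(config)  # leave the caller's dict untouched

    head_networking = patched.setdefault("HeadNode", {}).setdefault("Networking", {})
    head_networking["ElasticIp"] = True

    for queue in patched.get("Scheduling", {}).get("SlurmQueues", []):
        queue.setdefault("Networking", {})["AssignPublicIp"] = True

    return patched
```

The function works on a copy, so the original configuration can still be compared against the patched one before writing it back to `/tmp/cluster-config.yaml`.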

MustaphaU commented 3 weeks ago

Not resolved yet. @hoai

hoai commented 3 weeks ago

@MustaphaU Dear Mustapha,

I found a more detailed error: "WaitCondition received failed message: 'Failed to mount FSX. Please check /var/log/chef-client.log in the head node, or check the chef-client.log in CloudWatch logs. Please refer to https://docs.aws.amazon.com/parallelcluster/latest/ug/troubleshooting-v3.html for more details.' for uniqueId: i-0c3db5db03b82b28c". I will try to fix it today and let you know if it works.

Thank you. Regards.
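The event message quoted above carries two useful pieces: the failure reason and the instance that reported it. As a small sketch, assuming the message always follows the shape seen in this thread, they can be pulled apart with a regular expression. The name `parse_wait_condition` is hypothetical and not part of any AWS tool.

```python
import re

# Shape assumed from the single message quoted in this thread.
WAIT_CONDITION = re.compile(
    r"WaitCondition received failed message: '(?P<reason>.*?)'"
    r" for uniqueId: (?P<instance_id>i-[0-9a-f]+)",
    re.DOTALL,
)


def parse_wait_condition(event_reason):
    """Return {'reason': ..., 'instance_id': ...}, or None if the text does not match."""
    match = WAIT_CONDITION.search(event_reason)
    if match is None:
        return None
    return {
        "reason": match.group("reason"),
        "instance_id": match.group("instance_id"),
    }
```

The instance ID identifies which node to inspect, and the reason names the log file to read there (here, `/var/log/chef-client.log` on the head node).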

MustaphaU commented 3 weeks ago

Thanks, @hoai. Looking forward to it.