aws / aws-parallelcluster

AWS ParallelCluster is an AWS supported Open Source cluster management tool to deploy and manage HPC clusters in the AWS cloud.
https://github.com/aws/aws-parallelcluster
Apache License 2.0

pcluster create hangs in step "Status: MasterServer - CREATE_IN_PROGRESS" when using custom AMI #1088

Closed hkroeger closed 5 years ago

hkroeger commented 5 years ago

Environment:

I created a custom AMI following these steps: https://aws-parallelcluster.readthedocs.io/en/latest/tutorials/02_ami_customization.html. It is based on the AMI ami-02de781189ccb9f92 (eu-west-3, Ubuntu 16.04).

When I execute pcluster create, the creation hangs in step "Status: MasterServer - CREATE_IN_PROGRESS". The master instance is created and running. I can log in to it, but the scheduler software, for example, does not appear to be installed.

The cloud-init log files are attached. There is no cfn.log file.

cloud-init.log cloud-init-output.log

hkroeger commented 5 years ago

Please note: the config file is below:

[cluster spot48]
vpc_settings = spot48-vpc
key_name = XXXX
compute_instance_type = m5.24xlarge
master_instance_type = t2.micro
initial_queue_size = 0
max_queue_size = 10
maintain_initial_size = false
cluster_type = spot
shared_dir = /shared
ebs_settings = sdshared
custom_ami = ami-XXXXX

[vpc spot48-vpc]
master_subnet_id = subnet-XXXX
vpc_id = vpc-XXXX

[global]
update_check = true
sanity_check = true
cluster_template = spot48

[ebs sdshared]
volume_size = 70

[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}

enrico-usai commented 5 years ago

Hi @hkroeger

I see in your cloud-init.log the error:

Failed to get raw userdata in module rightscale_userdata

so it seems there are errors executing the user data of your custom AMI.
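To locate messages like the one above in a long cloud-init.log, a small filter can help. The following is only a sketch: the keyword list is an assumption about what is worth looking at, the function name is hypothetical, and the second sample line is a placeholder rather than real log output.

```python
import re

# Assumption: these keywords are a reasonable first filter for problems.
PATTERN = re.compile(r"fail|error|traceback", re.IGNORECASE)

def suspicious_lines(log_text):
    """Return (line_number, line) pairs whose text matches PATTERN."""
    return [
        (number, line)
        for number, line in enumerate(log_text.splitlines(), start=1)
        if PATTERN.search(line)
    ]

# First line is quoted from this thread; the second is a placeholder.
SAMPLE_LOG = (
    "Failed to get raw userdata in module rightscale_userdata\n"
    "placeholder line without any problem keyword\n"
)

for number, line in suspicious_lines(SAMPLE_LOG):
    print(f"{number}: {line}")
```

On the instance itself, the same function could be applied to the contents of /var/log/cloud-init.log and /var/log/cloud-init-output.log.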

I think the problem is that you didn't set the base_os parameter in your configuration, so ParallelCluster is trying to use the default one (alinux), which differs from the OS you selected when creating the AMI.

You could retry by setting:

[cluster spot48]
...
base_os = ubuntu1604

Let us know if it helps.
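A quick way to catch this kind of mismatch before running pcluster create is to check the config file for cluster sections that set custom_ami without base_os. The snippet below is a minimal sketch using Python's standard configparser. The function name is hypothetical, it is not part of the pcluster CLI, and the sample text is a shortened copy of the config posted above.

```python
import configparser

# Shortened copy of the config from this thread (base_os is absent).
SAMPLE = """\
[global]
cluster_template = spot48

[cluster spot48]
master_instance_type = t2.micro
custom_ami = ami-XXXXX

[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}
"""

def clusters_missing_base_os(config_text):
    """List cluster sections that define custom_ami but not base_os."""
    # Interpolation is disabled so values are read literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    return [
        name
        for name in parser.sections()
        if name.startswith("cluster ")
        and "custom_ami" in parser[name]
        and "base_os" not in parser[name]
    ]

print(clusters_missing_base_os(SAMPLE))
```

An empty list means every cluster section with a custom AMI also names its OS. The check does not verify that the named OS matches the AMI.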

hkroeger commented 5 years ago

Hi enrico, I had already tried setting the "base_os" parameter, and the outcome was the same.

It's probably worth noting that the entire procedure works if I use centos7 instead of ubuntu1604. Maybe the ubuntu1604 image is not as well maintained?

BTW, are there any plans to make ubuntu1804 available in ParallelCluster?

Regards, Hannes

enrico-usai commented 5 years ago

Hi @hkroeger

no-response[bot] commented 5 years ago

This issue has been automatically closed because there has been no response to our request for more information from the original author. With only the information that is currently in the issue, we don't have enough information to take action. Please reach out if you have or find the answers we need so that we can investigate further.