clusterinthecloud / support

If you need help with Cluster in the Cloud, this is the right place

FAILED - RETRYING: Wait for packer to finish (200 retries left). #42

Closed eshnil2000 closed 2 years ago

eshnil2000 commented 2 years ago

On AWS, I'm able to create the master node, but when I SSH into it and run:

sudo tail -f /root/ansible-pull.log

I am stuck at "FAILED - RETRYING: Wait for packer to finish (200 retries left)." This never completes.

Some other considerations: to restrict SSH access to my laptop only, I changed the management node's ingress security group settings in variables.tf to:

resource "aws_security_group" "mgmt" {
  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["MY.IP.ADDRESS/32"]
  }
}

I also changed the default region in security-groups.tf:

# AWS Information
variable "region" {
  default = "us-west-1"
}

milliams commented 2 years ago

Does the log ever get past "(200 retries left)"? When I look at a successful log I see

TASK [finalise : Wait for packer to finish] ************************************
Monday 09 August 2021  13:36:14 +0000 (0:00:00.743)       0:08:12.711 ********* 
FAILED - RETRYING: Wait for packer to finish (200 retries left).
FAILED - RETRYING: Wait for packer to finish (199 retries left).
FAILED - RETRYING: Wait for packer to finish (198 retries left).
FAILED - RETRYING: Wait for packer to finish (197 retries left).
FAILED - RETRYING: Wait for packer to finish (196 retries left).
FAILED - RETRYING: Wait for packer to finish (195 retries left).
FAILED - RETRYING: Wait for packer to finish (194 retries left).
FAILED - RETRYING: Wait for packer to finish (193 retries left).
FAILED - RETRYING: Wait for packer to finish (192 retries left).
FAILED - RETRYING: Wait for packer to finish (191 retries left).
FAILED - RETRYING: Wait for packer to finish (190 retries left).
FAILED - RETRYING: Wait for packer to finish (189 retries left).
FAILED - RETRYING: Wait for packer to finish (188 retries left).
FAILED - RETRYING: Wait for packer to finish (187 retries left).
FAILED - RETRYING: Wait for packer to finish (186 retries left).
FAILED - RETRYING: Wait for packer to finish (185 retries left).
FAILED - RETRYING: Wait for packer to finish (184 retries left).
FAILED - RETRYING: Wait for packer to finish (183 retries left).
FAILED - RETRYING: Wait for packer to finish (182 retries left).
FAILED - RETRYING: Wait for packer to finish (181 retries left).
FAILED - RETRYING: Wait for packer to finish (180 retries left).
FAILED - RETRYING: Wait for packer to finish (179 retries left).
FAILED - RETRYING: Wait for packer to finish (178 retries left).
FAILED - RETRYING: Wait for packer to finish (177 retries left).
FAILED - RETRYING: Wait for packer to finish (176 retries left).
FAILED - RETRYING: Wait for packer to finish (175 retries left).
FAILED - RETRYING: Wait for packer to finish (174 retries left).
FAILED - RETRYING: Wait for packer to finish (173 retries left).
FAILED - RETRYING: Wait for packer to finish (172 retries left).
FAILED - RETRYING: Wait for packer to finish (171 retries left).
FAILED - RETRYING: Wait for packer to finish (170 retries left).
FAILED - RETRYING: Wait for packer to finish (169 retries left).
FAILED - RETRYING: Wait for packer to finish (168 retries left).
FAILED - RETRYING: Wait for packer to finish (167 retries left).
changed: [mgmt.handy-mosquito.citc.local]

TASK [create directory for the finalised files] ********************************
Monday 09 August 2021  13:44:16 +0000 (0:08:02.209)       0:16:14.921 ********* 
changed: [mgmt.handy-mosquito.citc.local]

...

It should be retrying every 10 seconds.
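As a rough illustration of the countdown in the log above, the sketch below mimics a poll-and-retry loop of the kind produced by Ansible's `until`/`retries`/`delay` task keywords. The names `wait_until` and `worst_case_wait_seconds` and the injectable `sleep` parameter are hypothetical; the values 200 and 10 are taken from this thread, not from the Cluster in the Cloud source.

```python
import time
from typing import Callable


def wait_until(check: Callable[[], bool], retries: int = 200, delay: float = 10.0,
               sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll check() until it returns True or the retry budget runs out."""
    for remaining in range(retries, 0, -1):
        if check():
            return True
        # Same wording as the log line in this thread
        print(f"FAILED - RETRYING: Wait for packer to finish ({remaining} retries left).")
        sleep(delay)
    # One final attempt after the last wait
    return check()


def worst_case_wait_seconds(retries: int = 200, delay: float = 10.0) -> float:
    """Total sleep time if every check fails (excludes the cost of each check)."""
    return retries * delay
```

With 200 retries and a 10 second delay, the sleeps alone add up to 2000 seconds, a little over 33 minutes. In the successful log above, 34 retries span about eight minutes (13:36:14 to 13:44:16), so each check evidently adds a few seconds on top of the delay.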

eshnil2000 commented 2 years ago

It counts down all the way to zero retries, and Slurm is not installed at the end of the setup.

I do see a Packer instance created in my AWS account.

milliams commented 2 years ago

It's likely that something is hanging somewhere. Packer does not have time-outs set up for the steps that it runs, so if one gets stuck it may run forever. This is something I should add in.

Could you run Packer manually and paste the output here? To get the full output, as the standard citc user on the cluster, run:

sudo /usr/local/bin/run-packer
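
Until time-outs like the ones mentioned above exist, a hung command can at least be bounded from the outside. The following is a minimal sketch using Python's standard `subprocess` module; the helper name `run_with_timeout` is hypothetical and is not part of Cluster in the Cloud.

```python
import subprocess
from typing import Optional, Sequence


def run_with_timeout(cmd: Sequence[str], timeout_s: float) -> Optional[int]:
    """Run cmd; return its exit code, or None if it exceeded timeout_s seconds."""
    try:
        completed = subprocess.run(cmd, timeout=timeout_s, check=False)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising this exception
        return None
    return completed.returncode
```

It could be called as, for example, `run_with_timeout(["sudo", "/usr/local/bin/run-packer"], timeout_s=3600)`, where the one-hour limit is an arbitrary choice. Only the direct child process is killed on time-out, so anything it spawned may need separate cleanup.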

eshnil2000 commented 2 years ago

I ran

sudo /usr/local/bin/run-packer

After adding the management instance's IP address to the security group (created manually) of the Packer instance, I was able to get the Packer build to execute successfully and create a compute AMI image.
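
A quick way to confirm this kind of security group problem from the management node is to test whether a TCP connection to the other instance succeeds. The thread does not say which port was blocked; the sketch below assumes SSH on port 22, and the helper name `can_connect` is hypothetical.

```python
import socket


def can_connect(host: str, port: int = 22, timeout_s: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, time-outs and name resolution failures
        return False
```

A False result is consistent with a blocked ingress rule, although it can equally mean that the instance is down or the address is wrong.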

I am also able to create a new user.

But when I run

sinfo

I get

slurm_load_partitions: Unable to contact slurm controller (connect failure)

I also ran:

finish

output:

Error: The management node has not finished its setup
Please allow it to finish before continuing.
For information about why they have not finished, check the file /root/ansible-pull.log

I deleted the ansible-pull.log file manually before running sudo /usr/local/bin/run-packer, and it was not created again after run-packer.

milliams commented 2 years ago

You should now be able to switch to the root user (sudo -i) and then run (from the /root folder) the command:

/root/run_ansible --skip-tags=packer

This will allow Ansible to complete its run, skipping the step where it runs Packer (as you've already got that to run successfully). Once that's complete then you can run finish and things should be ready to go.

eshnil2000 commented 2 years ago

Thanks. The cluster works now with:

/root/run_ansible --skip-tags=packer

I'm able to create nodes, log into the Grafana dashboard, and get to the webui dashboard (though the admin password retrieved from "get_secrets" doesn't seem to work for the webui, but I will investigate further and open a separate issue if I can't figure it out).