Closed p-1603 closed 3 years ago
I think the first problem might be that the 'powertools' repository isn't defined when Ansible tries installing Slurm on the mgmt
node - the powertools repo is enabled but only by a role that would be applied much later in the management.yml
Ansible playbook.
I have identified the cause of the problem. In this case, it is because the "update all packages" script was trying to update to a newer version of Slurm which is not fully packaged yet. I have added a fix to the Ansible package so that a new cluster should now build correctly.
Often the simplest solution with CitC is to destroy and recreate a cluster from scratch to get the newer version. In this case, you can apply the fix by logging in to the management node and, as root, running /root/update_ansible_repo and then /root/run_ansible.
Will is right though that PowerTools has caused similar issues in the past due to case sensitivity.
Hello again,
I created another cluster (Oracle) earlier today and found that the conflicts in Ansible were resolved, but the Ansible script still only ran up to this point, where it hung. This means that the finish script still cannot be run.
TASK [finalise : Wait for packer to finish] ************************************
Wednesday 21 July 2021 13:55:30 +0000 (0:00:01.552) 0:09:15.005 ********
FAILED - RETRYING: Wait for packer to finish (100 retries left).
FAILED - RETRYING: Wait for packer to finish (99 retries left).
FAILED - RETRYING: Wait for packer to finish (98 retries left).
FAILED - RETRYING: Wait for packer to finish (97 retries left).
FAILED - RETRYING: Wait for packer to finish (96 retries left).
FAILED - RETRYING: Wait for packer to finish (95 retries left).
FAILED - RETRYING: Wait for packer to finish (94 retries left).
FAILED - RETRYING: Wait for packer to finish (93 retries left).
FAILED - RETRYING: Wait for packer to finish (92 retries left).
FAILED - RETRYING: Wait for packer to finish (91 retries left).
FAILED - RETRYING: Wait for packer to finish (90 retries left).
FAILED - RETRYING: Wait for packer to finish (89 retries left).
FAILED - RETRYING: Wait for packer to finish (88 retries left).
FAILED - RETRYING: Wait for packer to finish (87 retries left).
FAILED - RETRYING: Wait for packer to finish (86 retries left).
FAILED - RETRYING: Wait for packer to finish (85 retries left).
FAILED - RETRYING: Wait for packer to finish (84 retries left).
FAILED - RETRYING: Wait for packer to finish (83 retries left).
FAILED - RETRYING: Wait for packer to finish (82 retries left).
FAILED - RETRYING: Wait for packer to finish (81 retries left).
FAILED - RETRYING: Wait for packer to finish (80 retries left).
FAILED - RETRYING: Wait for packer to finish (79 retries left).
FAILED - RETRYING: Wait for packer to finish (78 retries left).
FAILED - RETRYING: Wait for packer to finish (77 retries left).
FAILED - RETRYING: Wait for packer to finish (76 retries left).
FAILED - RETRYING: Wait for packer to finish (75 retries left).
FAILED - RETRYING: Wait for packer to finish (74 retries left).
FAILED - RETRYING: Wait for packer to finish (73 retries left).
FAILED - RETRYING: Wait for packer to finish (72 retries left).
FAILED - RETRYING: Wait for packer to finish (71 retries left).
FAILED - RETRYING: Wait for packer to finish (70 retries left).
FAILED - RETRYING: Wait for packer to finish (69 retries left).
FAILED - RETRYING: Wait for packer to finish (68 retries left).
FAILED - RETRYING: Wait for packer to finish (67 retries left).
FAILED - RETRYING: Wait for packer to finish (66 retries left).
FAILED - RETRYING: Wait for packer to finish (65 retries left).
FAILED - RETRYING: Wait for packer to finish (64 retries left).
I'll update this if I discover anything more.
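For anyone trying to tell whether a run like the one above is still making progress, here is a minimal sketch that summarises the retry lines in a log. It assumes the log uses exactly the "FAILED - RETRYING: ... (N retries left)." format shown above; the function name retry_progress is made up for illustration and is not part of CitC or Ansible.

```python
import re
from typing import Dict

# Matches lines such as:
#   FAILED - RETRYING: Wait for packer to finish (64 retries left).
RETRY_LINE = re.compile(
    r"^FAILED - RETRYING: (?P<task>.+) \((?P<left>\d+) retries left\)\.?$"
)


def retry_progress(log_text: str) -> Dict[str, int]:
    """Map each retrying task name to the lowest 'retries left' value seen.

    Hypothetical helper: it only understands the retry-line format quoted
    in this thread and ignores every other line.
    """
    lowest: Dict[str, int] = {}
    for line in log_text.splitlines():
        match = RETRY_LINE.match(line.strip())
        if match is None:
            continue
        task = match.group("task")
        left = int(match.group("left"))
        lowest[task] = min(left, lowest.get(task, left))
    return lowest
```

Running it twice a few minutes apart on the contents of /root/ansible-pull.log (the path named in the finish error message) should show whether the count is still going down or the run has stopped.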
The packer run can take a long time to finish, especially on Oracle. I have increased the time-out on the latest version on 6
to 200 attempts as well.
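To illustrate why raising the attempt limit helps, here is a generic polling loop. It is a sketch only, not the Ansible implementation or the actual CitC task: wait_until and make_check are hypothetical names, and the delay between attempts in the real task is not stated in this thread.

```python
import time
from typing import Callable


def wait_until(check: Callable[[], bool], retries: int, delay: float,
               sleep: Callable[[float], None] = time.sleep) -> int:
    """Poll check() until it returns True and return the attempts used.

    Raises TimeoutError once `retries` attempts have all failed.
    """
    for attempt in range(1, retries + 1):
        if check():
            return attempt
        if attempt < retries:
            sleep(delay)  # wait before the next attempt
    raise TimeoutError(f"gave up after {retries} attempts")


def make_check(ready_on_attempt: int) -> Callable[[], bool]:
    """Hypothetical stand-in for 'has packer finished?'.

    The returned function reports True from the Nth call onwards.
    """
    calls = 0

    def check() -> bool:
        nonlocal calls
        calls += 1
        return calls >= ready_on_attempt

    return check
```

With a job that only becomes ready on the 150th poll, a limit of 100 attempts raises TimeoutError, while a limit of 200 returns after 150 attempts. The total wait is bounded by roughly retries × delay, so doubling the attempts doubles the time a slow packer build is allowed.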
Hello,
I think this problem may or may not be related to issue #34.
I have created 3 CITC clusters on Oracle OCI over the last 24 hours using the CITC docs, and after running
terraform apply oracle
and ssh'ing into the management node, I find that the finish script doesn't run, giving only the following error:
[citc@mgmt ~]$ finish
Error: The management node has not finished its setup
Please allow it to finish before continuing.
For information about why they have not finished, check the file /root/ansible-pull.log
The ansible-pull.log is as follows:
Even after leaving the script overnight, no further progress was made.
I found that there was one failure in the ansible-pull.log prior to this:
If I comment out the - security_updates role in the management.yml in the citc-ansible checkout, and re-run run_ansible, then it runs until here:
I set up a cluster using the same instructions last week, when it seemed to work as normal and I could run finish within a few minutes of running terraform apply.