clusterinthecloud / support

If you need help with Cluster in the Cloud, this is the right place

Terraform script hangs at run_ansible. #38

Closed p-1603 closed 2 years ago

p-1603 commented 3 years ago

Hello,

I'm not sure whether this problem is related to issue #34.

I have created 3 CITC clusters on Oracle OCI over the last 24 hours using the CITC docs, and after running terraform apply oracle and ssh'ing into the management node, I find that the finish script doesn't run, giving only the following error:

[citc@mgmt ~]$ finish
Error: The management node has not finished its setup
Please allow it to finish before continuing.
For information about why they have not finished, check the file /root/ansible-pull.log
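For reference, the log mentioned in that message can be inspected from the management node along these lines (the address below is a placeholder; the citc user and the /root/ansible-pull.log path come from the error above):

ssh citc@<management-node-ip>              # placeholder address for the mgmt node
sudo tail -n 100 /root/ansible-pull.log    # log path taken from the error message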

The ansible-pull.log is as follows:

Thursday 10 June 2021  12:21:11 +0000 (0:00:00.080)       0:02:21.844 *********
===============================================================================
slurm : install Slurm ------------------------------------------------------ 14.67s
install ssh server ---------------------------------------------------------- 5.83s
mysql : install mariadb module ---------------------------------------------- 5.45s
ldap : enable 389-ds module ------------------------------------------------- 5.42s
ldap : install 389-ds ------------------------------------------------------- 5.29s
slurm : install munge ------------------------------------------------------- 5.10s
install python38 ------------------------------------------------------------ 5.09s
slurm : install ssh-keygen -------------------------------------------------- 4.88s
packer : Ensure unzip is installed. ----------------------------------------- 4.87s
filesystem : install nfs-utils ---------------------------------------------- 4.86s
slurm : install SELinux-policy ---------------------------------------------- 4.85s
slurm : install firewalld --------------------------------------------------- 4.85s
mysql : install mariadb ----------------------------------------------------- 4.84s
ldap : install nss-tools ---------------------------------------------------- 4.83s
mysql : install PyMySQL ----------------------------------------------------- 4.81s
security_updates : Install security updates --------------------------------- 4.80s
slurm : install python-firewall --------------------------------------------- 4.80s
slurm : install common tools ------------------------------------------------ 3.96s
slurm : install OCI tools --------------------------------------------------- 3.69s
Gathering Facts ------------------------------------------------------------- 1.20s
Playbook run took 0 days, 0 hours, 2 minutes, 21 seconds

Even after leaving the script overnight, no further progress was made.

I found that there was one failure in the ansible-pull.log prior to this:

TASK [security_updates : Install security updates] *********************************
Thursday 10 June 2021  12:21:06 +0000 (0:00:00.439)       0:02:16.960 *********
fatal: [mgmt.subnet.intimatecollie.oraclevcn.com]: FAILED! => changed=true
  cmd:
  - dnf
  - update
  - -y
  - --security
  - --exclude
  - kernel*
  delta: '0:00:04.545339'
  end: '2021-06-10 12:21:11.715756'
  msg: non-zero return code
  rc: 1
  start: '2021-06-10 12:21:07.170417'
  stderr: |-
    Error:
     Problem: problem with installed package slurm-libpmi-20.02.5-1.20.x86_64
      - package slurm-libpmi-20.02.5-1.20.x86_64 requires libslurmfull.so()(64bit), but none of the providers can be installed
      - package slurm-libpmi-20.02.5-1.20.x86_64 requires slurm(x86-64) = 20.02.5-1.20, but none of the providers can be installed
      - cannot install both slurm-20.11.7-3.el8.x86_64 and slurm-20.02.5-1.20.x86_64
      - cannot install both slurm-20.02.5-1.20.x86_64 and slurm-20.11.7-3.el8.x86_64
      - cannot install the best update candidate for package slurm-20.02.5-1.20.x86_64
  stderr_lines: <omitted>
  stdout: |-
    Last metadata expiration check: 2:12:54 ago on Thu Jun 10 10:08:14 2021.
    (try to add '--allowerasing' to command line to replace conflicting packages or '--skip-broken' to skip uninstallable packages or '--nobest' to use not only best candidate packages)
  stdout_lines: <omitted>
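For what it's worth, the failing command can be re-run by hand on the management node to see the full conflict; the first line below is copied from the cmd list above, and the --nobest variant is only dnf's own diagnostic suggestion, not necessarily the right fix:

sudo dnf update -y --security --exclude 'kernel*'          # the exact command Ansible ran
sudo dnf update --security --exclude 'kernel*' --nobest    # dnf's suggestion to see a non-conflicting candidate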

If I comment out the security_updates role in management.yml in the citc-ansible checkout and re-run run_ansible, then it runs until here (a sketch of that change is shown after the log below):

PLAY [finalise] ********************************************************************
TASK [finalise : Wait for packer to finish] ****************************************
Thursday 10 June 2021  12:29:13 +0000 (0:00:00.908)       0:02:47.083 *********
FAILED - RETRYING: Wait for packer to finish (100 retries left).
FAILED - RETRYING: Wait for packer to finish (99 retries left).
FAILED - RETRYING: Wait for packer to finish (98 retries left).
FAILED - RETRYING: Wait for packer to finish (97 retries left).
FAILED - RETRYING: Wait for packer to finish (96 retries left).
FAILED - RETRYING: Wait for packer to finish (95 retries left).
FAILED - RETRYING: Wait for packer to finish (94 retries left).
FAILED - RETRYING: Wait for packer to finish (93 retries left).
FAILED - RETRYING: Wait for packer to finish (92 retries left).
FAILED - RETRYING: Wait for packer to finish (91 retries left).
FAILED - RETRYING: Wait for packer to finish (90 retries left).
FAILED - RETRYING: Wait for packer to finish (89 retries left).
FAILED - RETRYING: Wait for packer to finish (88 retries left).
FAILED - RETRYING: Wait for packer to finish (87 retries left).
FAILED - RETRYING: Wait for packer to finish (86 retries left).
FAILED - RETRYING: Wait for packer to finish (85 retries left).
FAILED - RETRYING: Wait for packer to finish (84 retries left).
FAILED - RETRYING: Wait for packer to finish (83 retries left).
FAILED - RETRYING: Wait for packer to finish (82 retries left).
FAILED - RETRYING: Wait for packer to finish (81 retries left).
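For clarity, the workaround described above amounts to something like the following in management.yml. The surrounding role names are only an illustration based on the role prefixes visible in the log; only the commented-out line matters:

# management.yml (roles list, illustrative only)
  roles:
    - filesystem
    - ldap
    - mysql
    - slurm
    # - security_updates    # temporarily disabled to get past the dnf conflict
    - finalise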

I set up a cluster using the same instructions last week, when it seemed to work as normal and I could run finish within a few minutes of running terraform apply.

willfurnass commented 3 years ago

I think the first problem might be that the 'powertools' repository isn't defined when Ansible tries to install Slurm on the mgmt node: the powertools repo is enabled, but only by a role that is applied much later in the management.yml Ansible playbook.
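If that is the cause, a possible manual workaround (a standard RHEL 8-family command, not the project's own fix) would be to enable the repository on the management node before the Slurm install runs:

sudo dnf install -y dnf-plugins-core               # provides the config-manager subcommand
sudo dnf config-manager --set-enabled powertools   # repo id is 'powertools' on newer CentOS 8, 'PowerTools' on older releases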

milliams commented 2 years ago

I have identified the cause of the problem. In this case, the "update all packages" step was trying to update to a newer version of Slurm which is not yet fully packaged. I have added a fix to the Ansible configuration so that a new cluster should now build correctly.

Often the simplest solution with CitC is to destroy and recreate the cluster from scratch to pick up the newer version. In this case, though, you can apply the fix by logging in to the management node and, as root, running /root/update_ansible_repo followed by /root/run_ansible.
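In other words, on the existing management node (both paths are the ones given above):

sudo su -                    # the scripts are run as root
/root/update_ansible_repo    # pull down the fixed Ansible configuration
/root/run_ansible            # re-run the playbook against the updated checkout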

Will is right though that PowerTools has caused similar issues in the past due to case sensitivity.

p-1603 commented 2 years ago

Hello again,

I created another cluster (Oracle) earlier today and found that the package conflicts in Ansible were resolved, but the Ansible run still hung at the point below, which means the finish script still cannot be run.

TASK [finalise : Wait for packer to finish] ************************************
Wednesday 21 July 2021  13:55:30 +0000 (0:00:01.552)       0:09:15.005 ********
FAILED - RETRYING: Wait for packer to finish (100 retries left).
FAILED - RETRYING: Wait for packer to finish (99 retries left).
FAILED - RETRYING: Wait for packer to finish (98 retries left).
FAILED - RETRYING: Wait for packer to finish (97 retries left).
FAILED - RETRYING: Wait for packer to finish (96 retries left).
FAILED - RETRYING: Wait for packer to finish (95 retries left).
FAILED - RETRYING: Wait for packer to finish (94 retries left).
FAILED - RETRYING: Wait for packer to finish (93 retries left).
FAILED - RETRYING: Wait for packer to finish (92 retries left).
FAILED - RETRYING: Wait for packer to finish (91 retries left).
FAILED - RETRYING: Wait for packer to finish (90 retries left).
FAILED - RETRYING: Wait for packer to finish (89 retries left).
FAILED - RETRYING: Wait for packer to finish (88 retries left).
FAILED - RETRYING: Wait for packer to finish (87 retries left).
FAILED - RETRYING: Wait for packer to finish (86 retries left).
FAILED - RETRYING: Wait for packer to finish (85 retries left).
FAILED - RETRYING: Wait for packer to finish (84 retries left).
FAILED - RETRYING: Wait for packer to finish (83 retries left).
FAILED - RETRYING: Wait for packer to finish (82 retries left).
FAILED - RETRYING: Wait for packer to finish (81 retries left).
FAILED - RETRYING: Wait for packer to finish (80 retries left).
FAILED - RETRYING: Wait for packer to finish (79 retries left).
FAILED - RETRYING: Wait for packer to finish (78 retries left).
FAILED - RETRYING: Wait for packer to finish (77 retries left).
FAILED - RETRYING: Wait for packer to finish (76 retries left).
FAILED - RETRYING: Wait for packer to finish (75 retries left).
FAILED - RETRYING: Wait for packer to finish (74 retries left).
FAILED - RETRYING: Wait for packer to finish (73 retries left).
FAILED - RETRYING: Wait for packer to finish (72 retries left).
FAILED - RETRYING: Wait for packer to finish (71 retries left).
FAILED - RETRYING: Wait for packer to finish (70 retries left).
FAILED - RETRYING: Wait for packer to finish (69 retries left).
FAILED - RETRYING: Wait for packer to finish (68 retries left).
FAILED - RETRYING: Wait for packer to finish (67 retries left).
FAILED - RETRYING: Wait for packer to finish (66 retries left).
FAILED - RETRYING: Wait for packer to finish (65 retries left).
FAILED - RETRYING: Wait for packer to finish (64 retries left).

I'll update this if I discover anything more.

milliams commented 2 years ago

The packer run can take a long time to finish, especially on Oracle. I have also increased the timeout in the latest version to 200 attempts.
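For context, the "Wait for packer to finish" step is an Ansible task that simply polls until the packer image build completes. The sketch below only illustrates the retry/delay mechanism whose limit was raised to 200; the module, marker path and delay are assumptions, not the project's actual task:

- name: Wait for packer to finish
  # Illustrative only: poll for a hypothetical marker that packer writes when the image build is done
  stat:
    path: /mnt/shared/packer_finished    # hypothetical marker path
  register: packer_marker
  until: packer_marker.stat.exists
  retries: 200    # raised from 100 in the latest version
  delay: 10       # seconds between attempts (assumed)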