clusterinthecloud / support

If you need help with Cluster in the Cloud, this is the right place

installation of new cluster doesn't complete #34

Open boegel opened 3 years ago

boegel commented 3 years ago

I've made two attempts this afternoon to create a new CitC on AWS using the one-click installer, but for some reason the installation "hangs".

The management node is being created, and I can SSH into it, but the finish command keeps producing this (with or without a limits.yaml file):

[citc@mgmt ~]$ finish
Error: The management node has not finished its setup
Please allow it to finish before continuing.
For information about why they have not finished, check the file /root/ansible-pull.log

The last part in /root/ansible-pull.log is this:

TASK [slurm : open all ports] **************************************************
Friday 19 February 2021  14:19:11 +0000 (0:00:00.045)       0:06:17.021 *******

That was over 1 hour ago, no progress since then...

/var/log/slurm exists, but it is entirely empty.

Running processes:

```
root     1515  0.0  1.0  372592 40816 ?  Ss  14:12 0:00 /usr/libexec/platform-python /usr/bin/cloud-init modules --mode=final
root     1997  0.0  0.0  217052   732 ?  S   14:12 0:00  \_ tee -a /var/log/cloud-init-output.log
root     2037  0.0  0.0  235744  3412 ?  S   14:12 0:00  \_ /bin/bash /var/lib/cloud/instance/scripts/part-001
root     4767  0.0  0.9  406240 34832 ?  S   14:12 0:00  \_ /usr/bin/python3 -u /usr/bin/ansible-pull --url=https://github.com/clusterinthecloud/ansible.git --checkout=6 --inventory=/root/hosts management.yml
root     9929  7.3  1.6  590508 61548 ?  Sl  14:12 5:24  \_ /usr/bin/python3.6 /usr/bin/ansible-playbook -c local /root/.ansible/pull/ip-10-0-16-0.eu-west-1.compute.internal/management.yml -t all -l localhost,mgmt,ip-10-0-16-0,ip-10-0-16-0.eu-west-1.com
root    27615  0.0  1.4  583004 54488 ?  S   14:19 0:00  \_ /usr/bin/python3.6 /usr/bin/ansible-playbook -c local /root/.ansible/pull/ip-10-0-16-0.eu-west-1.compute.internal/management.yml -t all -l localhost,mgmt,ip-10-0-16-0,ip-10-0-16-0.eu-west-1
root    27616  0.0  0.0  235744  3372 ?  S   14:19 0:00  \_ /bin/sh -c /usr/libexec/platform-python && sleep 0
root    27617  0.0  0.8  415588 30484 ?  S   14:19 0:00  \_ /usr/libexec/platform-python
dirsrv  17078  0.1  2.1  662068 81740 ?  Ssl 14:14 0:06 /usr/sbin/ns-slapd -D /etc/dirsrv/slapd-mgmt -i /run/dirsrv/slapd-mgmt.pid
citc    17138  0.0  0.2   93904  9968 ?  Ss  14:15 0:00 /usr/lib/systemd/systemd --user
citc    17142  0.0  0.1  257440  5068 ?  S   14:15 0:00  \_ (sd-pam)
mysql   21671  0.0  2.4 1776020 93568 ?  Ssl 14:15 0:01 /usr/libexec/mysqld --basedir=/usr
munge   22577  0.0  0.1  125220  4048 ?  Sl  14:17 0:00 /usr/sbin/munged
root    24674  0.0  1.0  509096 41380 ?  Ssl 14:17 0:00 /usr/libexec/platform-python -s /usr/sbin/firewalld --nofork --nopid
root    27703  0.0  0.0  232532  2036 ?  Ss  15:01 0:00 /usr/sbin/anacron -s
```

Any suggestions on how to figure out what went wrong?

milliams commented 3 years ago

The first thing to be aware of is that writing to the log file sometimes suffers from buffering issues, so the latest thing printed there is not necessarily the latest task that ran.

The processes running there all make sense and I don't see any that would be likely to cause problems.

My two ideas for debugging it are:

  1. check lsof to see if there's anything that gives a hint as to what is hanging (see the sketch below this list)
  2. kill the Ansible run and run it again manually.
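For the lsof route, something along these lines should show what the stuck process is waiting on (the PID here is just the ansible-playbook child from your process listing; substitute whichever process looks hung on your node, and strace may need installing first):

```
# example PID taken from the process listing above
PID=27617
sudo lsof -p "$PID"                  # open files and sockets the process is holding
sudo cat /proc/"$PID"/wchan; echo    # kernel function it is currently sleeping in, if any
sudo strace -p "$PID"                # which syscall it is blocked on (Ctrl-C to detach)
```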

To run Ansible manually, sudo to root and, from root's home directory, run:

/usr/bin/ansible-pull --url=https://github.com/clusterinthecloud/ansible.git --checkout=6 --inventory=/root/hosts management.yml
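If the original run is still wedged you'll want to kill it first, and adding verbosity will show exactly which task it reaches. Roughly like this (the PIDs are just the ansible-pull/ansible-playbook processes from your listing):

```
# stop the hung run first (example PIDs from the process listing above)
kill 4767 9929

# re-run as root from /root with extra verbosity
cd /root
/usr/bin/ansible-pull -vv --url=https://github.com/clusterinthecloud/ansible.git \
    --checkout=6 --inventory=/root/hosts management.yml
```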

I have made some changes to the Ansible in the last few days but the tests I've run on Google and Oracle have worked without issue.

boegel commented 3 years ago

Thanks a lot for the quick feedback!

I checked with lsof, but couldn't seem to find any clues on what went wrong...

I restarted the Ansible playbook, and it's definitely progressing now; it's currently building the initial compute node image:

TASK [finalise : Wait for packer to finish]

I also see that the packer instance was started.

If I check where it was hanging previously, it seems like it didn't manage to get past the slurm : open all ports task for some reason, since the new run indicates that changes were made there?

...

TASK [set slurm log directory permissions] *************************************************************************************************************************************************************************************************************************************
Friday 19 February 2021  15:59:19 +0000 (0:00:00.507)       0:00:37.562 *******
ok: [mgmt.clever-pipefish.citc.local]

TASK [set slurm spool directory permissions] ***********************************************************************************************************************************************************************************************************************************
Friday 19 February 2021  15:59:19 +0000 (0:00:00.233)       0:00:37.795 *******
ok: [mgmt.clever-pipefish.citc.local]

TASK [set slurmd config directory permissions] *********************************************************************************************************************************************************************************************************************************
Friday 19 February 2021  15:59:20 +0000 (0:00:00.255)       0:00:38.051 *******
skipping: [mgmt.clever-pipefish.citc.local]

TASK [slurm : open all ports] **************************************************************************************************************************************************************************************************************************************************
Friday 19 February 2021  15:59:20 +0000 (0:00:00.042)       0:00:38.093 *******
changed: [mgmt.clever-pipefish.citc.local]

TASK [slurm : include_tasks] ***************************************************************************************************************************************************************************************************************************************************
Friday 19 February 2021  15:59:20 +0000 (0:00:00.765)       0:00:38.859 *******
included: /root/.ansible/pull/ip-10-0-16-0.eu-west-1.compute.internal/roles/slurm/tasks/elastic.yml for mgmt.clever-pipefish.citc.local

TASK [slurm : install common tools] ********************************************************************************************************************************************************************************************************************************************
Friday 19 February 2021  15:59:21 +0000 (0:00:00.072)       0:00:38.932 *******
changed: [mgmt.clever-pipefish.citc.local]

...
milliams commented 3 years ago

It is indeed strange that it would hang on calling firewall-cmd, especially with it happening twice in a row. Hopefully it runs to the end now.
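If it does get stuck there again, it would be worth checking at that point whether firewalld itself is responsive; these are just standard firewalld/systemd commands, nothing CitC-specific:

```
firewall-cmd --state                  # should print "running" almost instantly; a hang here implicates firewalld
systemctl status firewalld            # is the daemon active, or stuck activating?
journalctl -u firewalld --no-pager    # any firewalld errors around the time of the hang?
```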

boegel commented 3 years ago

All good now...

Any suggestions on how to debug this further if it occurs again?

milliams commented 3 years ago

One thing I've seen mentioned in my searches is available RAM. It's worth checking whether the management node is running low on memory; if so, we might need to bump up the instance type a notch.
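A quick way to check that next time it happens (again, just standard tools):

```
free -h                                    # current memory and swap usage
dmesg -T | grep -iE 'oom|out of memory'    # has the kernel OOM-killed anything?
vmstat 5                                   # watch for swapping while the playbook runs
```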

boegel commented 3 years ago

I'll close this for now; if it happens again I'll get back to it...

boegel commented 3 years ago

Opening this again, since the problem seems persistent...

I've started 3 clusters today, all showed the same problem: installation "hangs" at (or right after) "slurm : open all ports".

Last syslog entries related to ansible:

Feb 19 19:47:39 ip-10-0-70-26 platform-python[27532]: ansible-copy Invoked with src=/root/.ansible/tmp/ansible-tmp-1613764059.0596786-27517-206051842621501/source dest=/etc/slurm/slurmdbd.conf owner=slurm group=slurm mode=256 follow=False _original_basename=slurmdbd.conf.j2 checksum=153cbb7266311ee247d8e07a342de8f5c1665b73 backup=False force=True content=NOT_LOGGING_PARAMETER validate=None directory_mode=None remote_src=None local_follow=None seuser=None serole=None selevel=None setype=None attributes=None regexp=None delimiter=None unsafe_writes=None
Feb 19 19:47:39 ip-10-0-70-26 platform-python[27549]: ansible-stat Invoked with path=/etc/slurm/cgroup.conf follow=False get_checksum=True checksum_algorithm=sha1 get_md5=False get_mime=True get_attributes=True
Feb 19 19:47:39 ip-10-0-70-26 platform-python[27554]: ansible-copy Invoked with src=/root/.ansible/tmp/ansible-tmp-1613764059.565857-27539-164394628155763/source dest=/etc/slurm/cgroup.conf owner=slurm group=slurm mode=256 follow=False _original_basename=cgroup.conf.j2 checksum=d8c0923ce4d0c61ce36025522d610963e987e556 backup=False force=True content=NOT_LOGGING_PARAMETER validate=None directory_mode=None remote_src=None local_follow=None seuser=None serole=None selevel=None setype=None attributes=None regexp=None delimiter=None unsafe_writes=None
Feb 19 19:47:40 ip-10-0-70-26 platform-python[27563]: ansible-file Invoked with path=/var/log/slurm/ state=directory owner=slurm group=slurm mode=493 recurse=False force=False follow=True modification_time_format=%Y%m%d%H%M.%S access_time_format=%Y%m%d%H%M.%S _original_basename=None _diff_peek=None src=None modification_time=None access_time=None seuser=None serole=None selevel=None setype=None attributes=None content=NOT_LOGGING_PARAMETER backup=None remote_src=None regexp=None delimiter=None directory_mode=None unsafe_writes=None
Feb 19 19:47:40 ip-10-0-70-26 platform-python[27568]: ansible-file Invoked with path=/var/spool/slurm/ state=directory owner=slurm group=slurm mode=493 recurse=False force=False follow=True modification_time_format=%Y%m%d%H%M.%S access_time_format=%Y%m%d%H%M.%S _original_basename=None _diff_peek=None src=None modification_time=None access_time=None seuser=None serole=None selevel=None setype=None attributes=None content=NOT_LOGGING_PARAMETER backup=None remote_src=None regexp=None delimiter=None directory_mode=None unsafe_writes=None

The /var/spool/slurm/ directory was created, but somehow it's stuck after that?

@milliams Any idea how I can check whether it actually completed the firewalld configuration that it seems to be stuck on?
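(I assume something like the following would show whether that task's firewall changes actually landed, though I don't know exactly which ports/zone the role is supposed to open:)

```
firewall-cmd --state            # is firewalld responsive at all?
firewall-cmd --list-all         # ports/services currently open in the default zone
firewall-cmd --list-all-zones   # full picture across all zones
```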