boegel opened 3 years ago
The first thing to be aware of is that writing to the log file sometimes suffers from buffering issues, so the latest thing printed there is not necessarily the latest task that ran.
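If the buffering gets in the way, one workaround is to follow the task activity via syslog instead, since each Ansible module invocation is logged there as it starts (assuming systemd journald is available on the management node, as the syslog excerpts later in this thread suggest):
journalctl -f | grep -i ansible        # follow Ansible module invocations live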
The processes running there all make sense and I don't see any that would be likely to cause problems.
My two ideas for debugging it are:
1. lsof, to see if there's anything that gives a hint as to what is hanging (a sketch of what I mean follows below).
2. Running Ansible manually: sudo to root and, from root's home directory, run:
/usr/bin/ansible-pull --url=https://github.com/clusterinthecloud/ansible.git --checkout=6 --inventory=/root/hosts management.yml
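For the lsof idea, I mean something along these lines (the PID is a placeholder; these are standard pgrep/lsof invocations, nothing CitC-specific):
pgrep -af ansible        # find the PID of the stuck Ansible process
lsof -p <PID>            # list every file descriptor it holds
lsof -p <PID> -a -i      # restrict the listing to its network connections
Adding -vvv to the ansible-pull command above will also make Ansible print much more detail about each task, which can help narrow down where it stops.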
I have made some changes to the Ansible in the last few days but the tests I've run on Google and Oracle have worked without issue.
Thanks a lot for the quick feedback!
I checked with lsof, but couldn't seem to find any clues on what went wrong...
I restarted the Ansible playbook, and it's definitely progressing now; it's currently building the initial compute node image:
TASK [finalise : Wait for packer to finish]
I also see that the packer instance was started.
If I check where it was hanging previously, it seems like it didn't manage to get past the slurm : open all ports task for some reason, since it now indicates that changes were made there:
...
TASK [set slurm log directory permissions] *************************************************************************************************************************************************************************************************************************************
Friday 19 February 2021 15:59:19 +0000 (0:00:00.507) 0:00:37.562 *******
ok: [mgmt.clever-pipefish.citc.local]
TASK [set slurm spool directory permissions] ***********************************************************************************************************************************************************************************************************************************
Friday 19 February 2021 15:59:19 +0000 (0:00:00.233) 0:00:37.795 *******
ok: [mgmt.clever-pipefish.citc.local]
TASK [set slurmd config directory permissions] *********************************************************************************************************************************************************************************************************************************
Friday 19 February 2021 15:59:20 +0000 (0:00:00.255) 0:00:38.051 *******
skipping: [mgmt.clever-pipefish.citc.local]
TASK [slurm : open all ports] **************************************************************************************************************************************************************************************************************************************************
Friday 19 February 2021 15:59:20 +0000 (0:00:00.042) 0:00:38.093 *******
changed: [mgmt.clever-pipefish.citc.local]
TASK [slurm : include_tasks] ***************************************************************************************************************************************************************************************************************************************************
Friday 19 February 2021 15:59:20 +0000 (0:00:00.765) 0:00:38.859 *******
included: /root/.ansible/pull/ip-10-0-16-0.eu-west-1.compute.internal/roles/slurm/tasks/elastic.yml for mgmt.clever-pipefish.citc.local
TASK [slurm : install common tools] ********************************************************************************************************************************************************************************************************************************************
Friday 19 February 2021 15:59:21 +0000 (0:00:00.072) 0:00:38.932 *******
changed: [mgmt.clever-pipefish.citc.local]
...
It is indeed strange that it would hang on calling firewall-cmd, especially twice in a row. Hopefully it runs to the end now.
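If it does get stuck there again, attaching strace to the stuck process should at least show which system call it is blocked in (standard strace usage; that the stuck process would be firewall-cmd is my assumption based on the task name):
pgrep -af firewall-cmd       # find the stuck process, if any
strace -f -p <PID>           # show the syscall it is currently blocked in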
All good now...
Any suggestions on how to debug this further if it occurs again?
One thing I've seen mentioned in my searches is the issue of available RAM. It's worth checking whether the node is running low on memory; if it is, we might need to bump up the instance type a notch.
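For example, on the management node (standard tools, nothing CitC-specific):
free -h                                    # current memory and swap usage
dmesg -T | grep -iE 'oom|out of memory'    # any sign the OOM killer has run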
I'll close this for now; if it happens again I'll get back to it...
Opening this again, since the problem seems persistent...
I've started 3 clusters today, and all showed the same problem: the installation "hangs" at (or right after) "slurm : open all ports".
The last syslog entries related to ansible:
Feb 19 19:47:39 ip-10-0-70-26 platform-python[27532]: ansible-copy Invoked with src=/root/.ansible/tmp/ansible-tmp-1613764059.0596786-27517-206051842621501/source dest=/etc/slurm/slurmdbd.conf owner=slurm group=slurm mode=256 follow=False _original_basename=slurmdbd.conf.j2 checksum=153cbb7266311ee247d8e07a342de8f5c1665b73 backup=False force=True content=NOT_LOGGING_PARAMETER validate=None directory_mode=None remote_src=None local_follow=None seuser=None serole=None selevel=None setype=None attributes=None regexp=None delimiter=None unsafe_writes=None
Feb 19 19:47:39 ip-10-0-70-26 platform-python[27549]: ansible-stat Invoked with path=/etc/slurm/cgroup.conf follow=False get_checksum=True checksum_algorithm=sha1 get_md5=False get_mime=True get_attributes=True
Feb 19 19:47:39 ip-10-0-70-26 platform-python[27554]: ansible-copy Invoked with src=/root/.ansible/tmp/ansible-tmp-1613764059.565857-27539-164394628155763/source dest=/etc/slurm/cgroup.conf owner=slurm group=slurm mode=256 follow=False _original_basename=cgroup.conf.j2 checksum=d8c0923ce4d0c61ce36025522d610963e987e556 backup=False force=True content=NOT_LOGGING_PARAMETER validate=None directory_mode=None remote_src=None local_follow=None seuser=None serole=None selevel=None setype=None attributes=None regexp=None delimiter=None unsafe_writes=None
Feb 19 19:47:40 ip-10-0-70-26 platform-python[27563]: ansible-file Invoked with path=/var/log/slurm/ state=directory owner=slurm group=slurm mode=493 recurse=False force=False follow=True modification_time_format=%Y%m%d%H%M.%S access_time_format=%Y%m%d%H%M.%S _original_basename=None _diff_peek=None src=None modification_time=None access_time=None seuser=None serole=None selevel=None setype=None attributes=None content=NOT_LOGGING_PARAMETER backup=None remote_src=None regexp=None delimiter=None directory_mode=None unsafe_writes=None
Feb 19 19:47:40 ip-10-0-70-26 platform-python[27568]: ansible-file Invoked with path=/var/spool/slurm/ state=directory owner=slurm group=slurm mode=493 recurse=False force=False follow=True modification_time_format=%Y%m%d%H%M.%S access_time_format=%Y%m%d%H%M.%S _original_basename=None _diff_peek=None src=None modification_time=None access_time=None seuser=None serole=None selevel=None setype=None attributes=None content=NOT_LOGGING_PARAMETER backup=None remote_src=None regexp=None delimiter=None directory_mode=None unsafe_writes=None
The /var/spool/slurm/ directory was created, but somehow it's stuck after that?
@milliams Any idea how I can check whether it actually completed the firewalld configuration that it seems to be stuck on?
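In case it helps, I'd guess the standard firewall-cmd queries would show whether the change actually landed (assuming the task drives firewalld, as its use of firewall-cmd suggests):
firewall-cmd --state         # is firewalld running at all?
firewall-cmd --list-ports    # which ports have been opened in the active zone?
firewall-cmd --list-all      # full configuration of the active zone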
I've made two attempts this afternoon to create a new CitC cluster on AWS using the one-click installer, but for some reason the installation "hangs".
The management node is created, and I can SSH into it, but the finish command keeps producing this (with or without a limits.yaml file):
The last part in /root/ansible-pull.log is this:
That was over 1 hour ago, no progress since then...
/var/log/slurm exists, but it is entirely empty.
Running processes:
Any suggestions on how to figure out what went wrong?