clusterinthecloud / support

If you need help with Cluster in the Cloud, this is the right place

AWS Failures due to Ansible kill_all_nodes script #17

Closed: joshes closed this issue 4 years ago

joshes commented 4 years ago

Installing Slurm on AWS via Terraform, per the documentation.

The Ansible logs look fine except for the one entry below:

TASK [slurm : install kill_all_nodes script] ***********************************
Saturday 07 November 2020  20:35:04 +0000 (0:00:00.525)       0:04:04.230 *****
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: If you are using a module and expect the file to exist on the remote, see the remote_src option
fatal: [mgmt.driving-doe.citc.local]: FAILED! => changed=false
  msg: |-
    Could not find or access 'kill_all_nodes.py'
    Searched in:
            /root/.ansible/pull/ip-10-0-70-108.ec2.internal/roles/slurm/files/kill_all_nodes.py
            /root/.ansible/pull/ip-10-0-70-108.ec2.internal/roles/slurm/kill_all_nodes.py
            /root/.ansible/pull/ip-10-0-70-108.ec2.internal/roles/slurm/tasks/files/kill_all_nodes.py
            /root/.ansible/pull/ip-10-0-70-108.ec2.internal/roles/slurm/tasks/kill_all_nodes.py
            /root/.ansible/pull/ip-10-0-70-108.ec2.internal/files/kill_all_nodes.py
            /root/.ansible/pull/ip-10-0-70-108.ec2.internal/kill_all_nodes.py on the Ansible Controller.
    If you are using a module and expect the file to exist on the remote, see the remote_src option
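
For reference, the paths in the "Searched in:" list are Ansible's standard lookup order when a role task references a local file. A minimal sketch of such a task is below; the task name comes from the log, but the module, destination and mode are assumptions for illustration, not necessarily the project's actual task. The point is that kill_all_nodes.py has to exist in roles/slurm/files/ on the controller for the lookup to succeed.

# Hypothetical sketch only: with copy, a relative src is resolved against the
# role's files/ directory first, which matches the search order in the error above.
- name: install kill_all_nodes script
  copy:
    src: kill_all_nodes.py    # must exist in roles/slurm/files/ on the controller
    dest: /usr/local/sbin/kill_all_nodes.py
    mode: "0755"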

Additionally, I'm unable to run finish, because the file check at the top of the script never succeeds; the marker file is never created. I'm not sure if this is a side-effect of the above or not, but presumably it is.

import os

MGMT_HOSTNAME = "mgmt"

## This is NEVER created

if not os.path.isfile("/mnt/shared/finalised/" + MGMT_HOSTNAME):
    print('Error: The management node has not finished its setup')
    print('Please allow it to finish before continuing.')
    print('For information about why they have not finished, check the file /root/ansible-pull.log')
    exit(1)

My ansible_branch is set to 6, in case that has any impact.

christopheredsall commented 4 years ago

Thanks very much, @joshes, for logging this issue and providing a PR.

I can reproduce.

I'm not sure if this is a side-effect of the above or not

Yes, it is a side effect: the Ansible run didn't succeed, so the file didn't get created.
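
As a rough illustration of that dependency: the marker that finish checks for would only be written by a task near the end of a successful run, along the lines of the hypothetical sketch below (the module and the exact path are assumptions, not the project's actual task).

# Hypothetical sketch: a marker like the one the finish script checks for is only
# touched once the management play completes, so a failed run never creates it.
- name: mark management node as finalised
  file:
    path: /mnt/shared/finalised/{{ ansible_hostname }}
    state: touch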