Open christopheredsall opened 4 years ago
Indeed, the directory exists:
[root@vm-gpu3-2-ad2-0001 ~]# ls -ld /root/.ansible/pull/vm-gpu3-2-ad2-0001.subnet.clustervcn.oraclevcn.com
drwxr-xr-x. 6 root root 4096 Jun 13 16:27 /root/.ansible/pull/vm-gpu3-2-ad2-0001.subnet.clustervcn.oraclevcn.com
Moving it aside and re-pulling:
[root@vm-gpu3-2-ad2-0001 ~]# mv /root/.ansible/pull/vm-gpu3-2-ad2-0001.subnet.clustervcn.oraclevcn.com /root/.ansible/pull/BROKEN-vm-gpu3-2-ad2-0001.subnet.clustervcn.oraclevcn.com
[root@vm-gpu3-2-ad2-0001 ~]# /usr/bin/ansible-pull --url=https://github.com/ACRC/slurm-ansible-playbook.git --checkout=4 --inventory=/root/hosts compute.yml
This results in exactly the same error and log output.
I've seen this before and I'm still not sure what causes it. It seems that the cloud-init script is sometimes started twice.
With the new work to pre-generate images, this will be less of a problem, but adding a file lock to prevent the race condition could help too.
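For what it's worth, here is a rough sketch of what the file-lock idea might look like, assuming the `flock` utility from util-linux is available on the image. The helper name `run_locked` and the lock path in the usage comment are made up for illustration; nothing like this exists in the playbook as far as I know. With `-n`, a second invocation fails immediately instead of racing the first one.

```shell
#!/bin/bash
# Hypothetical helper (not part of the playbook): run a command while holding
# an exclusive lock on a lock file, and refuse to run if the lock is taken.
run_locked() {
    local lockfile="$1"
    shift
    # -n: do not wait for the lock; return non-zero if another process holds it.
    flock -n "$lockfile" "$@"
}

# Possible use in the cloud-init script (the lock path is an assumption):
# run_locked /var/lock/ansible-pull.lock \
#     /usr/bin/ansible-pull --url=https://github.com/ACRC/slurm-ansible-playbook.git \
#     --checkout=4 --inventory=/root/hosts compute.yml
```

If the script really is started twice, the second copy would exit straight away rather than run `ansible-pull` concurrently. Whether it should instead wait for the first run to finish (drop `-n`) is a separate design choice.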
On a newly built cluster using ACRC/citc-terraform@e3134045454004af0e51932ebf214853eb93461d with the default "4" branch of ACRC/slurm-ansible-playbook, submitting a job to start the node results in the following:
/root/ansible-pull.log