clusterinthecloud / ansible

Ansible config for Cluster in the Cloud
https://cluster-in-the-cloud.readthedocs.io
MIT License

Ansible pull fails on compute node - work tree dir File exists #76

Open christopheredsall opened 4 years ago

christopheredsall commented 4 years ago

This occurs on a newly built cluster using ACRC/citc-terraform@e3134045454004af0e51932ebf214853eb93461d with the default "4" branch of ACRC/slurm-ansible-playbook.

Submitting a job to start the node results in the following /root/ansible-pull.log:

Starting Ansible Pull at 2020-06-13 16:27:40
/usr/bin/ansible-pull --url=https://github.com/ACRC/slurm-ansible-playbook.git --checkout=4 --inventory=/root/hosts compute.yml
 [WARNING]: Platform linux on host vm-gpu3-2-ad2-0001 is using the discovered
Python interpreter at /usr/bin/python, but future installation of another
Python interpreter could change this. See https://docs.ansible.com/ansible/2.8/
reference_appendices/interpreter_discovery.html for more information.
vm-gpu3-2-ad2-0001 | FAILED! => {
    "ansible_facts": {
        "discovered_interpreter_python": "/usr/bin/python"
    }, 
    "changed": false, 
    "cmd": "/usr/bin/git clone --origin origin https://github.com/ACRC/slurm-ansible-playbook.git /root/.ansible/pull/vm-gpu3-2-ad2-0001.subnet.clustervcn.oraclevcn.com", 
    "msg": "fatal: could not create work tree dir '/root/.ansible/pull/vm-gpu3-2-ad2-0001.subnet.clustervcn.oraclevcn.com'.: File exists", 
    "rc": 128, 
    "stderr": "fatal: could not create work tree dir '/root/.ansible/pull/vm-gpu3-2-ad2-0001.subnet.clustervcn.oraclevcn.com'.: File exists\n", 
    "stderr_lines": [
        "fatal: could not create work tree dir '/root/.ansible/pull/vm-gpu3-2-ad2-0001.subnet.clustervcn.oraclevcn.com'.: File exists"
    ], 
    "stdout": "", 
    "stdout_lines": []
}
 [WARNING]: Your git version is too old to fully support the depth argument.
Falling back to full checkouts.
 [WARNING]: Platform linux on host vm-
gpu3-2-ad2-0001.subnet.clustervcn.oraclevcn.com is using the discovered Python
interpreter at /usr/bin/python, but future installation of another Python
interpreter could change this. See https://docs.ansible.com/ansible/2.8/referen
ce_appendices/interpreter_discovery.html for more information.
vm-gpu3-2-ad2-0001.subnet.clustervcn.oraclevcn.com | CHANGED => {
    "after": "fe5fc5bb46fec69c6db9465782793f312134a3f5", 
    "ansible_facts": {
        "discovered_interpreter_python": "/usr/bin/python"
    }, 
    "before": null, 
    "changed": true
}

christopheredsall commented 4 years ago

Indeed, the directory exists:

[root@vm-gpu3-2-ad2-0001 ~]# ls -ld /root/.ansible/pull/vm-gpu3-2-ad2-0001.subnet.clustervcn.oraclevcn.com
drwxr-xr-x. 6 root root 4096 Jun 13 16:27 /root/.ansible/pull/vm-gpu3-2-ad2-0001.subnet.clustervcn.oraclevcn.com

I moved it aside and re-ran the pull:

[root@vm-gpu3-2-ad2-0001 ~]# mv /root/.ansible/pull/vm-gpu3-2-ad2-0001.subnet.clustervcn.oraclevcn.com /root/.ansible/pull/BROKEN-vm-gpu3-2-ad2-0001.subnet.clustervcn.oraclevcn.com
[root@vm-gpu3-2-ad2-0001 ~]# /usr/bin/ansible-pull --url=https://github.com/ACRC/slurm-ansible-playbook.git --checkout=4 --inventory=/root/hosts compute.yml

This results in exactly the same error and log output.

christopheredsall commented 4 years ago

sosreport-mgmt-76-2020-06-13-cbtjypk.tar.gz

milliams commented 4 years ago

I've seen this before and I'm still not sure what causes it. It seems that the cloud-init script is sometimes started twice.

With the new work to pre-generate images it will be less of a problem, but adding a file lock to prevent the race condition could help too.
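
For illustration, here is a rough sketch of what such a lock could look like. It assumes the double start is the real cause, which is not confirmed, and that `flock` from util-linux is available on the node. The function name `run_locked` and the lock path in the usage comment are hypothetical and are not part of the playbook.

```shell
#!/usr/bin/env bash
# Sketch only: run a command under an exclusive, non-blocking file lock so
# that a second concurrent invocation skips instead of racing the first.
# run_locked is a hypothetical helper name, not something in the repository.
#
# Usage: run_locked LOCKFILE COMMAND [ARGS...]
run_locked() {
    local lockfile="$1"
    shift
    (
        # File descriptor 9 stays open on the lock file for the lifetime of
        # this subshell; the lock is released when the subshell exits.
        if ! flock -n 9; then
            echo "another run holds ${lockfile}; skipping" >&2
            exit 75
        fi
        "$@"
    ) 9>"$lockfile"
}

# Hypothetical usage in the script that cloud-init runs (lock path assumed):
#   run_locked /var/lock/ansible-pull.lock \
#       /usr/bin/ansible-pull --url=https://github.com/ACRC/slurm-ansible-playbook.git \
#       --checkout=4 --inventory=/root/hosts compute.yml
```

With `-n` the second invocation gives up immediately and returns 75 rather than waiting. Dropping `-n` would instead make it wait for the first run to finish, which may be preferable if the second start should still converge the node.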