Failing on three out of 6 nodes

juhoffma commented 8 years ago

Hi,

I keep seeing the failure below when trying to deploy a new instance. I tried it 3 times, and every time 3 nodes failed while the other three went fine:

PLAY [localhost] **************************************************************

TASK: [fail ] *****************************************************************
skipping: [localhost]

TASK: [add_host ] *************************************************************
ok: [localhost] => (item=openshift-master.demo.openshift.me)
ok: [localhost] => (item=openshift-node-infra-4ef832f3.demo.openshift.me)
ok: [localhost] => (item=openshift-node-demo-0bf933b6.demo.openshift.me)
ok: [localhost] => (item=openshift-node-demo-0af933b7.demo.openshift.me)
ok: [localhost] => (item=openshift-node-demo-05f933b8.demo.openshift.me)
ok: [localhost] => (item=openshift-node-demo-04f933b9.demo.openshift.me)
ok: [localhost] => (item=openshift-node-demo-07f933ba.demo.openshift.me)

PLAY [Register host(s)] *******************************************************

GATHERING FACTS ***************************************************************
fatal: [openshift-node-demo-05f933b8.demo.openshift.me] => SSH Error: unix_listener: "/Users/buddy/.ansible/cp/ec2-52-59-250-241.eu-central-1.compute.amazonaws.com-openshift.sO35AS7bDYJ6KTx0" too long for Unix domain socket
    while connecting to 52.59.250.241:22
It is sometimes useful to re-run the command using -vvvv, which prints SSH debug output to help diagnose the issue.
fatal: [openshift-master.demo.openshift.me] => SSH Error: unix_listener: "/Users/buddy/.ansible/cp/ec2-52-59-244-234.eu-central-1.compute.amazonaws.com-openshift.0nLGxkz7X4gv6nyk" too long for Unix domain socket
    while connecting to 52.59.244.234:22
It is sometimes useful to re-run the command using -vvvv, which prints SSH debug output to help diagnose the issue.
fatal: [openshift-node-demo-0bf933b6.demo.openshift.me] => SSH Error: unix_listener: "/Users/buddy/.ansible/cp/ec2-52-59-248-167.eu-central-1.compute.amazonaws.com-openshift.YdYRF1ek8IHPyhmj" too long for Unix domain socket
    while connecting to 52.59.248.167:22
It is sometimes useful to re-run the command using -vvvv, which prints SSH debug output to help diagnose the issue.
fatal: [openshift-node-infra-4ef832f3.demo.openshift.me] => SSH Error: unix_listener: "/Users/buddy/.ansible/cp/ec2-52-59-244-139.eu-central-1.compute.amazonaws.com-openshift.hyFP3VTwPBdNoPFj" too long for Unix domain socket
    while connecting to 52.59.244.139:22
It is sometimes useful to re-run the command using -vvvv, which prints SSH debug output to help diagnose the issue.
ok: [openshift-node-demo-07f933ba.demo.openshift.me]
ok: [openshift-node-demo-04f933b9.demo.openshift.me]
ok: [openshift-node-demo-0af933b7.demo.openshift.me]

TASK: [Register host] *********************************************************
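
The "too long for Unix domain socket" message is about the length of the control socket path, not the network. A minimal sketch of the check, assuming the macOS/BSD `sun_path` size of 104 bytes including the trailing NUL (Linux uses 108), applied to one of the paths from the log above. The shorter address in the second path is hypothetical and only illustrates why hosts with shorter public addresses might get through:

```python
# Size of sockaddr_un.sun_path, including the trailing NUL byte.
# Assumption: 104 is the macOS/BSD value; Linux uses 108.
SUN_PATH_SIZE = 104


def fits_unix_socket(path: str, sun_path_size: int = SUN_PATH_SIZE) -> bool:
    """Return True if the path plus its NUL terminator fits in sun_path."""
    return len(path.encode("utf-8")) + 1 <= sun_path_size


# Socket path copied from the SSH error in the log.
failing = (
    "/Users/buddy/.ansible/cp/"
    "ec2-52-59-250-241.eu-central-1.compute.amazonaws.com-openshift"
    ".sO35AS7bDYJ6KTx0"
)

# Hypothetical host with a shorter address (not taken from the log).
shorter = failing.replace("250-241", "25-24")

for path in (failing, shorter):
    print(len(path), fits_unix_socket(path))
```

The path from the log is exactly 104 characters, one more than fits, which would explain why only some hosts fail.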

juhoffma commented 8 years ago

Leading to the result:

PLAY [Configure etcd certificates] ********************************************
skipping: no hosts matched

PLAY [Configure etcd hosts] ***************************************************
skipping: no hosts matched

PLAY [Delete temporary directory on localhost] ********************************

TASK: [file name={{ g_etcd_mktemp.stdout }} state=absent] *********************
ok: [localhost]

PLAY [Configure nfs hosts] ****************************************************
skipping: no hosts matched

PLAY [Set master facts and determine if external etcd certs need to be generated] ***

TASK: [Check for RPM generated config marker file .config_managed] ************
FATAL: no hosts matched or all hosts have already failed -- aborting

PLAY RECAP ********************************************************************
Register host ---------------------------------------------------------- 66.06s
Create ec2 instance ---------------------------------------------------- 61.87s
docker | Install docker ------------------------------------------------ 60.51s
Wait for ssh ----------------------------------------------------------- 36.84s
openshift_common | Install the base package for versioning ------------- 24.38s
Subscribe only to the ose repo ----------------------------------------- 18.78s
Disable all known rhsm repos ------------------------------------------- 18.42s
os_firewall | Install iptables packages -------------------------------- 15.81s
Wait for user setup ---------------------------------------------------- 15.67s
os_firewall | need to pause here, otherwise the iptables service starting can sometimes cause ssh to fail -- 10.03s
           to retry, use: --limit @/Users/buddy/openshift_setup.retry

localhost                  : ok=29   changed=10   unreachable=0    failed=0
openshift-master.demo.openshift.me : ok=0    changed=0    unreachable=1    failed=0
openshift-node-demo-04f933b9.demo.openshift.me : ok=46   changed=17   unreachable=0    failed=0
openshift-node-demo-05f933b8.demo.openshift.me : ok=0    changed=0    unreachable=1    failed=0
openshift-node-demo-07f933ba.demo.openshift.me : ok=46   changed=17   unreachable=0    failed=0
openshift-node-demo-0af933b7.demo.openshift.me : ok=46   changed=17   unreachable=0    failed=0
openshift-node-demo-0bf933b6.demo.openshift.me : ok=0    changed=0    unreachable=1    failed=0
openshift-node-infra-4ef832f3.demo.openshift.me : ok=0    changed=0    unreachable=1    failed=0

juhoffma commented 8 years ago

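After reading through this Ansible GitHub issue, I found that setting control_path to the following value in ansible.cfg fixed the problem for me:

[ssh_connection]
ssh_args = -o ControlMaster=auto -o ControlPersist=900s
#control_path = %(directory)s/%%h-%%r
control_path = %(directory)s/%%C

%C expands to a hash of the connection parameters instead of the full hostname and user, so the socket path stays short no matter how long the EC2 hostname is.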
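
To illustrate why the hashed name helps, here is a rough sketch comparing the socket path lengths for the two control_path settings. It assumes that `%C` is a hex SHA-1 digest over the local host, remote host, port and user (newer OpenSSH releases may hash additional fields, so the digest itself will differ; only the length matters here). The local hostname `buddy-mac.local` is a made-up placeholder, and the 17-character suffix is the "." plus 16 random characters visible in the socket names in the log above.

```python
import hashlib

SUN_PATH_SIZE = 104   # assumption: macOS/BSD sun_path size, NUL included (108 on Linux)
TEMP_SUFFIX_LEN = 17  # "." plus 16 random characters, as in the logged socket names


def control_path_hash(local_host: str, remote_host: str, port: int, user: str) -> str:
    """Approximate %C: hex SHA-1 over the concatenated connection parameters."""
    data = f"{local_host}{remote_host}{port}{user}".encode("utf-8")
    return hashlib.sha1(data).hexdigest()


def socket_path_len(directory: str, name: str) -> int:
    """Length of directory/name plus the temporary suffix ssh appends."""
    return len(f"{directory}/{name}") + TEMP_SUFFIX_LEN


directory = "/Users/buddy/.ansible/cp"
host = "ec2-52-59-250-241.eu-central-1.compute.amazonaws.com"
user = "openshift"

old_name = f"{host}-{user}"  # %%h-%%r
new_name = control_path_hash("buddy-mac.local", host, 22, user)  # %%C (approximation)

for label, name in (("%h-%r", old_name), ("%C", new_name)):
    length = socket_path_len(directory, name)
    print(label, length, "ok" if length < SUN_PATH_SIZE else "too long")
```

With the hashed name the length no longer depends on the hostname at all, only on the directory.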

2015-Middleware-Keynote / demo-ansible, issue #80