2015-Middleware-Keynote / demo-ansible

after complete run, can't run again due to ssh errors #44

Closed thoraxe closed 9 years ago

thoraxe commented 9 years ago
GATHERING FACTS *************************************************************** 
fatal: [cluster_hosts] => SSH Error: ssh: Could not resolve hostname cluster_hosts: Name or service not known
It is sometimes useful to re-run the command using -vvvv, which prints SSH debug output to help diagnose the issue.

PLAY [localhost] ************************************************************** 

TASK: [fail ] ***************************************************************** 
skipping: [localhost]

TASK: [add_host ] ************************************************************* 
ok: [localhost] => (item=cluster_hosts)

PLAY [Post register host(s)] ************************************************** 

GATHERING FACTS *************************************************************** 
FATAL: no hosts matched or all hosts have already failed -- aborting

TASK: [Enable rhui extras channel] ******************************************** 
FATAL: no hosts matched or all hosts have already failed -- aborting

PLAY RECAP ******************************************************************** 
           to retry, use: --limit @/home/ec2-user/openshift_setup.retry

cluster_hosts              : ok=0    changed=0    unreachable=1    failed=0
localhost                  : ok=9    changed=1    unreachable=0    failed=0   

After a complete run that goes all the way through, re-running the playbook fails with the output above. I'm not sure whether this is related to our nuking the inventory cache or something else.
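One detail worth noting in the output above: the add_host task reports (item=cluster_hosts), i.e. the literal string cluster_hosts is what gets registered as a "host", which would explain why the next play tries to SSH to a hostname called cluster_hosts. Here is a minimal sketch of that pattern with hypothetical variable and group names (this is a plausible reading, not the repo's actual playbook):

- hosts: localhost
  connection: local
  gather_facts: no
  tasks:
    # Register each provisioned instance into an in-memory group so that
    # later plays can target it. If the loop source is undefined or never
    # expands, a bare string can end up registered as the "hostname".
    - name: add provisioned instances to the cluster group
      add_host: name={{ item }} groups=cluster_hosts
      with_items: cluster_host_names   # hypothetical variable

# A later play targets the group; every member must be a resolvable host.
- hosts: cluster_hosts
  tasks:
    - name: verify SSH connectivity to each member
      ping:

If the loop source expands to real instance names, the second play SSHes to actual machines; if it collapses to the bare string, you get exactly the "Could not resolve hostname cluster_hosts" failure shown above.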

thoraxe commented 9 years ago

So, looking at the error in more detail, we fail in the playbooks/util_playbooks/register_host.yml file on the "Register host(s)" play:

PLAY [Register host(s)] ******************************************************* 

GATHERING FACTS *************************************************************** 
fatal: [cluster_hosts] => SSH Error: ssh: Could not resolve hostname cluster_hosts: Name or service not known
It is sometimes useful to re-run the command using -vvvv, which prints SSH debug output to help diagnose the issue.

-vvvv doesn't give us much:

PLAY [Register host(s)] ******************************************************* 

GATHERING FACTS *************************************************************** 
<cluster_hosts> ESTABLISH CONNECTION FOR USER: openshift
<cluster_hosts> REMOTE_MODULE setup
<cluster_hosts> EXEC ssh -C -tt -vvv -o ControlMaster=auto -o ControlPersist=900s -o ControlPath="/home/ec2-user/.ansible/cp/%h-%r" -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=openshift -o ConnectTimeout=10 cluster_hosts /bin/sh -c 'mkdir -p $HOME/.ansible/tmp/ansible-tmp-1439914223.44-78403987537528 && chmod a+rx $HOME/.ansible/tmp/ansible-tmp-1439914223.44-78403987537528 && echo $HOME/.ansible/tmp/ansible-tmp-1439914223.44-78403987537528'
fatal: [cluster_hosts] => SSH Error: ssh: Could not resolve hostname cluster_hosts: Name or service not known
It is sometimes useful to re-run the command using -vvvv, which prints SSH debug output to help diagnose the issue.

It looks like cluster_hosts isn't getting expanded into real hostnames, but I don't see where that behavior changed. I think the hosts aren't actually making it into the ec2 inventory cache, perhaps because of the early failure during the part where demo-ansible runs openshift-ansible, or something along those lines.
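For context on the cache theory: the stock ec2.py dynamic inventory caches its AWS query results on disk, controlled by two settings in ec2.ini (the values below are illustrative, not this repo's actual configuration):

[ec2]
# Where ec2.py stores the cached inventory between runs.
cache_path = ~/.ansible/tmp

# Seconds before the cache is considered stale and AWS is re-queried;
# 0 effectively disables caching. Illustrative value.
cache_max_age = 300

If that cache is wiped or goes stale between runs, and the subsequent AWS query comes back empty or incomplete, a group like cluster_hosts no longer expands to real instance names.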

thoraxe commented 9 years ago

Is this fixed by the ec2.ini stuff?

detiber commented 9 years ago

@thoraxe, yes, this is fixed by the ec2.ini changes.
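For anyone landing here later: a quick way to rule stale inventory in or out is to force ec2.py to rebuild its cache before re-running the playbook (the script path is assumed; adjust to wherever ec2.py lives in your checkout):

./ec2.py --refresh-cache

The --refresh-cache flag re-queries AWS and rewrites the cache file, so group membership reflects the instances that actually exist.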