I have a similar issue; not sure if it's related to the same underlying issue though. We're running ansible in the cloud via a bastion host, and are seeing a similar timeout issue. We see the following debug output for a task:
morganj@AMS-02-DHI-0003> cat ansible-debug.out| grep "51595" ~/dev/ceng-devtest
51595 1489493352.98877: running TaskExecutor() for ceng-gcp-data-platform-airflow/TASK: rsyslog : install
51595 1489493352.98947: in run()
51595 1489493352.98996: calling self._execute()
51595 1489493353.00023: Loading Connection 'ssh' from /usr/lib/python2.7/site-packages/ansible/plugins/connection/ssh.py
51595 1489493353.00743: Loading ShellModule 'csh' from /usr/lib/python2.7/site-packages/ansible/plugins/shell/csh.py
51595 1489493353.00796: Loading ShellModule 'fish' from /usr/lib/python2.7/site-packages/ansible/plugins/shell/fish.py
51595 1489493353.00856: Loading ShellModule 'powershell' from /usr/lib/python2.7/site-packages/ansible/plugins/shell/powershell.py
51595 1489493353.00890: Loading ShellModule 'sh' from /usr/lib/python2.7/site-packages/ansible/plugins/shell/sh.py
51595 1489493353.00960: Loading ShellModule 'sh' from /usr/lib/python2.7/site-packages/ansible/plugins/shell/sh.py (found_in_cache=True, class_only=False)
51595 1489493353.00983: in VariableManager get_vars()
51595 1489493353.01682: done with get_vars()
51595 1489493353.01729: Loading ActionModule 'normal' from /usr/lib/python2.7/site-packages/ansible/plugins/action/normal.py
51595 1489493353.01748: starting attempt loop
51595 1489493353.01761: running the handler
51595 1489493353.01811: ANSIBALLZ: using cached module: /home/morganj/.ansible/tmp/ansible-local-50907dBX0ja/ansiballz_cache/yum-ZIP_DEFLATED
51595 1489493353.01911: _low_level_execute_command(): starting
51595 1489493353.01928: _low_level_execute_command(): executing: /bin/sh -c '( umask 77 && mkdir -p "` echo $HOME/.ansible/tmp/ansible-tmp-1489493353.02-200902091435823 `" && echo ansible-tmp-1489493353.02-200902091435823="` echo $HOME/.ansible/tmp/ansible-tmp-1489493353.02-200902091435823 `" ) && sleep 0'
51595 1489493353.15604: stderr chunk (state=2):
51595 1489493353.46941: stderr chunk (state=3):
51595 1489493353.82566: stdout chunk (state=3):
51595 1489493353.82614: stdout chunk (state=3):
51595 1489493353.82642: stderr chunk (state=3):
51595 1489493353.82657: stderr chunk (state=3):
51595 1489493353.82684: _low_level_execute_command() done: rc=0, stdout=ansible-tmp-1489493353.02-200902091435823=/home/cengtest-sshadmin/.ansible/tmp/ansible-tmp-1489493353.02-200902091435823
51595 1489493353.82703: transferring module to remote /home/cengtest-sshadmin/.ansible/tmp/ansible-tmp-1489493353.02-200902091435823/yum.py
51595 1489493353.93274: stderr chunk (state=2):
51595 1489493354.23288: stderr chunk (state=3):
51595 1489493354.68999: stdout chunk (state=3):
51595 1489493354.69028: stderr chunk (state=3):
51595 1489493354.69044: stderr chunk (state=3):
51595 1489493354.69075: done transferring module to remote
51595 1489493354.69108: _low_level_execute_command(): starting
51595 1489493354.69123: _low_level_execute_command(): executing: /bin/sh -c 'chmod u+x /home/cengtest-sshadmin/.ansible/tmp/ansible-tmp-1489493353.02-200902091435823/ /home/cengtest-sshadmin/.ansible/tmp/ansible-tmp-1489493353.02-200902091435823/yum.py && sleep 0'
51595 1489493354.79344: stderr chunk (state=2):
51595 1489493355.09285: stderr chunk (state=3):
51595 1489493355.45604: stdout chunk (state=3):
51595 1489493355.45642: stderr chunk (state=3):
51595 1489493355.45656: stderr chunk (state=3):
51595 1489493355.45681: _low_level_execute_command() done: rc=0, stdout=, stderr=Warning: Permanently added '104.199.104.71' (ECDSA) to the list of known hosts.
51595 1489493355.45704: _low_level_execute_command(): starting
51595 1489493355.45716: _low_level_execute_command(): using become for this command
51595 1489493355.45741: _low_level_execute_command(): executing: /bin/sh -c 'sudo -H -S -u root /bin/sh -c '"'"'echo BECOME-SUCCESS-fouducionjbihlnkubdbuhrzprvnnfxq; https_proxy=https://10.0.0.5:3128 http_proxy=http://10.0.0.5:3128 /usr/bin/python /home/cengtest-sshadmin/.ansible/tmp/ansible-tmp-1489493353.02-200902091435823/yum.py; rm -rf "/home/cengtest-sshadmin/.ansible/tmp/ansible-tmp-1489493353.02-200902091435823/" > /dev/null 2>&1'"'"' && sleep 0'
51595 1489493355.46218: Initial state: awaiting_escalation: BECOME-SUCCESS-fouducionjbihlnkubdbuhrzprvnnfxq
51595 1489493355.56544: stderr chunk (state=1):
51595 1489493355.87374: stderr chunk (state=1):
51595 1489493356.23825: stdout chunk (state=1):
51595 1489493356.23855: become_success: (source=stdout, state=awaiting_escalation): 'BECOME-SUCCESS-fouducionjbihlnkubdbuhrzprvnnfxq'
51595 1489493356.23866: Escalation succeeded
This is a yum task to install rsyslog. If you need any more information, or think I should raise another issue, then please let me know. Also happy to do more debugging if necessary.
Cheers
Ansible 2.2.2.0 is also affected, also via a bastion host.
Problematic task:

- name: Execute script
  command: chdir=/tmp sh script.sh

The script itself:
#!/bin/bash
PARTITION="/dev/xvdf1"
PARTITION_NR="${PARTITION:~0}"   # last character of $PARTITION (the partition number)
MOUNT_DRIVE="${PARTITION%?}"     # $PARTITION with the last character stripped (the bare device)
MOUNT_DIR="/opt"
if [ ! -b $PARTITION ]
then
    parted $MOUNT_DRIVE --script mklabel gpt mkpart product -- 0 -1
    mkfs.ext4 $PARTITION
    echo "$PARTITION $MOUNT_DIR ext4 defaults 1 3" >> /etc/fstab
    mkdir -p $MOUNT_DIR
    mount $MOUNT_DIR
    chmod o+w $MOUNT_DIR
else
    echo "$PARTITION exists"
fi
When I run it manually, everything is okay.
IMO this is needed in general, in the sense that if I'm running a playbook against many machines and one hangs, the entire play is ruined. A forced timeout would allow the rest of the hosts to succeed, and then I could go back and manually debug the failed server.
I can't tell you how frustrating it is when I have to kill a playbook for this very reason. For example, right now I'm running a `yum clean all` in a playbook. All 500 servers except one ran the command OK. However, the other server doesn't even have the yum process running! I can't kill it even manually on the remote machine! Also, even terminating the ssh connection does not force ansible to continue. I have to take that host out of my inventory in order to have the playbook run successfully for all of the other hosts.
Should one hung host determine the outcome of the playbook for the rest of them? I don't think so.
@infrascripting Is using `strategy: free` an option for you? http://docs.ansible.com/ansible/latest/playbooks_strategies.html
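A minimal sketch of what that looks like in a play (the task itself is just a placeholder):

- hosts: all
  strategy: free
  tasks:
    - name: Each host runs ahead at its own pace instead of waiting at task boundaries
      ping:

Note that the free strategy only stops a hung host from blocking the others; the hung host's own fork still sits there forever.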
I was not aware of `strategy: free`; however, I don't think it'd be feasible in this instance.
Primarily because if I don't have all the servers finish the playbook, there will be no log file written (callback plugin), which is required for auditing purposes. Additionally, it would be more of a hassle to debug where a specific host failed if I have to cross-reference every host with where it ended up, as opposed to every host sitting at the task that is hung.
Ultimately, a timeout serves both these purposes by letting the server fail if a command does not execute correctly. In a similar instance to the above, the `yum clean all` process does exist; however, it is hung. Killing it on the remote machine returns a 'FAILED' code to ansible, and lets it continue on with the rest of the play.
In almost every case where a bug report is filed saying that ansible hangs "seemingly forever", there exists the possibility of this timeout returning valuable debugging information during the run, or information that is stored in logs (callback plugins).
A cursory search for "hangs" revealed:
the error happened again and ansible hanged indefinitely
-- https://github.com/ansible/ansible/issues/24109
There it hangs, seemingly forever.
-- https://github.com/ansible/ansible/issues/16894
This is a major inconvenience when working with Ansible, because Ansible hangs until the remote machine closes the connection, without error.
-- https://github.com/ansible/ansible/issues/10874
It ll be nice to have an option to specify a hard timeout of the script. That way, you can avoid issues when the script hangs and get stuck forever.
-- https://github.com/ansible/ansible/issues/23676
I keep beating my head against the wall - a few days ago, ansible stopped being able to connect to one of my boxes. It just hangs after the output below.
-- https://github.com/ansible/ansible/issues/17273
The machines just hang on gathering facts [...] I have no idea how long it had been running, could have been weeks or longer.
-- https://github.com/ansible/ansible/issues/24536
EXPECTED RESULTS
Either the copy succeeds or it fails
ACTUAL RESULTS
The copy is stuck. I have to CTRL C to exit.
Also, as a side-effect, a session is left running on my machine:
I have to kill this "[mux]" process for any subsequent ansible command to succeed, otherwise it stays stuck at the SSH connection...
-- https://github.com/ansible/ansible/issues/16929
I know I am not the first to report this, but I have not found any conclusive solution to the issue. Basically during 'gathering facts' ansible will hang on a single host, while executing setup via python on the target node.
Still no solution in Ansible 2.4?
Any update on this? I also experience random hanging tasks, especially with yum and package installs. Running on 2.4.2 at the moment.
I hit the same issue as well: an ansible task hung on "sftp a file to remote"; the file was created but its size was stuck at 0.
Worked around it by reducing the interface MTU size from 9000 to 1500.

Worked around it by reducing the interface MTU size from 9000 to 1500.

I had this problem, and resolved it thanks to your comment. Thank you so much.
CentOS 6.9
Ansible 2.4.2
Python 2.6.6
I'll just leave this here.
I am facing the same issue while executing a bash script from ansible. Any solutions for this problem?
I started getting this issue ~3 weeks ago. I "solved" it by setting
ssh_args = -o ControlPath=/var/shm/control:%h:%p:%r
in my ansible.cfg. It seems that there is an issue with the socket the ControlMaster feature of ssh creates. Upon creation on my regular ext4 filesystem, it blocked ansible.
Generally, here is a list of things to check (from the different issues on ansible hanging; almost all have occurred to me in the past):
- the `transport` option, to check if there is an ssh configuration issue on your local machine
- the `ssh_args` option, or delete the ~/.ansible folder on your local host
- whether anything (see `dmesg`) freezes your session

Hit the same issue while using the copy module, ansible 2.6.
+1
Well, I can understand how +1 can be annoying, but not as annoying as no proper response about an issue in over 2 years, and 6 major versions later.
I am dealing with this issue. I also had the issue in https://github.com/ansible/ansible/issues/14426, and then it didn't read in the group vars, so I used inventory_hostname instead and continued. Then I removed gather_facts, and I got a freeze instead of the privilege error.
For me, Ansible freezes at:

- raw: '/tmp/ansible.Dr1LTB.build/run doSomething'
  args:
    chdir: "/tmp/ansible.Dr1LTB.build/"

where the following works normally(!):

- command: 'npx run doSomething'
  args:
    chdir: "/tmp/ansible.Dr1LTB.build/"
Both work in the same way when I run them manually.
No logs, no error messages, nothing; kinda embarrassing. The freezing of a low-level command like `raw` where the high-level `script` performs fine is quite weird.
@mogthesprog: I know it's been almost 2 years. However: do you remember finding a solution to this? I'm having a similar problem which stops at the same spot with yum (https://github.com/ansible/ansible/issues/51284).
Looks like someone patched ansible 2.8: https://github.com/ansible/ansible/issues/47210
In my case, I added this statement to my ansible.cfg:
ssh_args = " -o StrictHostKeyChecking=no"
Without thinking that this would override the default ssh_args options. So, checking /etc/ansible/ansible.cfg, I found the defaults and added them too. Like magic, it stopped hanging. :)
Final setting in my ansible.cfg:
ssh_args = "-C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no"
I spoke too soon - that change got me a little further along but it's still hanging
I was having this issue with the `get_url` module and the `file` module in Ansible 2.7.8. I noticed the pattern that this was failing when I defined a file mode but did not explicitly define the owner and group. Explicitly setting the `owner:` and `group:` attributes to their appropriate values resolved the issue for me.
Example:

- name: Download oh-my-zsh
  get_url:
    url: https://raw.github.com/robbyrussell/oh-my-zsh/master/tools/install.sh
    dest: /tmp/install-zsh.sh
    mode: 0755

results in the waiting pattern, but

- name: Download oh-my-zsh
  get_url:
    url: https://raw.github.com/robbyrussell/oh-my-zsh/master/tools/install.sh
    dest: /tmp/install-zsh.sh
    owner: vagrant
    group: vagrant
    mode: 0755

does not result in waiting.
Several of my boxes still hang. I love Ansible but I have to resolve this or it's just not usable. A little more info...
It appears to be related to privilege escalation, at least in my case. If I ssh manually to server1, sudo to root, and rerun ansible with the -b, before the sudo cached password times out, it works. So it doesn't seem to recognize the sudo password prompt on all boxes.
ansible -vvv server1 -m ping # succeeds
ansible -vvv -b server1 -m ping # hangs at the sudo attempt
# unless I first ssh and sudo manually
Other data points:
Some output from the command, "ansible -vvvv -b server1 -m ping"
...
<server1> ESTABLISH SSH CONNECTION FOR USER: d_lewis
<server1> SSH: EXEC sshpass -d10 ssh -vvv -o ServerAliveInterval=30 -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o 'User="d_lewis"' -o ConnectTimeout=10 -o ControlPath=/home/dlewis/.ansible/cp/51303f75b6 server1 '/bin/sh -c '"'"'chmod u+x /home/d_lewis/.ansible/tmp/ansible-tmp-1559842970.91-113507000465060/ /home/d_lewis/.ansible/tmp/ansible-tmp-1559842970.91-113507000465060/AnsiballZ_ping.py && sleep 0'"'"''
<server1> (0, '', 'OpenSSH_7.4p1, OpenSSL 1.0.2k-fips 26 Jan 2017\r\ndebug1: Reading configuration data /home/dlewis/.ssh/config\r\ndebug1: /home/dlewis/.ssh/config line 58: Applying options for *\r\ndebug1: Reading configuration data /etc/ssh/ssh_config\r\ndebug1: /etc/ssh/ssh_config line 58: Applying options for *\r\ndebug1: auto-mux: Trying existing master\r\ndebug2: fd 3 setting O_NONBLOCK\r\ndebug2: mux_client_hello_exchange: master version 4\r\ndebug3: mux_client_forwards: request forwardings: 0 local, 0 remote\r\ndebug3: mux_client_request_session: entering\r\ndebug3: mux_client_request_alive: entering\r\ndebug3: mux_client_request_alive: done pid = 26079\r\ndebug3: mux_client_request_session: session request sent\r\ndebug1: mux_client_request_session: master session id: 2\r\ndebug3: mux_client_read_packet: read header failed: Broken pipe\r\ndebug2: Received exit status from master 0\r\n')
<server1> ESTABLISH SSH CONNECTION FOR USER: d_lewis
<server1> SSH: EXEC sshpass -d10 ssh -vvv -o ServerAliveInterval=30 -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o 'User="d_lewis"' -o ConnectTimeout=10 -o ControlPath=/home/dlewis/.ansible/cp/51303f75b6 -tt server1 '/bin/sh -c '"'"'dzdo -H -S -p "[dzdo via ansible, key=ujejcfadoztrfdckegqhcleymbhirwew] password:" -u root /bin/sh -c '"'"'"'"'"'"'"'"'echo BECOME-SUCCESS-ujejcfadoztrfdckegqhcleymbhirwew ; /usr/bin/python /home/d_lewis/.ansible/tmp/ansible-tmp-1559842970.91-113507000465060/AnsiballZ_ping.py'"'"'"'"'"'"'"'"' && sleep 0'"'"''
Hangs there for several minutes - I assume until the sudo/dzdo prompt times out.
For my case, it hangs till timeout.
vincent@vincentMBP # ansible -vvvv serverc -m ping
ansible 2.8.0
config file = /Users/vincent/tansible/ansible.cfg
configured module search path = ['/Users/vincent/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
ansible python module location = /usr/local/Cellar/ansible/2.8.0/libexec/lib/python3.7/site-packages/ansible
executable location = /usr/local/bin/ansible
python version = 3.7.3 (default, Mar 27 2019, 09:23:15) [Clang 10.0.1 (clang-1001.0.46.3)]
Using /Users/vincent/tansible/ansible.cfg as config file
setting up inventory plugins
Parsed /Users/vincent/tansible/host.txt inventory source with ini plugin
Loading callback plugin minimal of type stdout, v2.0 from /usr/local/Cellar/ansible/2.8.0/libexec/lib/python3.7/site-packages/ansible/plugins/callback/minimal.py
META: ran handlers
<serverc> ESTABLISH SSH CONNECTION FOR USER: root
<serverc> SSH: EXEC ssh -vvv -C -o ControlMaster=auto -o ControlPersist=60s -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="root"' -o ConnectTimeout=10 -o ControlPath=/Users/vincent/.ansible/cp/b17d87e4d6 serverc '/bin/sh -c '"'"'echo ~root && sleep 0'"'"''
<serverc> (0, b'/root\r\n', b'OpenSSH_7.9p1, LibreSSL 2.7.3\r\ndebug1: Reading configuration data /Users/vincent/.ssh/config\r\ndebug1: /Users/vincent/.ssh/config line 135: Applying options for serverc\r\ndebug1: /Users/vincent/.ssh/config line 183: Applying options for *\r\ndebug1: Reading configuration data /etc/ssh/ssh_config\r\ndebug1: /etc/ssh/ssh_config line 48: Applying options for *\r\ndebug1: auto-mux: Trying existing master\r\ndebug1: Control socket "/Users/vincent/.ansible/cp/b17d87e4d6" does not exist\r\ndebug2: resolving "serverc" port 22\r\ndebug2: ssh_connect_direct\r\ndebug1: Connecting to serverc [192.168.3.212] port 22.\r\ndebug2: fd 5 setting O_NONBLOCK\r\ndebug1: fd 5 clearing O_NONBLOCK\r\ndebug1: Connection established.\r\ndebug3: timeout: 9952 ms remain after connect\r\ndebug1: identity file /Users/vincent/.ssh/id_rsa type 0\r\ndebug1: identity file /Users/vincent/.ssh/id_rsa-cert type -1\r\ndebug1: identity file /Users/vincent/.ssh/id_dsa type -1\r\ndebug1: identity file /Users/vincent/.ssh/id_dsa-cert type -1\r\ndebug1: identity file /Users/vincent/.ssh/id_ecdsa type -1\r\ndebug1: identity file /Users/vincent/.ssh/id_ecdsa-cert type -1\r\ndebug1: identity file /Users/vincent/.ssh/id_ed25519 type -1\r\ndebug1: identity file /Users/vincent/.ssh/id_ed25519-cert type -1\r\ndebug1: identity file /Users/vincent/.ssh/id_xmss type -1\r\ndebug1: identity file /Users/vincent/.ssh/id_xmss-cert type -1\r\ndebug1: Local version string SSH-2.0-OpenSSH_7.9\r\ndebug1: Remote protocol version 2.0, remote software version OpenSSH_7.4\r\ndebug1: match: OpenSSH_7.4 pat OpenSSH_7.0*,OpenSSH_7.1*,OpenSSH_7.2*,OpenSSH_7.3*,OpenSSH_7.4*,OpenSSH_7.5*,OpenSSH_7.6*,OpenSSH_7.7* compat 0x04000002\r\ndebug2: fd 5 setting O_NONBLOCK\r\ndebug1: Authenticating to serverc:22 as \'root\'\r\ndebug3: put_host_port: [serverc]:22\r\ndebug3: hostkeys_foreach: reading file "/Users/vincent/.ssh/known_hosts"\r\ndebug3: record_hostkey: found key type ECDSA in file /Users/vincent/.ssh/known_hosts:132\r\ndebug3: record_hostkey: found key type RSA in file /Users/vincent/.ssh/known_hosts:140\r\ndebug3: load_hostkeys: loaded 2 keys from [serverc]:22\r\ndebug3: order_hostkeyalgs: prefer hostkeyalgs: ecdsa-sha2-nistp256-cert-v01@openssh.com,ecdsa-sha2-nistp384-cert-v01@openssh.com,ecdsa-sha2-nistp521-cert-v01@openssh.com,rsa-sha2-512-cert-v01@openssh.com,rsa-sha2-256-cert-v01@openssh.com,ssh-rsa-cert-v01@openssh.com,ecdsa-sha2-nistp256,ecdsa-sha2-nistp384,ecdsa-sha2-nistp521,rsa-sha2-512,rsa-sha2-256,ssh-rsa\r\ndebug3: send packet: type 20\r\ndebug1: SSH2_MSG_KEXINIT sent\r\ndebug3: receive packet: type 20\r\ndebug1: SSH2_MSG_KEXINIT received\r\ndebug2: local client KEXINIT proposal\r\ndebug2: KEX algorithms: curve25519-sha256,curve25519-sha256@libssh.org,ecdh-sha2-nistp256,ecdh-sha2-nistp384,ecdh-sha2-nistp521,diffie-hellman-group-exchange-sha256,diffie-hellman-group16-sha512,diffie-hellman-group18-sha512,diffie-hellman-group14-sha256,diffie-hellman-group14-sha1,ext-info-c\r\ndebug2: host key algorithms: ecdsa-sha2-nistp256-cert-v01@openssh.com,ecdsa-sha2-nistp384-cert-v01@openssh.com,ecdsa-sha2-nistp521-cert-v01@openssh.com,rsa-sha2-512-cert-v01@openssh.com,rsa-sha2-256-cert-v01@openssh.com,ssh-rsa-cert-v01@openssh.com,ecdsa-sha2-nistp256,ecdsa-sha2-nistp384,ecdsa-sha2-nistp521,rsa-sha2-512,rsa-sha2-256,ssh-rsa,ssh-ed25519-cert-v01@openssh.com,ssh-ed25519\r\ndebug2: ciphers ctos: chacha20-poly1305@openssh.com,aes128-ctr,aes192-ctr,aes256-ctr,aes128-gcm@openssh.com,aes256-gcm@openssh.com\r\ndebug2: ciphers stoc: 
chacha20-poly1305@openssh.com,aes128-ctr,aes192-ctr,aes256-ctr,aes128-gcm@openssh.com,aes256-gcm@openssh.com\r\ndebug2: MACs ctos: umac-64-etm@openssh.com,umac-128-etm@openssh.com,hmac-sha2-256-etm@openssh.com,hmac-sha2-512-etm@openssh.com,hmac-sha1-etm@openssh.com,umac-64@openssh.com,umac-128@openssh.com,hmac-sha2-256,hmac-sha2-512,hmac-sha1\r\ndebug2: MACs stoc: umac-64-etm@openssh.com,umac-128-etm@openssh.com,hmac-sha2-256-etm@openssh.com,hmac-sha2-512-etm@openssh.com,hmac-sha1-etm@openssh.com,umac-64@openssh.com,umac-128@openssh.com,hmac-sha2-256,hmac-sha2-512,hmac-sha1\r\ndebug2: compression ctos: zlib@openssh.com,zlib,none\r\ndebug2: compression stoc: zlib@openssh.com,zlib,none\r\ndebug2: languages ctos: \r\ndebug2: languages stoc: \r\ndebug2: first_kex_follows 0 \r\ndebug2: reserved 0 \r\ndebug2: peer server KEXINIT proposal\r\ndebug2: KEX algorithms: curve25519-sha256,curve25519-sha256@libssh.org,ecdh-sha2-nistp256,ecdh-sha2-nistp384,ecdh-sha2-nistp521,diffie-hellman-group-exchange-sha256,diffie-hellman-group16-sha512,diffie-hellman-group18-sha512,diffie-hellman-group-exchange-sha1,diffie-hellman-group14-sha256,diffie-hellman-group14-sha1,diffie-hellman-group1-sha1\r\ndebug2: host key algorithms: ssh-rsa,rsa-sha2-512,rsa-sha2-256,ecdsa-sha2-nistp256,ssh-ed25519\r\ndebug2: ciphers ctos: chacha20-poly1305@openssh.com,aes128-ctr,aes192-ctr,aes256-ctr,aes128-gcm@openssh.com,aes256-gcm@openssh.com,aes128-cbc,aes192-cbc,aes256-cbc,blowfish-cbc,cast128-cbc,3des-cbc\r\ndebug2: ciphers stoc: chacha20-poly1305@openssh.com,aes128-ctr,aes192-ctr,aes256-ctr,aes128-gcm@openssh.com,aes256-gcm@openssh.com,aes128-cbc,aes192-cbc,aes256-cbc,blowfish-cbc,cast128-cbc,3des-cbc\r\ndebug2: MACs ctos: umac-64-etm@openssh.com,umac-128-etm@openssh.com,hmac-sha2-256-etm@openssh.com,hmac-sha2-512-etm@openssh.com,hmac-sha1-etm@openssh.com,umac-64@openssh.com,umac-128@openssh.com,hmac-sha2-256,hmac-sha2-512,hmac-sha1\r\ndebug2: MACs stoc: umac-64-etm@openssh.com,umac-128-etm@openssh.com,hmac-sha2-256-etm@openssh.com,hmac-sha2-512-etm@openssh.com,hmac-sha1-etm@openssh.com,umac-64@openssh.com,umac-128@openssh.com,hmac-sha2-256,hmac-sha2-512,hmac-sha1\r\ndebug2: compression ctos: none,zlib@openssh.com\r\ndebug2: compression stoc: none,zlib@openssh.com\r\ndebug2: languages ctos: \r\ndebug2: languages stoc: \r\ndebug2: first_kex_follows 0 \r\ndebug2: reserved 0 \r\ndebug1: kex: algorithm: curve25519-sha256\r\ndebug1: kex: host key algorithm: ecdsa-sha2-nistp256\r\ndebug1: kex: server->client cipher: chacha20-poly1305@openssh.com MAC: <implicit> compression: zlib@openssh.com\r\ndebug1: kex: client->server cipher: chacha20-poly1305@openssh.com MAC: <implicit> compression: zlib@openssh.com\r\ndebug3: send packet: type 30\r\ndebug1: expecting SSH2_MSG_KEX_ECDH_REPLY\r\ndebug3: receive packet: type 31\r\ndebug1: Server host key: ecdsa-sha2-nistp256 SHA256:1VVaMDmgjtuwqASxZCLIPZNkagDHT18VxPRoFYV0Wzw\r\ndebug3: put_host_port: [192.168.3.212]:22\r\ndebug3: put_host_port: [serverc]:22\r\ndebug3: hostkeys_foreach: reading file "/Users/vincent/.ssh/known_hosts"\r\ndebug3: record_hostkey: found key type ECDSA in file /Users/vincent/.ssh/known_hosts:132\r\ndebug3: record_hostkey: found key type RSA in file /Users/vincent/.ssh/known_hosts:140\r\ndebug3: load_hostkeys: loaded 2 keys from [serverc]:22\r\ndebug3: hostkeys_foreach: reading file "/Users/vincent/.ssh/known_hosts"\r\ndebug3: record_hostkey: found key type ECDSA in file /Users/vincent/.ssh/known_hosts:131\r\ndebug3: load_hostkeys: loaded 1 keys from 
[192.168.3.212]:22\r\ndebug1: Host \'[serverc]:22\' is known and matches the ECDSA host key.\r\ndebug1: Found key in /Users/vincent/.ssh/known_hosts:132\r\ndebug3: send packet: type 21\r\ndebug2: set_newkeys: mode 1\r\ndebug1: rekey after 134217728 blocks\r\ndebug1: SSH2_MSG_NEWKEYS sent\r\ndebug1: expecting SSH2_MSG_NEWKEYS\r\ndebug3: receive packet: type 21\r\ndebug1: SSH2_MSG_NEWKEYS received\r\ndebug2: set_newkeys: mode 0\r\ndebug1: rekey after 134217728 blocks\r\ndebug1: Will attempt key: /Users/vincent/.ssh/id_rsa RSA SHA256:jFJZZEt1K+N4p4fsX509tQbKNiQwEob+zg0whAJs+FA\r\ndebug1: Will attempt key: /Users/vincent/.ssh/id_dsa \r\ndebug1: Will attempt key: /Users/vincent/.ssh/id_ecdsa \r\ndebug1: Will attempt key: /Users/vincent/.ssh/id_ed25519 \r\ndebug1: Will attempt key: /Users/vincent/.ssh/id_xmss \r\ndebug2: pubkey_prepare: done\r\ndebug3: send packet: type 5\r\ndebug3: receive packet: type 7\r\ndebug1: SSH2_MSG_EXT_INFO received\r\ndebug1: kex_input_ext_info: server-sig-algs=<rsa-sha2-256,rsa-sha2-512>\r\ndebug3: receive packet: type 6\r\ndebug2: service_accept: ssh-userauth\r\ndebug1: SSH2_MSG_SERVICE_ACCEPT received\r\ndebug3: send packet: type 50\r\ndebug3: receive packet: type 51\r\ndebug1: Authentications that can continue: publickey,password\r\ndebug3: start over, passed a different list publickey,password\r\ndebug3: preferred gssapi-with-mic,gssapi-keyex,hostbased,publickey\r\ndebug3: authmethod_lookup publickey\r\ndebug3: remaining preferred: ,gssapi-keyex,hostbased,publickey\r\ndebug3: authmethod_is_enabled publickey\r\ndebug1: Next authentication method: publickey\r\ndebug1: Offering public key: /Users/vincent/.ssh/id_rsa RSA SHA256:jFJZZEt1K+N4p4fsX509tQbKNiQwEob+zg0whAJs+FA\r\ndebug3: send packet: type 50\r\ndebug2: we sent a publickey packet, wait for reply\r\ndebug3: receive packet: type 60\r\ndebug1: Server accepts key: /Users/vincent/.ssh/id_rsa RSA SHA256:jFJZZEt1K+N4p4fsX509tQbKNiQwEob+zg0whAJs+FA\r\ndebug3: sign_and_send_pubkey: RSA SHA256:jFJZZEt1K+N4p4fsX509tQbKNiQwEob+zg0whAJs+FA\r\ndebug3: sign_and_send_pubkey: signing using rsa-sha2-512\r\ndebug3: send packet: type 50\r\ndebug3: receive packet: type 52\r\ndebug1: Enabling compression at level 6.\r\ndebug1: Authentication succeeded (publickey).\r\nAuthenticated to serverc ([192.168.3.212]:22).\r\ndebug1: setting up multiplex master socket\r\ndebug3: muxserver_listen: temporary control path /Users/vincent/.ansible/cp/b17d87e4d6.C7BGXjv82Yt9mot2\r\ndebug2: fd 6 setting O_NONBLOCK\r\ndebug3: fd 6 is O_NONBLOCK\r\ndebug3: fd 6 is O_NONBLOCK\r\ndebug1: channel 0: new [/Users/vincent/.ansible/cp/b17d87e4d6]\r\ndebug3: muxserver_listen: mux listener channel 0 fd 6\r\ndebug2: fd 5 setting TCP_NODELAY\r\ndebug3: ssh_packet_set_tos: set IP_TOS 0x20\r\ndebug1: control_persist_detach: backgrounding master process\r\ndebug2: control_persist_detach: background process is 3567\r\ndebug2: fd 6 setting O_NONBLOCK\r\ndebug1: forking to background\r\ndebug1: Entering interactive session.\r\ndebug1: pledge: id\r\ndebug2: set_control_persist_exit_time: schedule exit in 60 seconds\r\ndebug1: multiplexing control connection\r\ndebug3: fd 3 is O_NONBLOCK\r\ndebug3: fd 3 is O_NONBLOCK\r\ndebug1: channel 1: new [mux-control]\r\ndebug3: channel_post_mux_listener: new mux channel 1 fd 3\r\ndebug3: mux_master_read_cb: channel 1: hello sent\r\ndebug2: set_control_persist_exit_time: cancel scheduled exit\r\ndebug3: mux_master_read_cb: channel 1 packet type 0x00000001 len 4\r\ndebug2: mux_master_process_hello: channel 1 slave version 
4\r\ndebug2: mux_client_hello_exchange: master version 4\r\ndebug3: mux_client_forwards: request forwardings: 0 local, 0 remote\r\ndebug3: mux_client_request_session: entering\r\ndebug3: mux_client_request_alive: entering\r\ndebug3: mux_master_read_cb: channel 1 packet type 0x10000004 len 4\r\ndebug2: mux_master_process_alive_check: channel 1: alive check\r\ndebug3: mux_client_request_alive: done pid = 3569\r\ndebug3: mux_client_request_session: session request sent\r\ndebug3: mux_master_read_cb: channel 1 packet type 0x10000002 len 124\r\ndebug2: mux_master_process_new_session: channel 1: request tty 1, X 0, agent 0, subsys 0, term "xterm-256color", cmd "/bin/sh -c \'echo ~root && sleep 0\'", env 2\r\ndebug3: mux_master_process_new_session: got fds stdin 7, stdout 8, stderr 9\r\ndebug2: fd 8 setting O_NONBLOCK\r\ndebug2: fd 9 setting O_NONBLOCK\r\ndebug1: channel 2: new [client-session]\r\ndebug2: mux_master_process_new_session: channel_new: 2 linked to control channel 1\r\ndebug2: channel 2: send open\r\ndebug3: send packet: type 90\r\ndebug3: receive packet: type 80\r\ndebug1: client_input_global_request: rtype hostkeys-00@openssh.com want_reply 0\r\ndebug3: receive packet: type 91\r\ndebug2: channel_input_open_confirmation: channel 2: callback start\r\ndebug2: client_session2_setup: id 2\r\ndebug2: channel 2: request pty-req confirm 1\r\ndebug3: send packet: type 98\r\ndebug1: Sending environment.\r\ndebug1: Sending env LC_ALL = en_US.UTF-8\r\ndebug2: channel 2: request env confirm 0\r\ndebug3: send packet: type 98\r\ndebug1: Sending env LC_CTYPE = UTF-8\r\ndebug2: channel 2: request env confirm 0\r\ndebug3: send packet: type 98\r\ndebug1: Sending command: /bin/sh -c \'echo ~root && sleep 0\'\r\ndebug2: channel 2: request exec confirm 1\r\ndebug3: send packet: type 98\r\ndebug3: mux_session_confirm: sending success reply\r\ndebug2: channel_input_open_confirmation: channel 2: callback done\r\ndebug2: channel 2: open confirm rwindow 0 rmax 32768\r\ndebug3: receive packet: type 99\r\ndebug2: channel_input_status_confirm: type 99 id 2\r\ndebug2: PTY allocation request accepted on channel 2\r\ndebug2: channel 2: rcvd adjust 2097152\r\ndebug3: receive packet: type 99\r\ndebug2: channel_input_status_confirm: type 99 id 2\r\ndebug2: exec request accepted on channel 2\r\ndebug3: receive packet: type 98\r\ndebug1: client_input_channel_req: channel 2 rtype exit-status reply 0\r\ndebug3: mux_exit_message: channel 2: exit message, exitval 0\r\ndebug3: receive packet: type 98\r\ndebug1: client_input_channel_req: channel 2 rtype eow@openssh.com reply 0\r\ndebug2: channel 2: rcvd eow\r\ndebug2: channel 2: chan_shutdown_read (i0 o0 sock -1 wfd 7 efd 9 [write])\r\ndebug2: channel 2: input open -> closed\r\ndebug3: receive packet: type 96\r\ndebug2: channel 2: rcvd eof\r\ndebug2: channel 2: output open -> drain\r\ndebug2: channel 2: obuf empty\r\ndebug2: channel 2: chan_shutdown_write (i3 o1 sock -1 wfd 8 efd 9 [write])\r\ndebug2: channel 2: output drain -> closed\r\ndebug3: receive packet: type 97\r\ndebug2: channel 2: rcvd close\r\ndebug3: channel 2: will not send data after close\r\ndebug2: channel 2: send close\r\ndebug3: send packet: type 97\r\ndebug2: channel 2: is dead\r\ndebug2: channel 2: gc: notify user\r\ndebug3: mux_master_session_cleanup_cb: entering for channel 2\r\ndebug2: channel 1: rcvd close\r\ndebug2: channel 1: output open -> drain\r\ndebug2: channel 1: chan_shutdown_read (i0 o1 sock 3 wfd 3 efd -1 [closed])\r\ndebug2: channel 1: input open -> closed\r\ndebug2: channel 2: gc: user 
detached\r\ndebug2: channel 2: is dead\r\ndebug2: channel 2: garbage collecting\r\ndebug1: channel 2: free: client-session, nchannels 3\r\ndebug3: channel 2: status: The following connections are open:\r\n #1 mux-control (t16 nr0 i3/0 o1/16 e[closed]/0 fd 3/3/-1 sock 3 cc -1)\r\n #2 client-session (t4 r0 i3/0 o3/0 e[write]/0 fd -1/-1/9 sock -1 cc -1)\r\n\r\ndebug2: channel 1: obuf empty\r\ndebug2: channel 1: chan_shutdown_write (i3 o1 sock 3 wfd 3 efd -1 [closed])\r\ndebug2: channel 1: output drain -> closed\r\ndebug2: channel 1: is dead (local)\r\ndebug2: channel 1: gc: notify user\r\ndebug3: mux_master_control_cleanup_cb: entering for channel 1\r\ndebug2: channel 1: gc: user detached\r\ndebug2: channel 1: is dead (local)\r\ndebug2: channel 1: garbage collecting\r\ndebug1: channel 1: free: mux-control, nchannels 2\r\ndebug3: channel 1: status: The following connections are open:\r\n #1 mux-control (t16 nr0 i3/0 o3/0 e[closed]/0 fd 3/3/-1 sock 3 cc -1)\r\n\r\ndebug3: mux_client_read_packet: read header failed: Broken pipe\r\ndebug2: Received exit status from master 0\r\nShared connection to serverc closed.\r\ndebug2: set_control_persist_exit_time: schedule exit in 60 seconds\r\n')
<serverc> ESTABLISH SSH CONNECTION FOR USER: root
<serverc> SSH: EXEC ssh -vvv -C -o ControlMaster=auto -o ControlPersist=60s -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="root"' -o ConnectTimeout=10 -o ControlPath=/Users/vincent/.ansible/cp/b17d87e4d6 serverc '/bin/sh -c '"'"'( umask 77 && mkdir -p "` echo /root/.ansible/tmp/ansible-tmp-1561945122.158012-119072658004192 `" && echo ansible-tmp-1561945122.158012-119072658004192="` echo /root/.ansible/tmp/ansible-tmp-1561945122.158012-119072658004192 `" ) && sleep 0'"'"''
<serverc> (0, b'ansible-tmp-1561945122.158012-119072658004192=/root/.ansible/tmp/ansible-tmp-1561945122.158012-119072658004192\r\n', b'OpenSSH_7.9p1, LibreSSL 2.7.3\r\ndebug1: Reading configuration data /Users/vincent/.ssh/config\r\ndebug1: /Users/vincent/.ssh/config line 135: Applying options for serverc\r\ndebug1: /Users/vincent/.ssh/config line 183: Applying options for *\r\ndebug1: Reading configuration data /etc/ssh/ssh_config\r\ndebug1: /etc/ssh/ssh_config line 48: Applying options for *\r\ndebug1: auto-mux: Trying existing master\r\ndebug2: fd 3 setting O_NONBLOCK\r\ndebug2: mux_client_hello_exchange: master version 4\r\ndebug3: mux_client_forwards: request forwardings: 0 local, 0 remote\r\ndebug3: mux_client_request_session: entering\r\ndebug3: mux_client_request_alive: entering\r\ndebug3: mux_client_request_alive: done pid = 3569\r\ndebug3: mux_client_request_session: session request sent\r\ndebug3: mux_client_read_packet: read header failed: Broken pipe\r\ndebug2: Received exit status from master 0\r\nShared connection to serverc closed.\r\n')
<serverc> Attempting python interpreter discovery
<serverc> ESTABLISH SSH CONNECTION FOR USER: root
<serverc> SSH: EXEC ssh -vvv -C -o ControlMaster=auto -o ControlPersist=60s -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="root"' -o ConnectTimeout=10 -o ControlPath=/Users/vincent/.ansible/cp/b17d87e4d6 serverc '/bin/sh -c '"'"'echo PLATFORM; uname; echo FOUND; command -v '"'"'"'"'"'"'"'"'/usr/bin/python'"'"'"'"'"'"'"'"'; command -v '"'"'"'"'"'"'"'"'python3.7'"'"'"'"'"'"'"'"'; command -v '"'"'"'"'"'"'"'"'python3.6'"'"'"'"'"'"'"'"'; command -v '"'"'"'"'"'"'"'"'python3.5'"'"'"'"'"'"'"'"'; command -v '"'"'"'"'"'"'"'"'python2.7'"'"'"'"'"'"'"'"'; command -v '"'"'"'"'"'"'"'"'python2.6'"'"'"'"'"'"'"'"'; command -v '"'"'"'"'"'"'"'"'/usr/libexec/platform-python'"'"'"'"'"'"'"'"'; command -v '"'"'"'"'"'"'"'"'/usr/bin/python3'"'"'"'"'"'"'"'"'; command -v '"'"'"'"'"'"'"'"'python'"'"'"'"'"'"'"'"'; echo ENDFOUND && sleep 0'"'"''
<serverc> (0, b'PLATFORM\r\nLinux\r\nFOUND\r\n/usr/bin/python\r\n/usr/bin/python3.6\r\n/usr/bin/python2.7\r\n/usr/libexec/platform-python\r\n/usr/bin/python3\r\n/usr/bin/python\r\nENDFOUND\r\n', b'OpenSSH_7.9p1, LibreSSL 2.7.3\r\ndebug1: Reading configuration data /Users/vincent/.ssh/config\r\ndebug1: /Users/vincent/.ssh/config line 135: Applying options for serverc\r\ndebug1: /Users/vincent/.ssh/config line 183: Applying options for *\r\ndebug1: Reading configuration data /etc/ssh/ssh_config\r\ndebug1: /etc/ssh/ssh_config line 48: Applying options for *\r\ndebug1: auto-mux: Trying existing master\r\ndebug2: fd 3 setting O_NONBLOCK\r\ndebug2: mux_client_hello_exchange: master version 4\r\ndebug3: mux_client_forwards: request forwardings: 0 local, 0 remote\r\ndebug3: mux_client_request_session: entering\r\ndebug3: mux_client_request_alive: entering\r\ndebug3: mux_client_request_alive: done pid = 3569\r\ndebug3: mux_client_request_session: session request sent\r\ndebug3: mux_client_read_packet: read header failed: Broken pipe\r\ndebug2: Received exit status from master 0\r\nShared connection to serverc closed.\r\n')
<serverc> ESTABLISH SSH CONNECTION FOR USER: root
<serverc> SSH: EXEC ssh -vvv -C -o ControlMaster=auto -o ControlPersist=60s -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="root"' -o ConnectTimeout=10 -o ControlPath=/Users/vincent/.ansible/cp/b17d87e4d6 serverc '/bin/sh -c '"'"'/usr/bin/python && sleep 0'"'"''
Something else that has happened to me is having something like:

- set_fact:
    linux_cmd: >-
      {{ executable }}
      --flag1
      --option1 argument

- command: "{{ linux_cmd }}"
When `linux_cmd` is given that way, the command module silently fails.
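If the fold is the culprit, a cheap thing to try (a sketch, not a confirmed fix) is building the same value on a single line, so no folding or trailing whitespace is involved:

- set_fact:
    linux_cmd: "{{ executable }} --flag1 --option1 argument"

- command: "{{ linux_cmd }}"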
It seems I have the same problem (version 2.4.2). I use the jinja2 template module to copy a config file to 280 target hosts. For performance optimization, I changed forks from 5 to 300. All subprocesses except one or two finish successfully. Those subprocesses which cannot finish are just stuck there, without doing anything to the target hosts, and they cause the parent process to spin in a busy loop consuming almost 90% CPU. I ran strace on the parent and it is just making a lot of select() calls...
Why does the parent simply spin on select() without doing anything about the stuck subprocesses?
I use 2.8.5 and it gets stuck on gathering facts...
With verbose on, the final log line is "Escalation succeeded".
IDK if it will help, but if you prepend ANSIBLE_DEBUG=1 to the execution, it will show even more verbose info, though I'm not sure it would reveal anything else helpful.
I am having the same issue.
What I have done is create a new playbook that attempts a python hello world, to see what happens.
playbook:

- hosts: all
  gather_facts: False
  tasks:
    - name: Test Hello_world
      hello_world:
      register: result
    - debug: var=result
hello_world.py:

from ansible.module_utils.basic import *

def main():
    module = AnsibleModule(argument_spec={})
    theReturnValue = {"hello": "world"}
    module.exit_json(changed=False, meta=theReturnValue)

if __name__ == '__main__':
    main()
It hangs just as the others above: it opens the SSH connection and appears to be running the code, but it never appears to complete/close. I have manually run the code on the remote host and get the returned value as expected.
Ansible version 2.8 - AWX 6.0.0
Remote Host: RHEL 6.5 Python 2.6
I also hang on gather_facts, and I wanted to determine whether the problem was with gather_facts or something else. I believe it is something else, and I'm thinking it's `ansible.module_utils.basic`.
Update on mine: because I am using AWX, I have found the problem is with "become". When I set become: false in the playbook, it works perfectly; when I use become: true, it hangs. I believe this is because it is prompting for something that, with AWX, will never make it back.
man -e 1 timeout
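That is, wrap the remote command in coreutils timeout(1) so the process kills itself even when Ansible never would; a sketch with arbitrary durations:

- name: yum clean with a hard one-hour cap (SIGKILL sent 10s after the SIGTERM)
  command: timeout -k 10 3600 yum clean all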
I have the same behavior randomly on any task. Not often, but it's clearly a handicap to have no option to handle a timeout.
Some of the hangs may be related to https://sourceforge.net/p/sshpass/patches/11/. This can occur when running ansible playbooks on a heavily loaded system (i.e. a high fork count) and using sshpass with the control persist feature for ssh (the default for most).
In my case it was the following: hosts where the .bashrc of the user ansible uses to log in had a "sudo" to another user in it. This caused Ansible to hang indefinitely.
This may not be super helpful, but 2.10 will introduce a task timeout feature which may help in the case where this may pop up with troublesome commands.
This may not be super helpful, but 2.10 will introduce a task timeout feature which may help in the case where this may pop up with troublesome commands.
This should help in the case where the .bashrc/.bash_profile/.login has a sudo in it.
Or any other case that causes an indefinite/long process; the problem is that there are too many reasons this can happen, and many of them are not discoverable, much less solvable, remotely from the controller.
The new `timeout` keyword will help mitigate this issue. Though it is not a fix, it at least gives users the option to continue the rest of the play if needed and abandon the execution after what they would consider a reasonable period.
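For reference, a sketch of how the 2.10 keyword reads in a task (the script path here is hypothetical):

- name: Job that gets abandoned instead of hanging the play
  command: /usr/local/bin/nightly-job.sh
  timeout: 600

When the limit is hit, the task fails for that host while the other hosts carry on.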
I agree. This should help out tremendously. Thanks for the persistence with this issue!
I was running into this, but only on two boxes out of 80-ish.
What I found was that facter was hanging. It was consistently hanging (24 instances of it, from 24 attempts on one box) in a non-interruptible disk-wait state, with /proc/partitions as the last file descriptor open. It would only hang on boxes where someone had tried to disable the floppy driver, but had done so incorrectly. It would also only hang in gather_facts if that was set to yes, and only if become was set to yes. (Either being 'no' would not result in a hang.)
Blacklisting the floppy driver solved the problem for me. At least so far. I'm pretty sure that's what was wrong on the two systems which were plaguing me with hangs in facter. Hard to tell 100%, but the diagnostics were the same on both, and they've both been fine since being rectified. However, a reboot has been known to fix this issue in the past, so I'm not 100% convinced, and won't be at ease for a while.
In my case, Ansible was hanging forever (I guess; I only waited for one hour) to execute a command that only takes 2 seconds when executed manually.
The reason: a playbook invoked as `root`, with a task using the `shell` module and becoming a less privileged user.
tasks:
  - name: Non-root task running a quick node.js script
    shell: "yarn my-script"
    args:
      executable: /bin/bash
      chdir: "target-dir"
    become: yes
    become_user: "my-non-root-user"

  - name: This task needs ROOT access, that's why the playbook is invoked with the root user
    shell: "my-executable-shell-script.sh"
    args:
      executable: /bin/bash
      chdir: "target-dir"
I need to do this so the files touched/created by the task are owned by the actual user instead of being owned by `root`
**The workaround:**
- Running the playbook with the user account that should own the files to be created, instead of invoking the playbook as the `root` user, makes the task pass. But, as other tasks in the same playbook (that should be executed as part of the same run) require `root` access, the next tasks fail due to lack of privilege;
**Questions:**
1. Is `"de-escalating"` (sorry, couldn't think of a better word for that) from `root` to a regular user account a bad practice? Why? If so, could Ansible raise a warning and tell users about it?
2. Other [comments on this same thread](#issuecomment-550430394) point to the usage of `become` as the root cause. As `Ansible 2.10` will tackle this issue by adding the `timeout` keyword, would it be possible for the error output of a timed-out task to also point to issues related to [in]correct(?) usage of `become`?
@Fairlight2cx so that is under the generic 'fact gathering device issues' family of hangs; fact gathering tries to list current devices, many of which are skipped as only root is allowed to query certain information.
@jfoliveira
In your case it happened just in that circumstance. But this can still happen under a normal user, with a device or any other blocking call they have access to query, which can be 'broken' or 'stuck' when you do so. My favorite example: `ls /badnfs`.
This is why setting up an automatic timeout just for become tasks would not be advisable... also, there are many cases in which a task is supposed to take a long time, and this would require users to rewrite their playbooks to account for it.
In the end, this type of hang always requires a skilled admin to debug the cause as there are too many possible reasons and they all require different solutions.
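For the 'supposed to take a long time' case, one pre-2.10 way to put a ceiling on a task is async/poll, where async is the maximum runtime in seconds before Ansible gives up on it; a sketch with made-up numbers and a hypothetical script:

- name: Long job with an explicit one-hour ceiling
  command: /usr/local/bin/long-job.sh
  async: 3600
  poll: 30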
Thanks Brian @bcoca! I get the issue with long-running processes, and the `timeout` feature will definitely be helpful.
What I'm not getting, though, is this very playbook/shell command only hanging when executed through Ansible as root with become_user: "my-non-root-user-name" set in my playbook.
Examples:
Works - playbook finishes in a few seconds, as expected, and STDERR of the failing shell command is displayed:
ansible-playbook my-playbook.yml --user=my-non-root-user-name
Hangs "forever" - I waited up to one hour, several times:
ansible-playbook my-playbook.yml --user=root
I'm giving lots of details below. Please let me know if this doesn't fit here, if I should submit another issue for that, and what extra info I can provide in order to help on a fix for it.
- Running `yarn build` manually on the target machine, it fails in 12 (twelve) seconds, as expected, since there is an actual error in my Node.js code that I set on purpose to ensure my playbook can report the STDERR output;
- From the `ps aux` or `top` commands, it seems Ansible-as-root is spawning a new process when the first one fails, and spawning a third process when the second one fails, ..., as there are many running processes executing the same `yarn build` command;
- `ANSIBLE_DEBUG` or `-vvvvvvv` are not telling me anything other than regular SSH connection details;
- Running Ansible as the root user, I see many processes on the target machine with the same shell command.

Question:
Is there any implicit Ansible setting that forces a "keep retrying when the process fails" behavior when executing the playbook as root and becoming another user? I'd love to disable it.
Here is the playbook code:

---
- hosts: all
  gather_facts: no
  become: yes
  become_user: "my-non-root-user-name"
  tasks:
    - name: Non-root task running a quick failing node.js command
      shell: "yarn build"
      args:
        executable: /bin/bash
        chdir: "target-dir"
Environment details that might be helpful:
OS: Ubuntu 18.04 LTS in both Control and Node/Target machines
Ansible - Version:
$ ansible-playbook --version
ansible-playbook 2.9.9
config file = /home/user/playbooks/ansible.cfg
configured module search path = [u'/home/user/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
ansible python module location = /usr/lib/python2.7/dist-packages/ansible
executable location = /usr/bin/ansible-playbook
python version = 2.7.17 (default, Apr 15 2020, 17:20:14) [GCC 7.5.0]
Ansible - Configuration:
$ ansible-config dump --only-changed
ANSIBLE_NOCOWS(/home/user/playbooks/ansible.cfg) = True
ANSIBLE_PIPELINING(/home/user/playbooks/ansible.cfg) = True
ANSIBLE_SSH_ARGS(/home/user/playbooks/ansible.cfg) = -C -o ControlPersist=60s -o PreferredAuthentications=publickey -o ControlMaster=auto -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null
DEFAULT_CALLBACK_WHITELIST(/home/user/playbooks/ansible.cfg) = [u'json', u'timer']
DEFAULT_MANAGED_STR(/home/user/playbooks/ansible.cfg) = Project.com - File generated through http://project.com with
DEFAULT_ROLES_PATH(/home/user/playbooks/ansible.cfg) = [u'/home/user/user/playbooks/roles', u'/etc/ansible/roles', u'/home/user/.ansible/roles']
DEFAULT_STDOUT_CALLBACK(/home/user/playbooks/ansible.cfg) = debug
HOST_KEY_CHECKING(/home/user/playbooks/ansible.cfg) = False
INTERPRETER_PYTHON(/home/user/playbooks/ansible.cfg) = /usr/bin/python3
RETRY_FILES_ENABLED(/home/user/playbooks/ansible.cfg) = False
No, there is no implicit 'process retry'; that is what until/retry is for. The ssh connection does have a 'connection retry', but that would not show on the remote if failing/retrying, nor create new processes.
Added the new `timeout` keyword in 2.10 (#69284), which acts as a mitigation for these conditions. Ansible still cannot know if the remote is hung or just waiting for something to finish, but the playbook author can now set a timeout that, in their context, will create a limit between both cases.
I also saw hang behavior after upgrading to 2.10, so at first I thought it was something with the new version. I downgraded to 2.9.13 and was surprised that this issue persists.
On every run it hangs on a different system and a different task. So I dug a little bit further and did some straces, and I can see on the target system(s) the processes reading on stdin, which is a pipe that goes up to the ssh connection, waiting on data from the master. So I guess it has something to do with the forks; after reducing the forks from 5 to 1, the playbook completes (very slowly 😄) without problems. Maybe this is a race condition or something like that, related to #59642?
With forks > 1, if I kill the ssh session I get different "exceptions":
fatal: [pandora]: UNREACHABLE! => {"changed": false, "msg": "Failed to create temporary directory.In some cases, you may have been able to authenticate and did not have permissions on the target directory. Consider changing the remote tmp path in ansible.cfg to a path rooted in \"/tmp\", for more error information use -vvv. Failed command was: ( umask 77 && mkdir -p \"` echo /tmp/.ansible `\"&& mkdir \"` echo /tmp/.ansible/ansible-tmp-1601029060.3310869-54420-202820423202194 `\" && echo ansible-tmp-1601029060.3310869-54420-202820423202194=\"` echo /tmp/.ansible/ansible-tmp-1601029060.3310869-54420-202820423202194 `\" ), exited with result -15", "unreachable": true}
fatal: [charon]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: ", "unreachable": true}
After that, the playbook completes for the remaining hosts.
OS: Arch Linux on controller + targets
Ansible Version:
ansible 2.10.1
config file = /home/daniel/repos/ansible/ansible.cfg
configured module search path = ['/home/daniel/repos/ansible/library', '/home/daniel/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
ansible python module location = /usr/lib/python3.8/site-packages/ansible
executable location = /usr/bin/ansible
python version = 3.8.5 (default, Sep 5 2020, 10:50:12) [GCC 10.2.0]
Ansible - Configuration:
ansible-config dump --only-changed
[DEPRECATION WARNING]: DEFAULT_HASH_BEHAVIOUR option, This feature is fragile and not portable, leading to continual confusion and misuse , use the ``combine`` filter explicitly instead. This feature will be removed in version 2.13.
Deprecation warnings can be disabled by setting deprecation_warnings=False in ansible.cfg.
ANSIBLE_NOCOWS(/home/daniel/repos/ansible/ansible.cfg) = True
ANSIBLE_PIPELINING(/home/daniel/repos/ansible/ansible.cfg) = True
CACHE_PLUGIN(/home/daniel/repos/ansible/ansible.cfg) = jsonfile
CACHE_PLUGIN_CONNECTION(/home/daniel/repos/ansible/ansible.cfg) = cache/
CACHE_PLUGIN_TIMEOUT(/home/daniel/repos/ansible/ansible.cfg) = 900
DEFAULT_CALLBACK_WHITELIST(/home/daniel/repos/ansible/ansible.cfg) = ['profile_tasks']
DEFAULT_FORKS(/home/daniel/repos/ansible/ansible.cfg) = 1
DEFAULT_HASH_BEHAVIOUR(/home/daniel/repos/ansible/ansible.cfg) = merge
DEFAULT_HOST_LIST(/home/daniel/repos/ansible/ansible.cfg) = ['/home/daniel/repos/ansible/hosts']
DEFAULT_LOCAL_TMP(/home/daniel/repos/ansible/ansible.cfg) = /tmp/.ansible/ansible-local-48040jhby95fe
DEFAULT_MODULE_PATH(/home/daniel/repos/ansible/ansible.cfg) = ['/home/daniel/repos/ansible/library', '/home/daniel/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
DEFAULT_NO_TARGET_SYSLOG(/home/daniel/repos/ansible/ansible.cfg) = True
DEFAULT_TIMEOUT(/home/daniel/repos/ansible/ansible.cfg) = 5
DEFAULT_VAULT_PASSWORD_FILE(env: ANSIBLE_VAULT_PASSWORD_FILE) = /home/daniel/.tokens/.ansible_vault_token
INTERPRETER_PYTHON(/home/daniel/repos/ansible/ansible.cfg) = auto_silent
RETRY_FILES_ENABLED(/home/daniel/repos/ansible/ansible.cfg) = False
(yes i know DEFAULT_HASH_BEHAVIOUR is deprecated, but i come from saltstack and use my pillars 1:1 as host/group vars so it hurts to rewrite all the stuff 😱)
I am also experiencing this issue.
In my case the tasks are not long, but the playbook executions are (around 40 minutes): the use case is using ansible to run CI tests.
I don't target a lot of servers, just 3.
IMHO the hangs are not completely random but only pseudo-random. For example, there have been 3 or 4 hangs in a row on the same task.
The actual task checks that a list of packages is installed (debian/apt); they are installed, so it returns without changes on 2 of the 3 hosts.
It is very curious that this concrete task is stopping the execution, and curious to see that on each execution the host where it happens changes, but the task is the same.
I suspect timings (because of the predictability) and parallelisation (because the host changes).
Also, another thing: my playbooks reboot the target machines 2 or 3 times. The failure is 2 or 3 tasks after a reboot, right now.
I hope this information helps to figure out what it could be.
My use case involves speaking to a large number of remote hosts (serialized to 20-25 hosts at a time) that can't be touched during business hours, so my playbook is called from a cron task and uses the slack module to send me updates for which hosts have been updated. The integrated webhook allows me to check on the status of these updates from my phone and allows me to work from home/afk much more easily :)
Something interesting I have noticed when this ssh timeout issue occurs: ansible seemingly spawns a process for each host it speaks to during any particular task. The host that it is stuck on will be listed as an active process on my ansible controller with the inventory hostname appended to it (e.g. `ps aux | grep hostname.test.com`).
In my investigations so far, this process hangs for days at a time, but does not fail or timeout the ansible-playbook itself. I can manually kill the created process for the offending host on my ansible controller, and the playbook will continue as normal (while failing the task for the offending host).
My current plan is to determine an effective way of terminating these frozen processes when they are taking too long, as I have a good feel for how long each serialized run should take, which will then allow the playbook to continue. Since I have the privilege of searching for a particular string found in each instance of my inventory hostnames, I can locate the processes specific to the ansible hosts themselves and determine the pid from that. Another thing worth noting is typically I can attempt the same offending host a day or two later and find that the issue does not occur, so any host this occurs on can be attempted at a later time.
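For what it's worth, a rough sketch of that watchdog as a localhost play; the grep pattern reuses the hostname example above, the two-hour threshold is an assumption to adapt, and a plain kill may be enough before reaching for -9:

- hosts: localhost
  gather_facts: no
  tasks:
    - name: Kill stale ansible worker processes for one host after two hours
      shell: |
        # pgrep -f matches the inventory hostname in each worker's command line
        for pid in $(pgrep -f 'hostname.test.com'); do
          age=$(ps -o etimes= -p "$pid" | tr -d ' ')   # elapsed seconds
          if [ "${age:-0}" -gt 7200 ]; then
            kill -9 "$pid"
          fi
        done
      changed_when: false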
I think I can speak for everyone by saying I appreciate the info and input that everyone has added so far. I hope this information helps you in some way to determine a good work-around for this issue in your use case.
I'm having this issue also, and it seems to come up on Solaris servers more so than our Oracle Linux servers. I've even bumped the timeout way up:
"msg": "Timeout (62s) waiting for privilege escalation prompt: "
It's not consistent either. Sometimes it'll work just fine across all hosts - other times a number will fail. I do have some larger playbooks running and in those I have told it not to gather facts - that seems to help with this a great deal as well. I'm on a new project with Ansible, so I'm kind of looking over this issue again. The environment I'm currently working on this project for isn't a prod environment, so I have a bit more room to 'test'.
I know it's not the playbook. I have one specifically for testing connections:

---
- hosts: all
  become: yes
  tasks:
    - name:
      debug:
        var: hostvars[inventory_hostname].ansible_hostname
It couldn't get more simple, I suppose. Thing is, when I log in using the Ansible account directly and use sudo su -, it works perfectly every time. -c paramiko does seem to mitigate it to a large degree, but not always.
I did happen to notice, though, a couple of entries in the 'su-log' similar to this:
sudo: [ID 183074 local2.notice] ansible : TTY=unknown
In most other cases, I get an actual number for the TTY session: sudo: [ID 183074 local2.notice] ansible : TTY=pts/2
Not all of them were that way. I wasn't tracking the exact time I got the sudo error, but I'm running that playbook in a loop on one of the servers that errored out to see if the 'TTY=unknown' coincides with a sudo failure.
Also - this is Solaris 5.11 11.4.26.75.4 - using sudo version 1.8.31.1-11.4.24.0.1.75.1
Edit/update: the Unknown TTY doesn't seem to appear on the most recent failures I ran into.
`sudo su -` is not really supported by the `sudo` become method.
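The supported shape is to let Ansible drive sudo itself through the become keywords rather than spawning a login shell; a sketch:

- hosts: all
  become: yes
  become_method: sudo
  become_user: root
  tasks:
    - name: Runs as root through the sudo become plugin
      command: id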
ISSUE TYPE
COMPONENT NAME
ansible
ANSIBLE VERSION
ansible 2.3.0 (devel 5fa82f2b4e) last updated 2016/10/31 12:01:58 (GMT +200)
lib/ansible/modules/core: (detached HEAD f9fada7db0) last updated 2016/10/31 12:02:21 (GMT +200)
lib/ansible/modules/extras: (detached HEAD 5701a2e050) last updated 2016/10/31 12:02:41 (GMT +200)
config file = /etc/ansible/ansible.cfg
configured module search path = Default w/o overrides
CONFIGURATION
OS / ENVIRONMENT
RHEL 7.2 (Maipo) Python 2.7.5 (default, Oct 11 2015, 17:47:16)
SUMMARY
Ansible fails to time out when the SSH connection is OK but the command executes for a long (infinite) time. Common errors:
A command execution timeout should be provided, as suggested here (adding a timeout to the subprocess): https://github.com/ansible/ansible/issues/9544
Something like this would be nice: wait for 3600 seconds (or the desired time) for the task to finish; if it gets stuck, send kill -9 to the subprocess.
STEPS TO REPRODUCE
Target server:

$ mv /usr/bin/rpm /usr/bin/rpm.orig
$ vi /usr/bin/rpm

#!/bin/bash
while true; do sleep 5; done

$ chmod a+x /usr/bin/rpm

Control server: Launch ansible and try to remove any package... ---> It gets stuck <----
EXPECTED RESULTS
I expect the task to time out and the host to be marked as failed.
ACTUAL RESULTS
Ansible waits for task finalization for an infinite time.
This impacts several nightly support tasks, because playbooks do not end... (requiring manual intervention).