Open akshat87 opened 7 months ago
Do you see the jobs finish in the UI? How long do these workflows run, and how long did the CLI command wait before returning the RemoteDisconnected error?
I've seen similar in AWX 23.3.1. The template involves an ansible.builtin.uri call to VMware orchestrator, followed by an ansible.builtin.wait_for_connection. The job log pauses here:
Using module file /usr/local/lib/python3.11/site-packages/ansible/modules/ping.py
Pipelining is enabled.
<host.fqdn> ESTABLISH SSH CONNECTION FOR USER: $ANSIBLE_REMOTE_USER
<host.fqdn> SSH: EXEC ssh -vvv -o ServerAliveInterval=30 -o ControlMaster=auto -o ControlPersist=60 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="$ANSIBLE_REMOTE_USER"' -o ConnectTimeout=120 -o 'ControlPath="/tmp/ansible-root-%h"' host.fqdn '/bin/sh -c '"'"'/usr/bin/python && sleep 0'"'"''
wait_for_connection: attempting ping module test
sending connection check: [b'ssh', b'-vvv', b'-o', b'ServerAliveInterval=30', b'-o', b'ControlMaster=auto', b'-o', b'ControlPersist=60', b'-o', b'StrictHostKeyChecking=no', b'-o', b'UserKnownHostsFile=/dev/null', b'-o', b'StrictHostKeyChecking=no', b'-o', b'KbdInteractiveAuthentication=no', b'-o', b'PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey', b'-o', b'PasswordAuthentication=no', b'-o', b'User="$ANSIBLE_REMOTE_USER"', b'-o', b'ConnectTimeout=120', b'-o', b'ControlPath="/tmp/ansible-root-%h"', b'-O', b'check', b'host.fqdn']
While the job log in the WebUI hangs here, the awx-task-runner-blah-blah container repeatedly (about every second or so) attempts to make a connection.
You can watch the connection attempts by opening a shell in the task-runner container, identifying the parent ansible process/thread ID, and then inferring the PID/TID from the active children: essentially ls -l /proc and, if you suspect the child PID to be in the range of 200 to 399, while true; do cat /proc/[2,3]*/cmdline; done.
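The /proc inspection described above can also be sketched in Python, which avoids guessing a PID range. This is only a rough sketch, not part of AWX or Ansible: the helper names (parse_cmdline, parse_ppid, children_of) are hypothetical, and it assumes a Linux /proc layout, a Python interpreter inside the container, and that the parent ansible PID is already known.

```python
from pathlib import Path


def parse_cmdline(raw: bytes) -> list[str]:
    """Split the NUL-separated contents of /proc/<pid>/cmdline into arguments."""
    return [part.decode(errors="replace") for part in raw.split(b"\0") if part]


def parse_ppid(stat_text: str) -> int:
    """Extract the parent PID from the text of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split on the last ')'.
    after_comm = stat_text[stat_text.rfind(")") + 1:].split()
    return int(after_comm[1])  # [0] is the process state, [1] is the ppid


def children_of(parent_pid: int, proc_root: str = "/proc") -> dict[int, list[str]]:
    """Map each direct child PID of parent_pid to its command line."""
    found = {}
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue
        try:
            if parse_ppid((entry / "stat").read_text()) == parent_pid:
                found[int(entry.name)] = parse_cmdline(
                    (entry / "cmdline").read_bytes()
                )
        except (OSError, ValueError, IndexError):
            continue  # the process exited or is unreadable; skip it
    return found
```

Calling children_of(<parent ansible PID>) in a loop should show the short-lived ssh children as they are spawned, without relying on a shell glob over PID prefixes.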
I can provide additional info if needed. The AWX install lives on a Rancher cluster running k8s v1.24.17 on rhel 7 hosts. Ingress is nginx, networking is Canal, pvc provided by portworx.
Adding relevant bits of our ansible.cfg
[defaults]
home = .ansible
roles_path = roles
playbook_dir = playbooks
transport = smart
collections_path = .ansible/collections:/usr/share/ansible/collections:.venv/lib/python3.11/site-packages/ansible_collections/
remote_user = $ANSIBLE_REMOTE_USER
remote_tmp = /tmp/$USER/.ansible
gather_subset = all
interpreter_python = auto
host_key_checking = False
timeout = 120
verbosity = 1
module_name = shell
ansible_managed = Ansible managed: {file} modified on %Y-%m-%d %H:%M by root on {host}
system_warnings = True
deprecation_warnings = True
command_warnings = False
callbacks_enabled = ansible.posix.profile_tasks
stdout_callback = yaml
display_skipped_hosts = False
retry_files_enabled = False
var_compression_level = 9
jinja2_extensions = jinja2.ext.do
[callback_profile_tasks]
task_output_limit = 5
[inventory]
enable_plugins=ansible.builtin.constructed, host_list, script, auto, yaml, ini, toml
[privilege_escalation]
become_ask_pass=False
become_method=sudo
become_flags="-iS"
[ssh_connection]
ssh_args = -o ServerAliveInterval=30 -o ControlMaster=auto -o ControlPersist=60 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null
control_path = /tmp/ansible-root-%%h
pipelining = True
transfer_method = smart
[persistent_connection]
connect_timeout = 30
connect_retries = 30
connect_interval = 1
Note that I'm substituting $ANSIBLE_REMOTE_USER for the actual user name.
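For cross-checking the config against the logged ssh command, here is a small sketch that reads the relevant settings and lists the -o options they appear to produce. It is an illustration only: ssh_options is a hypothetical helper, the mapping of timeout to ConnectTimeout and of %%h to %h is inferred from the log output above rather than taken from Ansible's source, and interpolation is disabled because values such as ansible_managed contain bare % characters.

```python
import configparser
import shlex

# Excerpt of the ansible.cfg shown above (only the keys used here).
CFG = """\
[defaults]
timeout = 120

[ssh_connection]
ssh_args = -o ServerAliveInterval=30 -o ControlMaster=auto -o ControlPersist=60 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null
control_path = /tmp/ansible-root-%%h
"""


def ssh_options(cfg_text: str) -> dict[str, str]:
    """Return the ssh -o options implied by the config excerpt."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    tokens = shlex.split(parser["ssh_connection"]["ssh_args"])
    # Pair each "-o" flag with the "Name=value" token that follows it.
    opts = dict(
        tok.split("=", 1)
        for flag, tok in zip(tokens[::2], tokens[1::2])
        if flag == "-o"
    )
    # Assumption: these two settings surface as the options seen in the log.
    opts["ConnectTimeout"] = parser["defaults"]["timeout"]
    opts["ControlPath"] = parser["ssh_connection"]["control_path"].replace("%%", "%")
    return opts


for name, value in ssh_options(CFG).items():
    print(f"-o {name}={value}")
```

Comparing this list with the SSH: EXEC line makes it easier to see which options come from ansible.cfg and which are added elsewhere.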
Found the error that ended the task above.
{"log":"2024-04-26 15:10:21,352 INFO [c3b7da2d511940cd9f42ad53edf60a96] awx.main.scheduler Workflow job 29241 failed due to reason: No error handling path for workflow job node(s) [(4838,error)]. Workflow job node(s) missing unified job template and error handling path [].\n","stream":"stderr","time":"2024-04-26T15:10:21.353827271Z"}
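To pull the workflow job ID and the failing node out of container log lines like the one above, something along these lines may help. It assumes the JSON-per-line container log format shown (log, stream and time keys) and that the message wording matches this single sample; parse_workflow_failure is a hypothetical helper name.

```python
import json
import re

# The log line quoted above, kept raw so the JSON "\n" escape is preserved.
LINE = r'''{"log":"2024-04-26 15:10:21,352 INFO [c3b7da2d511940cd9f42ad53edf60a96] awx.main.scheduler Workflow job 29241 failed due to reason: No error handling path for workflow job node(s) [(4838,error)]. Workflow job node(s) missing unified job template and error handling path [].\n","stream":"stderr","time":"2024-04-26T15:10:21.353827271Z"}'''


def parse_workflow_failure(line: str):
    """Return the workflow job ID and (node_id, status) pairs, or None."""
    record = json.loads(line)
    message = record["log"].strip()
    match = re.search(r"Workflow job (\d+) failed due to reason: (.*)", message)
    if match is None:
        return None
    # Node entries look like "(4838,error)"; "node(s)" does not match
    # because the pattern requires digits followed by a comma.
    nodes = [
        (int(node_id), status)
        for node_id, status in re.findall(r"\((\d+),(\w+)\)", match.group(2))
    ]
    return {
        "workflow_job": int(match.group(1)),
        "nodes": nodes,
        "time": record["time"],
    }


print(parse_workflow_failure(LINE))
```

For this line the result identifies workflow job 29241 and node 4838 with status error, which is the node to look at in the workflow visualizer.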
Bug Summary
The awx workflow_job_templates launch --wait command fails with ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response')).
It should wait for the workflow to complete in Ansible Tower, but instead the remote connection is closed.
AWX version
24.2.0
Select the relevant components
Installation method
N/A
Modifications
no
Ansible version
Ansible Automation Platform Controller 4.3.6
Operating system
redhat linux 8
Web browser
No response
Steps to reproduce
awx workflow_job_templates launch --wait command fails with ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
Expected results
It should wait for the workflow to complete in Ansible Tower, but instead the remote connection is closed.
Actual results
awx workflow_job_templates launch --wait command fails with ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
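As a possible workaround while --wait is unreliable, a client can poll the workflow job status itself and tolerate a dropped connection. The following is a minimal sketch under stated assumptions, not how awx --wait is implemented: fetch_status is a hypothetical caller-supplied function that returns the job's status string (for example by reading the status field of the workflow job from the API), and the terminal status names are my assumption about AWX's values.

```python
import http.client
import time

# Assumed terminal statuses; adjust if your AWX version differs.
TERMINAL = {"successful", "failed", "error", "canceled"}


def wait_for_status(fetch_status, interval=5.0, max_errors=5,
                    retry_on=(ConnectionError,), sleep=time.sleep):
    """Poll fetch_status() until a terminal status, retrying dropped connections."""
    errors = 0
    while True:
        try:
            status = fetch_status()
        except retry_on:
            # http.client.RemoteDisconnected is a ConnectionError subclass.
            errors += 1
            if errors > max_errors:
                raise  # too many consecutive failures; give up
            sleep(interval)
            continue
        errors = 0  # a successful poll resets the failure counter
        if status in TERMINAL:
            return status
        sleep(interval)


# Demo with a fake fetcher: one dropped connection, then two normal polls.
_responses = iter([
    http.client.RemoteDisconnected("Remote end closed connection without response"),
    "running",
    "successful",
])


def _fake_fetch():
    item = next(_responses)
    if isinstance(item, Exception):
        raise item
    return item


print(wait_for_status(_fake_fetch, interval=0))  # prints "successful"
```

If the HTTP client wraps the error in its own exception type, as the ('Connection aborted.', RemoteDisconnected(...)) tuple above suggests, pass that type via retry_on.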
Additional information
No response