ansible / awx

AWX provides a web-based user interface, REST API, and task engine built on top of Ansible. It is one of the upstream projects for Red Hat Ansible Automation Platform.

awx workflow_job_templates launch --wait command fails with ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response')) #15104

Open akshat87 opened 7 months ago

akshat87 commented 7 months ago

Bug Summary

awx workflow_job_templates launch --wait command fails with ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

The command should wait for the workflow to complete in Ansible Tower; instead, the remote end closes the connection.

AWX version

24.2.0

Installation method

N/A

Modifications

no

Ansible version

Ansible Automation Platform Controller 4.3.6

Operating system

redhat linux 8

Web browser

No response

Steps to reproduce

awx workflow_job_templates launch --wait command fails with ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

Expected results

The command should wait for the workflow to complete in Ansible Tower; instead, the remote end closes the connection.

Actual results

awx workflow_job_templates launch --wait command fails with ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

Additional information

No response

fosterseth commented 7 months ago

Do you see the jobs finish in the UI? How long do these workflows run, and how long did the CLI command wait before returning the RemoteDisconnected error?
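One way to tell a dropped HTTP connection apart from a failed workflow is to poll the job status yourself and tolerate a few connection errors in a row. The following is only a sketch of that idea, not how the awx CLI implements --wait. `wait_for_workflow` and `fetch_status` are hypothetical names; the assumption is that `fetch_status` would GET /api/v2/workflow_jobs/<id>/ on your controller and return the `status` field.

```python
import time
from typing import Callable

# Job states after which the status no longer changes.
TERMINAL_STATES = {"successful", "failed", "error", "canceled"}


def wait_for_workflow(
    fetch_status: Callable[[], str],
    interval: float = 5.0,
    timeout: float = 3600.0,
    max_consecutive_errors: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll fetch_status() until the workflow job reaches a terminal state.

    fetch_status is a hypothetical callable supplied by the caller. Connection
    failures (any OSError, which covers RemoteDisconnected and the connection
    errors raised by common HTTP clients) are retried until
    max_consecutive_errors of them occur back to back.
    """
    deadline = time.monotonic() + timeout
    errors = 0
    while True:
        try:
            status = fetch_status()
        except OSError:
            errors += 1
            if errors >= max_consecutive_errors:
                raise  # the endpoint looks really unreachable, give up
        else:
            errors = 0  # a successful poll resets the error streak
            if status in TERMINAL_STATES:
                return status
        if time.monotonic() >= deadline:
            raise TimeoutError("workflow did not finish before the timeout")
        sleep(interval)
```

If the workflow reaches a terminal state while the CLI still reports RemoteDisconnected, that would point at something between the client and the API (ingress or proxy idle timeouts, for example) rather than at the workflow itself.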

XakV commented 7 months ago

I've seen similar in AWX 23.3.1. The template involves an ansible.builtin.uri call to VMware orchestrator, followed by an ansible.builtin.wait_for_connection. The job log pauses here:

Using module file /usr/local/lib/python3.11/site-packages/ansible/modules/ping.py
Pipelining is enabled.
<host.fqdn> ESTABLISH SSH CONNECTION FOR USER: $ANSIBLE_REMOTE_USER
<host.fqdn> SSH: EXEC ssh -vvv -o ServerAliveInterval=30 -o ControlMaster=auto -o ControlPersist=60 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="$ANSIBLE_REMOTE_USER"' -o ConnectTimeout=120 -o 'ControlPath="/tmp/ansible-root-%h"' host.fqdn '/bin/sh -c '"'"'/usr/bin/python && sleep 0'"'"''
wait_for_connection: attempting ping module test
sending connection check: [b'ssh', b'-vvv', b'-o', b'ServerAliveInterval=30', b'-o', b'ControlMaster=auto', b'-o', b'ControlPersist=60', b'-o', b'StrictHostKeyChecking=no', b'-o', b'UserKnownHostsFile=/dev/null', b'-o', b'StrictHostKeyChecking=no', b'-o', b'KbdInteractiveAuthentication=no', b'-o', b'PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey', b'-o', b'PasswordAuthentication=no', b'-o', b'User="$ANSIBLE_REMOTE_USER"', b'-o', b'ConnectTimeout=120', b'-o', b'ControlPath="/tmp/ansible-root-%h"', b'-O', b'check', b'host.fqdn']

While the job log in the WebUI hangs here, the awx-task-runner-blah-blah container repeatedly (about every second or so) attempts to make a connection.

You can watch the connection attempts by opening a shell in the task-runner container, identifying the parent ansible process/thread ID, and then inferring the PID/TID of the connection attempts from its active children. Essentially: run ls -l /proc, and if you suspect the child PIDs are in the range of 200 to 399, run while true; do cat /proc/[2,3]*/cmdline; done.
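The same check can be sketched in Python if you would rather get one readable command line per PID than the raw NUL-separated output of cat. This is only an illustration; `child_cmdlines` is a made-up helper name, and the 200 to 399 range is just the example range from above.

```python
from pathlib import Path


def child_cmdlines(low: int, high: int, proc_root: str = "/proc") -> dict[int, str]:
    """Return {pid: command line} for every PID in [low, high] under proc_root."""
    found = {}
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # skip non-process entries such as /proc/self
        pid = int(entry.name)
        if not low <= pid <= high:
            continue
        try:
            raw = (entry / "cmdline").read_bytes()
        except OSError:
            continue  # the process exited between listing and reading
        if raw:
            # Arguments in /proc/<pid>/cmdline are separated by NUL bytes.
            found[pid] = raw.replace(b"\0", b" ").decode(errors="replace").strip()
    return found
```

Calling child_cmdlines(200, 399) in a loop inside the task-runner container should show the short-lived ssh children as they come and go.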

I can provide additional info if needed. The AWX install lives on a Rancher cluster running k8s v1.24.17 on RHEL 7 hosts. Ingress is nginx, networking is Canal, and PVCs are provided by Portworx.

XakV commented 7 months ago

Adding relevant bits of our ansible.cfg

[defaults]

home = .ansible
roles_path    = roles
playbook_dir = playbooks
transport = smart
collections_path = .ansible/collections:/usr/share/ansible/collections:.venv/lib/python3.11/site-packages/ansible_collections/
remote_user = $ANSIBLE_REMOTE_USER
remote_tmp     = /tmp/$USER/.ansible
gather_subset = all
interpreter_python = auto
host_key_checking = False
timeout = 120
verbosity = 1
module_name = shell
ansible_managed = Ansible managed: {file} modified on %Y-%m-%d %H:%M by root on {host}
system_warnings = True
deprecation_warnings = True
command_warnings = False
callbacks_enabled = ansible.posix.profile_tasks
stdout_callback = yaml
display_skipped_hosts = False
retry_files_enabled = False
var_compression_level = 9
jinja2_extensions = jinja2.ext.do

[callback_profile_tasks]
task_output_limit = 5

[inventory]
enable_plugins=ansible.builtin.constructed, host_list, script, auto, yaml, ini, toml

[privilege_escalation]
become_ask_pass=False
become_method=sudo
become_flags="-iS"

[ssh_connection]

ssh_args = -o ServerAliveInterval=30 -o ControlMaster=auto -o ControlPersist=60 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null
control_path = /tmp/ansible-root-%%h
pipelining = True
transfer_method = smart

[persistent_connection]

connect_timeout = 30
connect_retries = 30
connect_interval = 1

Note that I'm substituting $ANSIBLE_REMOTE_USER for the actual user name.

XakV commented 7 months ago

Found the error that ended the task above.

{"log":"2024-04-26 15:10:21,352 INFO [c3b7da2d511940cd9f42ad53edf60a96] awx.main.scheduler Workflow job 29241 failed due to reason: No error handling path for workflow job node(s) [(4838,error)]. Workflow job node(s) missing unified job template and error handling path [].\n","stream":"stderr","time":"2024-04-26T15:10:21.353827271Z"}
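For anyone grepping container logs for the same failure, here is a small sketch that pulls the workflow job ID and the reason out of a JSON log line like the one above. `parse_workflow_failure` is a hypothetical helper, and the regular expression is only inferred from this single message, so other AWX versions may word it differently.

```python
import json
import re

# Pattern inferred from the one scheduler message quoted above (an assumption).
FAILURE_RE = re.compile(
    r"Workflow job (?P<job_id>\d+) failed due to reason: (?P<reason>.*)"
)


def parse_workflow_failure(line: str):
    """Return (job_id, reason) for a JSON log line, or None if it does not match."""
    record = json.loads(line)
    match = FAILURE_RE.search(record.get("log", ""))
    if match is None:
        return None
    return int(match["job_id"]), match["reason"].strip()
```

Run against the line above, this yields job 29241 together with the "No error handling path" reason, which makes it easier to correlate scheduler failures with what the CLI reported at the same time.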