tchernomax opened this issue 8 months ago
Yep - I'm seeing the same problem. How was the magic number of 5 arrived at? The user-provided value self.config.pexpect_timeout is available to use.
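For illustration, the change I have in mind is roughly the following; the child.expect(pexpect.EOF, timeout=5) call is the flush discussed in this thread, and the wrapper function is just for the sketch, not the Runner's actual structure:

import pexpect


def flush_remaining_output(child: pexpect.spawn, pexpect_timeout: float) -> None:
    # Reuse the user-provided timeout (self.config.pexpect_timeout in the
    # Runner) for the final flush instead of a hard-coded 5 seconds.
    child.expect(pexpect.EOF, timeout=pexpect_timeout)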
Looking a little deeper at this, it seems that initialisation of the Runner ensures that pexpect.TIMEOUT and pexpect.EOF are automatically added to the expect patterns, thus preventing them from raising an exception: https://github.com/ansible/ansible-runner/blob/devel/src/ansible_runner/config/_base.py#L205-206
Should the same be done on the new flush instruction to prevent a TIMEOUT exception being raised? https://github.com/ansible/ansible-runner/blob/devel/src/ansible_runner/runner.py#L350
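To make the distinction concrete, here is a minimal standalone sketch of the two expect styles as I read them; the spawned command and timeout values are arbitrary, and this is not the Runner's actual code:

import pexpect

# Arbitrary child process purely for demonstration.
child = pexpect.spawn("echo hello")

# The style the Runner relies on after initialisation adds the extra patterns:
# with TIMEOUT and EOF in the pattern list, expect() returns an index rather
# than raising on either condition.
index = child.expect([pexpect.TIMEOUT, pexpect.EOF], timeout=5)
print("matched pattern index:", index)

# The style used by the new flush: expecting EOF alone raises
# pexpect.exceptions.TIMEOUT if the stream does not end within the timeout.
# child.expect(pexpect.EOF, timeout=5)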
The configurable timeout is only there to control how long it'll hang out looking for a prompt to respond to before rolling to the next loop iteration (and allowing other timeouts to be evaluated). The new final stream pump is only there to ensure we don't lose data that pexpect has already buffered from the already-exited child process (previously, any data that pexpect buffered between the last poll and reporting the child exit was dropped on the floor).
5s is already a relative eternity to wait for buffered data in that situation; the initial bug was 100% fixed in all our reproducers by setting the timeout to 0.01s. If it's still not there after 5s, it's likely either a bug in pexpect or in the underlying platform stdio stream handling (I've personally encountered both), and waiting longer is unlikely to change anything. It'd probably make more sense to just eat the exception from that final buffer scrape and issue a warning that data could've been lost.
I'm curious how often you're encountering this, and whether it's specific to any particular distro/container runtime/whatever.
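A minimal sketch of that "eat the exception and warn" idea, assuming the final scrape happens via child.expect(pexpect.EOF, timeout=...); the names here are illustrative rather than ansible-runner's actual code:

import logging

import pexpect

logger = logging.getLogger(__name__)


def final_drain(child: pexpect.spawn, timeout: float = 5) -> None:
    # Pump any output pexpect has already buffered from the exited child.
    # If EOF never shows up within the timeout, swallow the TIMEOUT and warn
    # instead of failing the whole run.
    try:
        child.expect(pexpect.EOF, timeout=timeout)
    except pexpect.exceptions.TIMEOUT:
        logger.warning(
            "Timed out draining final output from the child process; "
            "some trailing stdout events may have been lost."
        )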
@tchernomax If you've got a repro for a job that always fails, I'd love to see it- IIUC even a 100% empty stream should still yield EOF, but there are a lot of moving parts in there between the OS/console stream layers, Python, and pexpect, and this is definitely one of those areas where different kernels/distros/OSs/versions don't all behave in exactly the same way.
Thanks for replying. FWIW, this is in a container based on Alpine 3.18 with Python 3.11.4. The host OS is RHEL 8.
I'm having to spin up on pexpect and the way stdout is hooked to get the CLI output, so I may be off the mark here, but would there be any unintended consequences if the user's code in an event_handler callback took many seconds to complete?
I've got a repro for this issue - no container required. A combination of the number of hosts in the inventory (some of which are unreachable) and the time spent in the event_handler callback causes the TIMEOUT.
Code
import ansible_runner
from time import sleep


def my_cancel_handler():
    print("CANCELLATION")
    return False


def my_artifacts_handler(artifacts_dir):
    print(artifacts_dir)


def my_status_handler(data, runner_config):
    print(f"STATUS - config: {runner_config} data: {data}")


def my_event_handler(data):
    print(f"EVENT - {data}")
    sleep(2)
    return True


def my_finished_handler(runner):
    print("FINISH")
    print(f"\tSTATS: {runner.stats}")


r = ansible_runner.run(
    private_data_dir="./private_data_dir",
    playbook="playbook.yml",
    cancel_callback=my_cancel_handler,
    artifacts_handler=my_artifacts_handler,
    status_handler=my_status_handler,
    event_handler=my_event_handler,
    finished_callback=my_finished_handler,
    settings={"suppress_ansible_output": False},
    timeout=1200,
    ignore_logging=False,
    debug=True)
Inventory (I've redacted my host names):
[all]
HOST_1
HOST_2
HOST_3
HOST_4
HOST_5
HOST_6
HOST_7
HOST_8
HOST_9
UNREACHABLE_HOST_1
UNREACHABLE_HOST_2
UNREACHABLE_HOST_3
UNREACHABLE_HOST_4
UNREACHABLE_HOST_5
UNREACHABLE_HOST_6
UNREACHABLE_HOST_7
UNREACHABLE_HOST_8
UNREACHABLE_HOST_9
UNREACHABLE_HOST_10
UNREACHABLE_HOST_11
UNREACHABLE_HOST_12
Playbook
---
- name: breaker
  hosts: all
  gather_facts: no
  tasks:
    - name: Ping stuff
      ansible.builtin.ping:
Error (I've mangled this slightly to use the redacted host names; note that the first character of the hostname is stripped - this is not part of the redaction):
pexpect.exceptions.TIMEOUT: <pexpect.pty_spawn.spawn object at 0x7fd6fa2d12d0>
command: /home/me/.local/bin/ansible-playbook
args: [b'/home/me/.local/bin/ansible-playbook', b'-i', b'/home/me/runner_recker/private_data_dir/inventory', b'playbook.yml']
buffer (last 100 chars): 'NREACHABLE_HOST_3 port 22: Connection timed out", "unreachable": true}\r\n\x1b[Ke30=\x1b[4D\x1b['
before (last 100 chars): 'NREACHABLE_HOST_3 port 22: Connection timed out", "unreachable": true}\r\n\x1b[Ke30=\x1b[4D\x1b['
after: <class 'pexpect.exceptions.TIMEOUT'>
match: None
match_index: None
exitstatus: 4
flag_eof: False
pid: 72622
child_fd: 7
closed: False
timeout: 30
delimiter: <class 'pexpect.exceptions.EOF'>
logfile: None
logfile_read: <ansible_runner.utils.OutputEventFilter object at 0x7fd6fa2c8610>
logfile_send: None
maxread: 2000
ignorecase: False
searchwindowsize: None
delaybeforesend: 0.05
delayafterclose: 0.1
delayafterterminate: 0.1
searcher: searcher_re:
0: EOF
Aha, a slow synchronous event handler blocking the stream consumption... Thanks for the repro- that clarified the issue nicely!
We're generally only using really lightweight event handlers that unconditionally persist to disk for later analysis, so even with the rare case of a lot of data in that final buffer, the overhead is pretty negligible.
This is one of those unfortunate problems with having pexpect wrapped up in the stream handling even when we're 100% sure it's no longer needed. We've got some longer-term ideas to eliminate it entirely, but they rely on broader architectural changes to Ansible Core that are still in the formative stages.
IMO this problem was arguably worse before the new stream drain fix, since those final events were always silently lost- now at least you know there was a problem (where before you just got a truncated event stream). The simple exception catch + warning I proposed to "fix" this though is almost certainly not the right thing. We'll need to balance some accommodation for long-running/blocking event handlers with the need to respect other timeouts properly. A proper fix will probably loop-poll for EOF and consult the stream position to see if it moved between calls and/or track in-flight event handlers before deciding whether to give up or reissue another poll.
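For concreteness, a loop-poll along those lines might look roughly like this; it's purely a hypothetical sketch (only the pexpect calls and attributes are real API, everything else is invented for illustration), not a proposed patch:

import pexpect


def drain_until_eof(child: pexpect.spawn,
                    poll_timeout: float = 1.0,
                    max_idle_polls: int = 5) -> bool:
    # Keep polling for EOF, but only give up once the stream stops moving.
    # A poll that times out while new data arrived resets the idle counter,
    # so a slow-but-still-progressing consumer doesn't abort the drain.
    last_buffered = -1
    idle_polls = 0
    while idle_polls < max_idle_polls:
        try:
            child.expect(pexpect.EOF, timeout=poll_timeout)
            return True
        except pexpect.exceptions.TIMEOUT:
            # After a TIMEOUT, child.before holds the unmatched buffer; if it
            # grew since the last poll, data is still flowing.
            buffered = len(child.before or "")
            if buffered != last_buffered:
                idle_polls = 0
                last_buffered = buffered
            else:
                idle_polls += 1
    return False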
Meantime, just spitballing, but is there any reason your long-running event handler needs to be synchronous with the input pipeline (e.g., does the long-running part affect a filtering decision?), or could you offload it to a threadpool or other mechanism that wouldn't block the pump?
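Something like the following is what I have in mind, assuming the expensive part of your handler doesn't need to influence a filtering decision (the pool size and my_slow_processing are illustrative stand-ins, not part of ansible-runner; the run() arguments mirror your repro above):

from concurrent.futures import ThreadPoolExecutor
from time import sleep

import ansible_runner

# One shared pool so event processing can't block ansible-runner's output pump.
executor = ThreadPoolExecutor(max_workers=4)


def my_slow_processing(data):
    sleep(2)  # stand-in for the real expensive work (DB writes, API calls, ...)


def my_event_handler(data):
    # Return immediately; the slow work happens off the callback thread.
    executor.submit(my_slow_processing, data)
    return True  # keep the event in the stream


r = ansible_runner.run(
    private_data_dir="./private_data_dir",
    playbook="playbook.yml",
    event_handler=my_event_handler,
)
executor.shutdown(wait=True)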
We've already moved the event handler logic to another thread and the problem has gone away. Nothing like a slow-consumer problem to remind you why you need to be quick in a callback. Happy days!
Hello
We are running a lot of CI with ansible-runner, and since we moved to ansible_runner 2.3.5 some of them experience this:
Some CI jobs work (with ansible_runner 2.3.5), some don't (also with ansible_runner 2.3.5), and those that don't always fail (a relaunch doesn't change anything).
I think the problem comes from the fact that child.expect(pexpect.EOF, timeout=5) never encounters pexpect.EOF in the child output. It might be because the pexpect.spawn fails, so child.expect(pexpect.EOF, timeout=5) has no output in which to find EOF (just guessing, I don't know much about pexpect).
Regards
Thanks