ansible / ansible-runner

A tool and python library that helps when interfacing with Ansible directly or as part of another system whether that be through a container image interface, as a standalone tool, or as a Python module that can be imported. The goal is to provide a stable and consistent interface abstraction to Ansible.

`pexpect.exceptions.TIMEOUT` due to `child.expect(pexpect.EOF, timeout=5)` #1343

Open tchernomax opened 6 months ago

tchernomax commented 6 months ago

Hello

We are running a lot of CI with ansible-runner, and since we moved to ansible_runner 2.3.5, some of those jobs hit this error:

  File "/usr/local/lib/python3.11/dist-packages/ansible_runner/__main__.py", line 874, in main
    res = run(**run_options)
          ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/ansible_runner/interface.py", line 210, in run
    r.run()
  File "/usr/local/lib/python3.11/dist-packages/ansible_runner/runner.py", line 348, in run
    child.expect(pexpect.EOF, timeout=5)
  File "/usr/local/lib/python3.11/dist-packages/pexpect/spawnbase.py", line 354, in expect
    return self.expect_list(compiled_pattern_list,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/pexpect/spawnbase.py", line 383, in expect_list
    return exp.expect_loop(timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/pexpect/expect.py", line 181, in expect_loop
    return self.timeout(e)
           ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/pexpect/expect.py", line 144, in timeout
    raise exc
pexpect.exceptions.TIMEOUT: <pexpect.pty_spawn.spawn object at 0x7f10538a7c90>

Some CI jobs work with ansible_runner 2.3.5, some don't, and the ones that fail always fail (a relaunch doesn't change anything).

I think the problem comes from the fact that child.expect(pexpect.EOF, timeout=5) never encounters pexpect.EOF in the child's output.

It might be because the pexpect.spawn fails, so child.expect(pexpect.EOF, timeout=5) has no output in which to find EOF (just guessing, I don't know much about pexpect).
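
To illustrate what I mean, here is a minimal standalone sketch (nothing to do with our CI; `cat` is just a stand-in for a child whose output never closes):

import pexpect

# Spawn a child that stays alive, so its pty output never reaches EOF.
child = pexpect.spawn("cat")
try:
    # Same call shape as runner.py: wait up to 5s for the output stream to close.
    child.expect(pexpect.EOF, timeout=5)
except pexpect.exceptions.TIMEOUT:
    # Raised whenever EOF is not seen within the timeout.
    print("no EOF within 5s")
finally:
    child.close(force=True)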

Regards

Thanks

alexchowle commented 3 months ago

Yep - I'm seeing the same problem. How was the magic number of 5 arrived at? The user-provided value self.config.pexpect_timeout is available to use instead.

alexchowle commented 3 months ago

Looking a little deeper at this, it seems that initialisation of the Runner ensures that pexpect.TIMEOUT and pexpect.EOF are automatically added to the expect pattern list, which prevents them from being raised as exceptions: https://github.com/ansible/ansible-runner/blob/devel/src/ansible_runner/config/_base.py#L205-206

Should the same be done for the new flush call, to prevent a TIMEOUT exception from being raised? https://github.com/ansible/ansible-runner/blob/devel/src/ansible_runner/runner.py#L350
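
For illustration, a minimal sketch of that behaviour with plain pexpect (not runner code): when EOF and TIMEOUT are part of the pattern list, expect() returns the index of whichever matched instead of raising.

import pexpect

# A stand-in child that never exits on its own, so EOF never arrives.
child = pexpect.spawn("cat")
index = child.expect([pexpect.EOF, pexpect.TIMEOUT], timeout=5)
if index == 0:
    print("stream closed (EOF)")
else:
    print("timed out waiting for EOF, but no exception was raised")
child.close(force=True)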

nitzmahone commented 2 months ago

The configurable timeout is only there to control how long it'll hang out looking for a prompt to respond to before rolling to the next loop iteration (and allowing other timeouts to be evaluated). The new final stream pump is only there to ensure we don't lose data that pexpect has already buffered from the already-exited child process (previously, any data that pexpect buffered between the last poll and reporting the child exit was dropped on the floor). 5s is already a relative eternity to wait for buffered data in that situation- the initial bug was 100% fixed in all our reproducers by setting the timeout to 0.01s. If it's still not there after 5s, it's likely either a bug in pexpect or the underlying platform stdio stream handling (I've personally encountered both)- waiting longer is unlikely to change anything. It'd probably make more sense to just eat the exception from that final buffer scrape and issue a warning that data could've been lost. I'm curious how often you're encountering this, and if it's specific to any particular distro/container runtime/whatever.
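
Roughly what I have in mind, as a sketch only (not a patch; the function name is made up):

import logging
import pexpect

logger = logging.getLogger(__name__)

def flush_remaining_output(child, timeout=5):
    """Drain whatever pexpect has already buffered from the exited child."""
    try:
        child.expect(pexpect.EOF, timeout=timeout)
    except pexpect.exceptions.TIMEOUT:
        # Swallow the timeout on the final scrape, but tell the user data may be missing.
        logger.warning("timed out draining final output; trailing events may have been lost")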

nitzmahone commented 2 months ago

@tchernomax If you've got a repro for a job that always fails, I'd love to see it- IIUC even a 100% empty stream should still yield EOF, but there are a lot of moving parts in there between the OS/console stream layers, Python, and pexpect, and this is definitely one of those areas where different kernels/distros/OSs/versions don't all behave in exactly the same way.

alexchowle commented 2 months ago

> The configurable timeout is only there to control how long it'll hang out looking for a prompt to respond to before rolling to the next loop iteration (and allowing other timeouts to be evaluated). The new final stream pump is only there to ensure we don't lose data that pexpect has already buffered from the already-exited child process (previously, any data that pexpect buffered between the last poll and reporting the child exit was dropped on the floor). 5s is already a relative eternity to wait for buffered data in that situation- the initial bug was 100% fixed in all our reproducers by setting the timeout to 0.01s. If it's still not there after 5s, it's likely either a bug in pexpect or the underlying platform stdio stream handling (I've personally encountered both)- waiting longer is unlikely to change anything. It'd probably make more sense to just eat the exception from that final buffer scrape and issue a warning that data could've been lost. I'm curious how often you're encountering this, and if it's specific to any particular distro/container runtime/whatever.

Thanks for replying. FWIW, this is in a container based on Alpine 3.18 with Python 3.11.4. The host OS is RHEL 8.

alexchowle commented 2 months ago

I'm having to get up to speed on pexpect and the way stdout is hooked to capture the CLI output, so I may be off the mark here: would there be any unintended consequences if the user's code in an event_handler callback took many seconds to complete?

alexchowle commented 2 months ago

I've got a repro for this issue - no container required. A combination of the number of hosts in the inventory (some of which are unreachable) and the time spent in the event_handler callback causes the TIMEOUT.

Code

import ansible_runner
from time import sleep

def my_cancel_handler():
    print("CANCELLATION")
    return False

def my_artifacts_handler(artifacts_dir):
    print(artifacts_dir)

def my_status_handler(data, runner_config):
    print(f"STATUS - config: {runner_config} data: {data}")       

def my_event_handler(data):
    print(f"EVENT - {data}")
    sleep(2)  # simulate a slow synchronous handler; this is what triggers the TIMEOUT
    return True

def my_finished_handler(runner):
    print("FINISH")
    print(f"\tSTATS: {runner.stats}")

r = ansible_runner.run(
    private_data_dir="./private_data_dir",
    playbook="playbook.yml",
    cancel_callback=my_cancel_handler,
    artifacts_handler=my_artifacts_handler,
    status_handler=my_status_handler,
    event_handler=my_event_handler,
    finished_callback=my_finished_handler,
    settings={"suppress_ansible_output": False},
    timeout=1200,
    ignore_logging=False,
    debug=True)

Inventory (I've redacted my host names):

[all]
HOST_1
HOST_2
HOST_3
HOST_4
HOST_5
HOST_6
HOST_7
HOST_8
HOST_9
UNREACHABLE_HOST_1
UNREACHABLE_HOST_2
UNREACHABLE_HOST_3
UNREACHABLE_HOST_4
UNREACHABLE_HOST_5
UNREACHABLE_HOST_6
UNREACHABLE_HOST_7
UNREACHABLE_HOST_8
UNREACHABLE_HOST_9
UNREACHABLE_HOST_10
UNREACHABLE_HOST_11
UNREACHABLE_HOST_12

Playbook

---
- name: breaker
  hosts: all
  gather_facts: no

  tasks:
    - name: Ping stuff
      ansible.builtin.ping:

Error (I've mangled this slightly to use the redacted host names; note that the first character of the hostname is stripped, which is not part of the redaction):

pexpect.exceptions.TIMEOUT: <pexpect.pty_spawn.spawn object at 0x7fd6fa2d12d0>
command: /home/me/.local/bin/ansible-playbook
args: [b'/home/me/.local/bin/ansible-playbook', b'-i', b'/home/me/runner_recker/private_data_dir/inventory', b'playbook.yml']
buffer (last 100 chars): 'NREACHABLE_HOST_3 port 22: Connection timed out", "unreachable": true}\r\n\x1b[Ke30=\x1b[4D\x1b['
before (last 100 chars): 'NREACHABLE_HOST_3 port 22: Connection timed out", "unreachable": true}\r\n\x1b[Ke30=\x1b[4D\x1b['
after: <class 'pexpect.exceptions.TIMEOUT'>
match: None
match_index: None
exitstatus: 4
flag_eof: False
pid: 72622
child_fd: 7
closed: False
timeout: 30
delimiter: <class 'pexpect.exceptions.EOF'>
logfile: None
logfile_read: <ansible_runner.utils.OutputEventFilter object at 0x7fd6fa2c8610>
logfile_send: None
maxread: 2000
ignorecase: False
searchwindowsize: None
delaybeforesend: 0.05
delayafterclose: 0.1
delayafterterminate: 0.1
searcher: searcher_re:
    0: EOF
<pexpect.pty_spawn.spawn object at 0x7fd6fa2d12d0>
command: /home/me/.local/bin/ansible-playbook
args: [b'/home/me/.local/bin/ansible-playbook', b'-i', b'/home/me/runner_recker/private_data_dir/inventory', b'playbook.yml']
buffer (last 100 chars): 'NREACHABLE_HOST_3 port 22: Connection timed out", "unreachable": true}\r\n\x1b[Ke30=\x1b[4D\x1b['
before (last 100 chars): 'NREACHABLE_HOST_3 port 22: Connection timed out", "unreachable": true}\r\n\x1b[Ke30=\x1b[4D\x1b['
after: <class 'pexpect.exceptions.TIMEOUT'>
match: None
match_index: None
exitstatus: 4
flag_eof: False
pid: 72622
child_fd: 7
closed: False
timeout: 30
delimiter: <class 'pexpect.exceptions.EOF'>
logfile: None
logfile_read: <ansible_runner.utils.OutputEventFilter object at 0x7fd6fa2c8610>
logfile_send: None
maxread: 2000
ignorecase: False
searchwindowsize: None
delaybeforesend: 0.05
delayafterclose: 0.1
delayafterterminate: 0.1
searcher: searcher_re:
    0: EOF

nitzmahone commented 2 months ago

Aha, a slow synchronous event handler blocking the stream consumption... Thanks for the repro- that clarified the issue nicely!

We're generally only using really lightweight event handlers that unconditionally persist to disk for later analysis, so even with the rare case of a lot of data in that final buffer, the overhead is pretty negligible.

This is one of those unfortunate problems with having pexpect wrapped up in the stream handling even when we're 100% sure it's no longer needed. We've got some longer-term ideas to eliminate it entirely, but they rely on broader architectural changes to Ansible Core that are still in the formative stages.

IMO this problem was arguably worse before the new stream drain fix, since those final events were always silently lost- now at least you know there was a problem (where before you just got a truncated event stream). The simple exception catch + warning I proposed to "fix" this though is almost certainly not the right thing. We'll need to balance some accommodation for long-running/blocking event handlers with the need to respect other timeouts properly. A proper fix will probably loop-poll for EOF and consult the stream position to see if it moved between calls and/or track in-flight event handlers before deciding whether to give up or reissue another poll.
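
Very roughly, something along these lines (names and the progress heuristic are illustrative only, not a patch):

import pexpect

def drain_until_eof(child, slice_timeout=1.0, max_idle_polls=5):
    """Return True if EOF was reached, False if the stream went idle without closing."""
    idle_polls = 0
    last_seen = 0
    while idle_polls < max_idle_polls:
        try:
            child.expect(pexpect.EOF, timeout=slice_timeout)
            return True
        except pexpect.exceptions.TIMEOUT:
            seen = len(child.before or b"")  # rough progress check on the buffered data
            if seen > last_seen:
                # Data is still moving (e.g. a slow event handler is catching up), keep polling.
                last_seen = seen
                idle_polls = 0
            else:
                idle_polls += 1
    return False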

Meantime, just spitballing, but is there any reason your long-running event handler needs to be synchronous with the input pipeline (e.g., does the long-running part affect a filtering decision?), or could you offload it to a threadpool or other mechanism that wouldn't block the pump?
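
For example (sketch only, reusing the handler name from your repro, and assuming the slow part doesn't affect the True/False filtering decision):

from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=4)

def slow_processing(data):
    ...  # the expensive work that previously ran inline for ~2s per event

def my_event_handler(data):
    pool.submit(slow_processing, data)  # hand off, so the callback returns immediately
    return True                         # keep the event, as before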

alexchowle commented 2 months ago

We've already moved the event handler logic to another thread and the problem has gone away. Nothing like a slow-consumer problem to remind you why you need to be quick in a callback. Happy days!