ParallelSSH / parallel-ssh

Asynchronous parallel SSH client library.
https://parallel-ssh.org
GNU Lesser General Public License v2.1

Intermittent error after calling disconnect: "libev: I/O watcher with invalid fd found in epoll_ctl" #378

Open todds02 opened 1 year ago

todds02 commented 1 year ago

Describe the bug

I am using the parallel-ssh single-host client (SSHClient) under gevent with Python 3. Intermittently, after calling disconnect() on the SSH session, the Python interpreter aborts with this assertion failure:

"python3: ev_epoll.c:153: epoll_modify: Assertion `("libev: I/O watcher with invalid fd found in epoll_ctl", errno != EBADF && errno != ELOOP && errno != EINVAL)' failed."

To Reproduce

I can reproduce this with a slightly modified version of the single-host client example from the quickstart (https://parallel-ssh.readthedocs.io/en/latest/quickstart.html#single-host-client):

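import gevent
from pssh.clients import SSHClient

# Placeholder credentials; substitute real values.
USERNAME = 'user'
PASSWORD = 'password'

attempts = 0
while True:
    attempts += 1
    host = 'server.example.com_'
    cmd = 'ls -al /'
    print('Connection attempt {}'.format(attempts))
    # Open a new connection on every pass through the loop.
    client = SSHClient(host, user=USERNAME, password=PASSWORD)

    host_out = client.run_command(cmd)
    # Read the command's stdout line by line.
    for line in host_out.stdout:
        print(line)
    print('Disconnecting')
    client.disconnect()
    gevent.sleep(0.1)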

The error appears in fewer than 50 attempts. Oddly, the command output is printed only on every other pass through the loop:

Disconnecting
Connecting attempt 26
Disconnecting
Connecting attempt 27
total 106
dr-xr-xr-x.  24 root root  4096 Oct  5  2022 .
dr-xr-xr-x.  24 root root  4096 Oct  5  2022 ..
<snipped for brevity>
drwxr-xr-x.  17 root root  4096 Jan 24  2018 var
drwxr-xr-x.   6 root root  4096 Apr  1  2022 work
Disconnecting
python3: ev_epoll.c:153: epoll_modify: Assertion `("libev: I/O watcher with invalid fd found in epoll_ctl", errno != EBADF && errno != ELOOP && errno != EINVAL)' failed.

In other testing, we found that if we don't read the output at all, the epoll error goes away, but output.exit_code is sometimes None even after a call to wait_finished(output).
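
One way to make the missing exit code visible where it occurs, rather than letting a None propagate, is to check it explicitly right after waiting. This is only a diagnostic sketch: require_exit_code is a hypothetical helper, not part of parallel-ssh, and it assumes nothing beyond the wait_finished() call and exit_code attribute already mentioned above.

```python
def require_exit_code(client, host_out):
    """Wait for the command to finish and return its exit code.

    Hypothetical helper (not part of parallel-ssh). It raises instead of
    returning None, so the intermittent case is caught at the call site.
    """
    # Block until the client reports that the command has completed.
    client.wait_finished(host_out)
    exit_code = host_out.exit_code
    if exit_code is None:
        raise RuntimeError('exit_code is still None after wait_finished()')
    return exit_code
```

This does not explain why the exit code is missing; it only turns the symptom into an immediate, loggable error.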

In the full script (which I cannot post here), adding a sleep(1) after the disconnect seems to clear up the issue. The same sleep does not help in the example above, though.

Most of the examples I could find do not show explicit disconnect() calls, but these are necessary to ensure proper cleanup for long-running processes. Is there a safe way to disconnect that I'm missing, or is something not being cleaned up properly internally?
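
For reference, the ordering I would expect to be safe looks roughly like the sketch below: drain stdout, wait for the command to finish, and only then disconnect, with the disconnect in a finally block so it always runs. run_and_disconnect and client_factory are hypothetical names of my own, and I have not established that this ordering avoids the libev assertion.

```python
def run_and_disconnect(client_factory, cmd):
    """Run cmd on a fresh client, then disconnect.

    Hypothetical helper: client_factory is any zero-argument callable
    returning an object with run_command(), wait_finished() and
    disconnect(), such as a lambda that builds an SSHClient.
    """
    client = client_factory()
    try:
        host_out = client.run_command(cmd)
        # Drain stdout completely before touching the connection.
        lines = list(host_out.stdout)
        # Wait for the command to finish before reading the exit code.
        client.wait_finished(host_out)
        return lines, host_out.exit_code
    finally:
        # Always disconnect, even if reading the output raised.
        client.disconnect()
```

With the script above, this would be called as run_and_disconnect(lambda: SSHClient(host, user=USERNAME, password=PASSWORD), cmd).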

Expected behavior

The session should disconnect cleanly without crashing the interpreter.

Actual behaviour

The interpreter intermittently crashes after disconnect() with the libev assertion failure shown above.

Additional information

python3.6
python-gevent 22.10.2
python-greenlet 2.0.2
parallel-ssh 2.12.0
python3-ssh2-python 1.0.0.0
libssh2 1.9.0-5