ParallelSSH / parallel-ssh

Asynchronous parallel SSH client library.
https://parallel-ssh.org
GNU Lesser General Public License v2.1

Quickly completed commands sometimes stall #344

Closed · rward-ddn closed this issue 2 years ago

rward-ddn commented 2 years ago

I've done some work to convert from paramiko to parallel-ssh, but have hit an issue where I'm sometimes seeing very short commands stall and take several minutes to complete. In this case, we're running cat on a small fio config file. I've been able to reproduce this with a simple script, getting results like this:

Started at 2022-04-25 09:01:52.893054, ended at 2022-04-25 09:01:53.672184, total time is 0:00:00.779130
Started at 2022-04-25 09:01:54.592475, ended at 2022-04-25 09:01:55.372624, total time is 0:00:00.780149
Started at 2022-04-25 09:01:56.312288, ended at 2022-04-25 09:01:57.041410, total time is 0:00:00.729122
Started at 2022-04-25 09:01:57.896660, ended at 2022-04-25 09:04:58.563031, total time is 0:03:00.666371

I'm running this on an Ubuntu 20.04 system with the target also being an Ubuntu 20.04 system. I have not seen this issue with commands that take longer to run.

Script:

#!/usr/bin/env python3
'''
Quick script to try to reproduce stall with parallel-ssh
'''

from pssh.clients.native import SSHClient
from pssh import exceptions
from datetime import datetime

hostname = "<target host>"
cmd = "cat /tmp/red-bdev-rand-rw.fio"
stdout = ""
stderr = ""
cmd_timeout = 180.0
login = "<user>"
password = "<password>"
port_num = 22
connect_retry_count = 3
keyfile = "<keyfile>"

client = SSHClient(host=hostname, user=login, password=password, port=port_num,
                   num_retries=connect_retry_count, allow_agent=False,
                   identity_auth=False, pkey=keyfile, timeout=cmd_timeout)

start = datetime.now()
host_out = client.run_command(cmd, use_pty=True, timeout=cmd_timeout)
# Block until the server reports the command has finished executing
client.wait_finished(host_output=host_out)

try:
    for line in host_out.stdout:
        stdout += line
    for line in host_out.stderr:
        stderr += line
    retcode = host_out.exit_code
except exceptions.Timeout as err:
    # May as well pull all available output
    for line in host_out.stdout:
        stdout += line
    for line in host_out.stderr:
        stderr += line
    retcode = host_out.exit_code
    raise AssertionError(f"Command {cmd} timed out on host {hostname} after {cmd_timeout} seconds. "
                                f"Partial output: {stdout} stderr: {stderr}") from err
except Exception as err:
    raise AssertionError(f"Failed in rtfutils with error {err}") from err
finally:
    client.close_channel(channel=host_out.channel)
    done = datetime.now()

print(f"Started at {start}, ended at {done}, total time is {done - start}")

Contents of the red-bdev-rand-rw.fio file:

#red-bdev-rand-rw test
[global]
name=red-bdev-rand-rw
ioengine=${IOENGINE}
filename=${FILENAME}
size=${SIZE}
direct=1
group_reporting=1
thread=1
time_based=1
runtime=90
blocksize_range=4k:3m
rw=randrw
[file1]
iodepth=16
numjobs=1
[file2]
iodepth=16
numjobs=2
[file3]
iodepth=16
numjobs=4
[file4]
iodepth=8
numjobs=8
[file5]
iodepth=4
numjobs=16
[file6]
iodepth=4
numjobs=32
pkittenis commented 2 years ago

Hi there,

Thanks for the interest.

It sounds like the client is retrying the connection after a failed authentication attempt, rather than the command itself stalling.

Retries can be disabled with SSHClient(<..>, num_retries=1) to see what the error from the server is; the delay between retries can also be adjusted with retry_delay. The above script runs fine for me; I ran it 100 times with no stalls:

$ for (( i=0; i<100; i++ )); do python test.py; done
Total time 0.043349 seconds
Total time 0.043384 seconds
Total time 0.043329 seconds
Total time 0.043166 seconds
Total time 0.043997 seconds
<..>
Total time 0.043954 seconds

SSH servers also have limits on the number of sessions they allow, as well as on the number of connection startups they allow at one time. See the MaxStartups, MaxSessions and MaxAuthTries sections of man sshd_config. If the server is refusing logins, check your server logs to see why. PAM may also have its own limits; check authentication logs.
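
For reference, a minimal sketch of those settings (host, user and key are placeholders; num_retries and retry_delay are the constructor parameters mentioned above):

from pssh.clients.native import SSHClient

# num_retries=1 disables retries so the server's error surfaces immediately.
# retry_delay (seconds) adjusts the wait between attempts when retries are on.
client = SSHClient("<target host>", user="<user>", pkey="<keyfile>",
                   num_retries=1, retry_delay=5)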

rward-ddn commented 2 years ago

Hmm, it's not connection retries; I've tried running with enable_debug_logger() and it shows only a single connection.

Connecting to <ip>
Proceeding with private key authentication
Private key is provided as str, loading from private key file path
Authentication completed successfully - setting session to non-blocking mode
Executing command 'b'cat /tmp/red-bdev-rand-rw.fio''
Closing channel
Reading from stdout buffer, timeout=180.0

The stall happens between Executing command and Closing channel. I'll poke around our environment and see if anything seems relevant.
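
(For anyone reproducing this: the debug output above comes from enabling the library's debug logger before connecting. The function is named in the comment above; assuming the 2.x module layout, it lives in pssh.utils:)

from pssh.utils import enable_debug_logger

enable_debug_logger()  # logs connection, auth and channel events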

pkittenis commented 2 years ago

Thank you for the debug output.

The library does not do anything between Executing command <..> and Closing channel other than wait for the server to respond that the command has finished executing.

In your authentication/SSH server logs, you should see lines like (this will vary depending on the system)

Opening session for <username>
Session closed for <username>

PAM logs will have similar lines for the opening of a PTY and there are limits on those as well.

I expect running the same command with ssh -t <hostname> "cat /tmp/red-bdev-rand-rw.fio" behaves the same way, in which case it is an environment issue.

To be closed unless an issue specific to the library can be reproduced.

rward-ddn commented 2 years ago

Closing sounds good. I'm currently thinking this is an issue with the native client somehow triggering a driver issue. I've tried again with the other client and haven't been able to hit the stall, but when running with the native client I see log messages related to a driver we have some known issues with. Since that driver has nothing to do with networking, though, I'm not sure how it would be involved.

pkittenis commented 2 years ago

I've tried again with the other client

Do you mean pssh.clients.ssh.SSHClient or another library?

rward-ddn commented 2 years ago

With pssh.clients.ssh.SSHClient and no other changes.

pkittenis commented 2 years ago

Thanks for the feedback.

That is a difference in behaviour between the two clients, which I consider a bug. The underlying libraries are different, but the purpose of the clients is to normalise that difference in behaviour.

I have a good idea of the cause, and it should be possible to handle it in the native client as well. The two libraries handle the stdout/stderr streams differently, and it looks like that can cause a race condition in the native client when the streams are combined, as they are when a PTY is used.

Thanks for reporting.
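
To illustrate the combined-stream behaviour mentioned above, a minimal sketch (host, user and key are placeholders):

from pssh.clients.native import SSHClient

client = SSHClient("<target host>", user="<user>", pkey="<keyfile>")
# With use_pty=True the remote PTY merges stderr into stdout, so both
# streams arrive interleaved on host_out.stdout.
host_out = client.run_command("echo out; echo err 1>&2", use_pty=True)
for line in host_out.stdout:
    print(line)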

tatokis commented 2 years ago

I believe I am encountering the same issue, however

as they are when a PTY is used

I am not using a pty, and I read stdout and stderr separately.

It seems related to setting any kind of timeout in the SSHClient constructor: if timeout is not specified, I do not observe any stalling. However, I require a timeout, since in my use case the target can stop responding completely. So I dug into the source and tried hacking my way into setting a timeout without triggering the issue, as unfortunately I do not have the time to create a proper patch that fixes the actual problem.

My "solution" is to keep specifying timeout=None in the constructor, but to manually force the timeout here: https://github.com/ParallelSSH/parallel-ssh/blob/5cea5c105bcb8cbac9ce7c5996a57dc681468316/pssh/clients/base/single.py#L549, for example with GTimeout(seconds=15):.

This successfully works around the stalls for me, while still preserving the general timeout in case the SSH server stops responding.
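
A caller-side sketch of the same idea, for anyone who cannot patch the library (this is an assumed equivalent built on gevent's Timeout, not the patch described above; host, user and key are placeholders):

from gevent import Timeout as GTimeout
from pssh.clients.native import SSHClient
from pssh.exceptions import Timeout

# No constructor timeout, so the code path that stalls is never armed.
client = SSHClient("<target host>", user="<user>", pkey="<keyfile>",
                   timeout=None)
host_out = client.run_command("cat /tmp/red-bdev-rand-rw.fio")
try:
    # Bound the read with gevent's Timeout instead; the native client
    # runs on gevent, so this interrupts a blocked read.
    with GTimeout(seconds=15, exception=Timeout):
        output = list(host_out.stdout)
except Timeout:
    output = None  # server stopped responding within 15 seconds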

Hope this helps in some way. If it is actually a different problem, I can open a new issue.

pkittenis commented 2 years ago

That's very helpful for reproducing this; it sounds like the same issue, thank you.

SvanT commented 2 years ago

After playing around a bit with this (really nice library, by the way :) ), I think this issue comes from this line: https://github.com/ParallelSSH/parallel-ssh/blob/master/pssh/clients/base/single.py#L699

Changing this to poller.poll(10) seems to make the stalls go away, and socket timeouts still seem to be triggered in my limited testing. I haven't dug deeper for the root cause, but perhaps it causes some kind of deadlock: if that line waits for 60 seconds, it may be blocking on data that is fetched in the same greenlet, for example.

On a side note: if you are doing really short calls, like touching a file on a local network, the 100ms read sleeps are very high. I get much higher requests-per-second throughput by setting this line https://github.com/ParallelSSH/parallel-ssh/blob/master/pssh/clients/reader.py#L80 to sleep(.001). That may be a CPU/latency tradeoff, but I don't see any high CPU usage when setting these values a bit lower. (There are also two such sleeps in https://github.com/ParallelSSH/parallel-ssh/blob/master/pssh/clients/ssh/single.py.)
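
To make the tradeoff concrete, here is a generic sketch of that polling pattern; the channel API here is hypothetical, not the library's actual internals:

import time

def drain(channel, interval=0.1):
    """Poll a non-blocking channel until EOF, sleeping between empty reads."""
    data = b""
    while not channel.eof():       # hypothetical: True once the remote closes
        chunk = channel.read()     # hypothetical: returns b"" when no data
        if chunk:
            data += chunk
        else:
            # Each empty poll sleeps `interval`, so a short command pays up
            # to `interval` of extra latency and per-connection throughput
            # is capped near 1/interval commands per second. Shrinking the
            # interval trades idle CPU wakeups for lower latency.
            time.sleep(interval)
    return data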

pkittenis commented 2 years ago

On a side note: if you are doing really short calls, like touching a file on a local network, the 100ms read sleeps are very high. I get much higher requests-per-second throughput by setting this line https://github.com/ParallelSSH/parallel-ssh/blob/master/pssh/clients/reader.py#L80 to sleep(.001)

Yes, you are right, it is a CPU/latency tradeoff. I have a branch of performance enhancements I am experimenting with, and the above is one of the changes it makes. There are some very promising results, but I want to do a lot more testing in real-world environments before merging those changes. Watch out for updates soon.

On this issue: thank you for the investigation; I have been able to replicate it.

pkittenis commented 2 years ago

2.11.1 resolves this issue. It was indeed a race condition when a timeout was used in the native client.
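
For anyone landing here later, the fix ships with the parallel-ssh package on PyPI, so upgrading is just:

pip install --upgrade "parallel-ssh>=2.11.1"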

Thanks for the investigation @SvanT