Closed: rward-ddn closed this issue 2 years ago
Hi there,
Thanks for the interest.
It sounds like the client is doing connection retries after a failed authentication, rather than the command stalling. Retries can be disabled with `SSHClient(<..>, num_retries=1)` to see what the error from the server is; the delay between retries can also be adjusted with `retry_delay`. The above script runs fine for me, ran it 100 times with no stalls:
```
$ for (( i=0; i<100; i++ )); do python test.py; done
Total time 0.043349 seconds
Total time 0.043384 seconds
Total time 0.043329 seconds
Total time 0.043166 seconds
Total time 0.043997 seconds
<..>
Total time 0.043954 seconds
```
SSH servers also have limits on the number of sessions they allow, as well as the number of startups they allow at one time. See the `MaxStartups`, `MaxSessions` and `MaxAuthTries` sections of `man sshd_config`. Check your server logs to see why the server is not allowing a login, if that is the case. PAM may also have its own limits; check the authentication logs.
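For reference, the directives mentioned live in `sshd_config` (usually `/etc/ssh/sshd_config`); the values shown below are the OpenSSH defaults, and your server's config may differ:

```
# Max concurrent unauthenticated connections, as start:rate:full
MaxStartups 10:30:100
# Max multiplexed sessions per network connection
MaxSessions 10
# Max authentication attempts permitted per connection
MaxAuthTries 6
```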
Hmm, it's not the connection retries; I've tried running with `enable_debug_logger()` and it only shows the single connection:
```
Connecting to <ip>
Proceeding with private key authentication
Private key is provided as str, loading from private key file path
Authentication completed successfully - setting session to non-blocking mode
Executing command 'b'cat /tmp/red-bdev-rand-rw.fio''
Closing channel
Reading from stdout buffer, timeout=180.0
```
The stall happens between `Executing command` and `Closing channel`. I'll poke around our environment stuff and see if anything seems relevant.
Thank you for the debug output.
The library does not do anything between `Executing command <..>` and `Closing channel` other than wait for the server to respond that the command has finished executing.
In your authentication/SSH server logs, you should see lines like these (the exact wording varies by system):

```
Opening session for <username>
Session closed for <username>
```
PAM logs will have similar lines for the opening of a PTY and there are limits on those as well.
I expect running the same command with `ssh -t <hostname> "cat /tmp/red-bdev-rand-rw.fio"` behaves the same, in which case it is an environment issue.
To be closed unless an issue specific to the library can be reproduced.
Closing looks good. I'm currently thinking this is an issue with the native client somehow tickling a driver issue. I've tried again with the other client and haven't been able to hit the stall, but when running with the native client I see log messages related to a driver we have some known issues with. Since that driver has nothing to do with networking, though, I'm not sure how it could be involved.
> I've tried again with the other client

Do you mean `pssh.clients.ssh.SSHClient` or another library?
With `pssh.clients.ssh.SSHClient` and no other changes.
Thanks for the feedback.
That is a difference in behaviour between the two clients, which I consider a bug. The underlying libraries are different, but the purpose of the clients is to normalise that different behaviour.
I have a good idea of the cause, and it should be possible to handle it in the native client as well. The two libraries handle the stdout/stderr streams differently, and it looks like that can cause a race condition in the native client when the streams are combined, as they are when a PTY is used.
Thanks for reporting.
I believe I am encountering the same issue, however:

> as they are when a PTY is used

I am not using a PTY, and I read stdout and stderr separately.
It seems related to setting any kind of timeout in the SSHClient constructor. If timeout is not specified, I do not observe any stalling. However, since I require a timeout (in my use case the target can stop responding completely), I dug into the source and tried to hack my way into setting a timeout for my use case without triggering the issue (unfortunately I do not have the time to create a proper patch that fixes the actual issue).
My "solution" is to keep specifying `timeout=None` in the constructor, but to manually force the timeout here: https://github.com/ParallelSSH/parallel-ssh/blob/5cea5c105bcb8cbac9ce7c5996a57dc681468316/pssh/clients/base/single.py#L549, for example with `GTimeout(seconds=15):`.
This successfully works around the stalls for me, while still preserving the general timeout in case the ssh server stops responding.
Hope this helps in some way. If it is actually a different problem, I can open a new issue.
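For illustration: parallel-ssh's clients run on gevent, and `GTimeout` above is gevent's `Timeout` context manager, which raises inside the blocked greenlet when the deadline expires. A rough stdlib analogue of the same scoped-deadline idea (not how the library implements it, just a sketch of the pattern of putting one overall deadline on a read rather than a per-operation socket timeout):

```python
import concurrent.futures
import time

def blocking_read():
    # Stand-in for a remote read that takes too long because the
    # server has stopped responding.
    time.sleep(0.5)
    return b"data"

executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
future = executor.submit(blocking_read)
try:
    # The deadline applies to the whole read, not each socket operation.
    result = future.result(timeout=0.1)
    timed_out = False
except concurrent.futures.TimeoutError:
    timed_out = True
executor.shutdown(wait=True)
# timed_out is True: the caller regains control after ~0.1s
```

The key property, which the `GTimeout` workaround also has, is that the deadline still fires even if the underlying operation never returns.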
That's very helpful in trying to reproduce this, sounds like it's the same issue, thank you.
After playing around a bit with this (really nice library btw :) ) I think this issue comes from this line: https://github.com/ParallelSSH/parallel-ssh/blob/master/pssh/clients/base/single.py#L699
Changing this to `poller.poll(10)` seems to make the stalls go away, and socket timeouts still seem to be triggered in my limited testing. I haven't dug deeper for the root cause, but maybe it causes some kind of deadlock: the call waits up to 60s on that line for data that is fetched in the same greenlet.
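A `poll()` timeout is given in milliseconds, and a bounded value returns control to the caller even when no data has arrived, which is the behaviour the one-line change above relies on. A minimal stdlib sketch of that semantics (illustrative only, not parallel-ssh code):

```python
import os
import select
import time

# A pipe whose read end initially has no data available.
r, w = os.pipe()
poller = select.poll()
poller.register(r, select.POLLIN)

# With nothing to read, a bounded timeout returns an empty event list
# after ~10ms instead of blocking indefinitely.
start = time.monotonic()
empty = poller.poll(10)  # timeout in milliseconds
waited = time.monotonic() - start

# Once data arrives, poll() reports the fd as readable immediately.
os.write(w, b"x")
ready = poller.poll(10)  # the read end is now reported readable

os.close(r)
os.close(w)
```

An event loop built on such a bounded poll wakes up periodically and can notice work done elsewhere, whereas a long unconditional wait cannot.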
On a sidenote, if you are doing really short calls, like touching a file on a local network, the 100ms read sleeps are very high. I get much higher requests-per-second throughput by setting this line https://github.com/ParallelSSH/parallel-ssh/blob/master/pssh/clients/reader.py#L80 to `sleep(.001)`, but that might be a CPU/latency tradeoff. However, I don't see any high CPU usage when setting these values a bit lower (there are also two such sleeps in https://github.com/ParallelSSH/parallel-ssh/blob/master/pssh/clients/ssh/single.py).
> On a sidenote, if you are doing really short calls like touching a file on a local network or so, the 100ms read sleeps are very high, I get a lot higher requests per seconds throughput by setting this line https://github.com/ParallelSSH/parallel-ssh/blob/master/pssh/clients/reader.py#L80 to sleep(.001)
Yes, you are right, it is a CPU/latency tradeoff. I have a branch with performance enhancements I am experimenting with and the above is one of the changes it makes. There are some very promising results, but I want to do a lot more testing with real-world environments before merging those changes. Watch out for updates soon.
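The tradeoff can be sketched with a toy benchmark (illustrative only; exact counts depend on OS timer granularity):

```python
import time

def polls_per_window(interval, window=0.2):
    """Count how many poll-and-sleep iterations fit into `window` seconds."""
    count = 0
    deadline = time.monotonic() + window
    while time.monotonic() < deadline:
        time.sleep(interval)  # simulated read-loop sleep
        count += 1
    return count

slow = polls_per_window(0.1)    # 100ms sleep, as in reader.py
fast = polls_per_window(0.001)  # 1ms sleep, as suggested above
# fast is far larger than slow: a shorter sleep adds less latency to
# each short command, at the cost of more wakeups (potential CPU use)
```

With a 100ms sleep, a command that completes in a few milliseconds still pays up to one full sleep interval of added latency per read, which dominates the runtime of very short commands.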
On this issue, thank you for the investigation, have been able to replicate.
Version `2.11.1` resolves this issue. It was indeed a race condition when timeout was used in the native client.
Thanks for the investigation @SvanT
I've done some work to convert from paramiko to parallel-ssh, but have hit an issue where I'm sometimes seeing very short commands stall and take several minutes to complete. In this case, we're running cat on a small fio config file. I've been able to reproduce this with a simple script, getting results like this:
I'm running this on an Ubuntu 20.04 system with the target also being an Ubuntu 20.04 system. I have not seen this issue with commands that take longer to run.
Script:
Contents of the red-bdev-rand-rw.fio file: