delano / rye

Safe, parallel access to Unix shells from Ruby
http://delano.github.com/rye
MIT License

Long running SSH connections causing exceptions/failures #38

Open CpuID opened 11 years ago

CpuID commented 11 years ago

I have some long-running scripts that hold SSH connections open: Rye boxes keep the SSH connection open persistently from first use through to script completion (unless you call disconnect in between).

I seem to get the below after a few minutes:

Failure on thread threadname: connection closed by remote host
/var/lib/gems/1.8/gems/net-ssh-2.6.8/lib/net/ssh/transport/packet_stream.rb:87:in `next_packet': connection closed by remote host (Net::SSH::Disconnect)
    from /var/lib/gems/1.8/gems/net-ssh-2.6.8/lib/net/ssh/transport/session.rb:172:in `poll_message'
    from /var/lib/gems/1.8/gems/net-ssh-2.6.8/lib/net/ssh/transport/session.rb:167:in `loop'
    from /var/lib/gems/1.8/gems/net-ssh-2.6.8/lib/net/ssh/transport/session.rb:167:in `poll_message'
    from /var/lib/gems/1.8/gems/net-ssh-2.6.8/lib/net/ssh/connection/session.rb:454:in `dispatch_incoming_packets'
    from /var/lib/gems/1.8/gems/net-ssh-2.6.8/lib/net/ssh/connection/session.rb:216:in `preprocess'
    from /var/lib/gems/1.8/gems/net-ssh-2.6.8/lib/net/ssh/connection/session.rb:200:in `process'
    from /var/lib/gems/1.8/gems/net-ssh-2.6.8/lib/net/ssh/connection/session.rb:164:in `loop'
    from /var/lib/gems/1.8/gems/net-ssh-2.6.8/lib/net/ssh/connection/session.rb:164:in `loop_forever'
    from /var/lib/gems/1.8/gems/net-ssh-2.6.8/lib/net/ssh/connection/session.rb:164:in `loop'
    from /var/lib/gems/1.8/gems/rye-0.9.8/lib/rye/box.rb:717:in `disconnect'
    from /usr/lib/ruby/1.8/timeout.rb:67:in `timeout'
    from /var/lib/gems/1.8/gems/rye-0.9.8/lib/rye/box.rb:716:in `disconnect'
    from /var/lib/gems/1.8/gems/rye-0.9.8/lib/rye/box.rb:142:in `initialize'

I have checked the server side, and the auth.log contains:

Sep 17 03:26:54 XXX sshd[31216]: Timeout, client not responding.
Sep 17 03:26:54 XXX sshd[31197]: pam_unix(sshd:session): session closed for user XXX

delano commented 11 years ago

If there's no output sent by or received by the client, it's possible that the ssh server or some point along the network is closing the connection.

Is the script executing a bunch of commands or just calling a couple of really slow ones?

CpuID commented 11 years ago

Couple of really slow ones, in this case a tar xjf on a 20-30GB file.

One thing to note, I am calling the Rye box from within a Thread.new { } block, mainly because I perform an rsync then an untar within the same thread (3-4 boxes at a time, each one waits for its rsync to finish independently, then Rye is called to perform the untar straight after).

So it is possible one box might be running a command and waiting, another box might have a connection open but doing absolutely nothing in a different thread.

Ideally if I can enable some kind of keep-alive that would be the preferred option so I can move forward... :)

delano commented 11 years ago

Two things:

CpuID commented 11 years ago

Yea I use Rye sets already actually :) The main thing here is I need to use ruby-rsync within the thread in addition to Rye calls to boxes, hence the use of a separate wrapper Thread.new call.

I see what you mean, to force some output to be generated. Ideally though having SSH keep-alive would be nicer :)
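For reference, a minimal sketch of what client-side keep-alive options might look like. Newer net-ssh releases accept `:keepalive` and `:keepalive_interval` in `Net::SSH.start`; the 2.6.8 version in the trace above may not recognise them. Whether Rye forwards these options unchanged to Net::SSH is an assumption here, and `keepalive_opts`, the host name, and the user name are hypothetical.

```ruby
# Build an options hash with SSH keep-alive settings.
# :keepalive and :keepalive_interval are Net::SSH.start options in
# newer net-ssh releases; the helper itself is hypothetical.
def keepalive_opts(interval = 30, extra = {})
  { :keepalive => true, :keepalive_interval => interval }.merge(extra)
end

opts = keepalive_opts(30, :user => 'deploy')

# Assumption: Rye::Box.new passes options it does not recognise
# through to Net::SSH.start, so this would enable keep-alives:
#   rbox = Rye::Box.new('host.example.com', opts)
```

The interval would need to be shorter than whatever idle timeout the server or intermediate network devices enforce.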

CpuID commented 11 years ago

In the end I think I solved it. I did a disconnect on all boxes that I am not using during the long running segment. I think I had some boxes sitting there idle, and they were causing havoc with the ones that were in use.
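A rough sketch of that workaround, assuming each box responds to `disconnect` (as `Rye::Box` does). `FakeBox` and `disconnect_idle` are hypothetical names used only so the example runs without a real SSH server.

```ruby
# Hypothetical stand-in for Rye::Box so the sketch needs no network.
class FakeBox
  attr_reader :host, :connected

  def initialize(host)
    @host = host
    @connected = true
  end

  def disconnect
    @connected = false
  end
end

# Disconnect every box that is not needed during the long-running step.
# Boxes are compared by object identity, so only the exact objects
# listed in `active` stay connected. Returns the boxes that were closed.
def disconnect_idle(boxes, active)
  idle = boxes.reject { |box| active.any? { |a| a.equal?(box) } }
  idle.each { |box| box.disconnect }
  idle
end

boxes = %w[web1 web2 web3].map { |h| FakeBox.new(h) }
busy  = [boxes[0]]

closed = disconnect_idle(boxes, busy)
# Only web1 is still connected here; the slow command would run on it now.
```

Closing idle connections up front avoids leaving sessions open that nothing is servicing while the slow command runs elsewhere.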