Open briandfoy opened 1 year ago
I've found the bug and fixed it as described below. I don't think this is the best way to fix it, but it did prove my theory about how the bug was occurring.
The problem is caused by the interaction between drain_outgoing (in Channel.pm) and client_loop (in SSH2.pm).
If your $stdout is bigger than remote_maxpacket, drain_outgoing calls client_loop repeatedly until the remaining length is reduced to zero. The problem is that client_loop does not return to drain_outgoing once it is called from inside the while loop a second time.
MY FIX: I reasoned that if I could force the client_loop to execute one-and-only-one-time, for each time it was called from the drain_outgoing while loop, there would be no problem. To test this, I did the following:
1.) drain_outgoing: added the following line before the while loop
$c->{ssh}->{DoOneLoop}=1;
2.) drain_outgoing: added the following line after the while loop
undef($c->{ssh}->{DoOneLoop});
3.) client_loop: added the following line just before the end of the first while loop
last if($ssh->{DoOneLoop});
This seems to have fixed it for all scenarios that I've tested.
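For anyone trying to picture the three changes together, here is a toy model of the flag handshake. It is only a sketch: %ssh, client_loop() and drain_outgoing() below are stand-ins written for illustration, not the Net::SSH::Perl code, and the 32768 chunk size is just an assumed value for remote_maxpacket.

```perl
use strict;
use warnings;

# Toy model only: these are stand-ins for illustration, not the real
# Net::SSH::Perl subs of the same name.
my %ssh = (pending => 0, passes => 0);

# Stand-in event loop: keeps running while work is pending,
# unless the DoOneLoop flag asks it to stop after one pass.
sub client_loop {
    my ($ssh) = @_;
    while ($ssh->{pending} > 0) {
        $ssh->{pending}--;          # pretend one packet was written
        $ssh->{passes}++;
        last if $ssh->{DoOneLoop};  # step 3: early exit
    }
}

# Stand-in chunked sender: one client_loop call per chunk.
sub drain_outgoing {
    my ($ssh, $data, $max_packet) = @_;
    my $calls = 0;
    $ssh->{DoOneLoop} = 1;          # step 1: set before the while loop
    while (length $data) {
        substr($data, 0, $max_packet, '');  # 4-arg substr removes one chunk
        $ssh->{pending}++;
        client_loop($ssh);
        $calls++;
    }
    delete $ssh->{DoOneLoop};       # step 2: clear after the while loop
    return $calls;
}

my $calls = drain_outgoing(\%ssh, 'x' x 100_000, 32_768);
print "client_loop calls: $calls, passes: $ssh{passes}\n";
```

The point of the model is that each call from the while loop does exactly one pass and then hands control back, and the flag is gone again once the data has been drained.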
If you know of a better solution, or know how to get this into the official distribution, please email me: craig_at_lucent_dot_co
On Monday August 15, 2005 07:14, Guest via RT wrote:
Full context and any attached attachments can be found at: <URL: https://rt.cpan.org/Ticket/Display.html?id=7910 >
...
If you know of a better solution, or know how to get this into the official distribution, please email me: craig_at_lucent_dot_com
Patches can be submitted to the list or to me (DBROBINS at cpan.org) directly.
Please include:
I'll review it, and either apply it (possibly with changes) or get back to you with suggested changes.
Thanks,
-- Dave Isa. 40:31
[guest - Mon Aug 15 10:14:33 2005]:
I've found the bug, and have fixed it as described below. I don't think this is the best way to fix it, but it did prove my theory on how the bug was occurring. [excellent fix snip]
This seems to have fixed it for all scenarios that I've tested.
There are two scenarios where this isn't totally fixed:
In these cases, with the DoOneLoop fix, the select() call inside client_loop hangs indefinitely with a $stdin to cmd() that is greater than 32768B. More interestingly, it appears that data on STDIN is put onto the resulting channel which is then put into the fd on the remote peer.
We've implemented the above, with the following additional changes in SSH2.pm, Constants.pm, and Channel.pm:
diff /opt/rcs/lib/Net/SSH/Perl/Channel.pm /opt/rcs/os_deployment/lib/Net/SSH/Perl/Channel.pm
191a192,193
> ## COVD FIX:
> $c->{ssh}->{DoOneLoop} = 10;
195a198,200
> ## COVD FIX:
> undef( $c->{ssh}->{DoOneLoop} );
> delete $c->{ssh}->{DoOneLoop};
diff /opt/rcs/lib/Net/SSH/Perl/Constants.pm /opt/rcs/os_deployment/lib/Net/SSH/Perl/Constants.pm
138c138
< 'MAX_PACKET_SIZE' => 256000,
---
> 'MAX_PACKET_SIZE' => 8192,
diff /opt/rcs/lib/Net/SSH/Perl/SSH2.pm /opt/rcs/os_deployment/lib/Net/SSH/Perl/SSH2.pm
302c303,306
< my($rready, $wready) = $select_class->select($rb, $wb);
---
> ## COVD FIX:
> $ssh->debug("Instantiating a select with $ssh->{DoOneLoop} second timeout.")
> if (exists $ssh->{DoOneLoop} && $ssh->{DoOneLoop} > 0);
> my($rready, $wready) = $select_class->select($rb, $wb, undef,($ssh->{DoOneLoop} || undef));
313a318,319
> ## COVD FIX:
> last if ($ssh->{DoOneLoop});
This workaround appears to function correctly with any file size for SSH implementations identifying themselves as SSH-1.99-OpenSSH_3.7.1p2, SSH-1.99-Sun_SSH_1.0.1, and SSH-1.99-Sun_SSH_1.1.
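The SSH2.pm change relies on the fourth argument to select being a timeout. Here is a minimal standalone sketch of that behaviour, assuming $select_class is IO::Select (or something with the same interface); the pipe and the 0.2 second value are made up for the demonstration and stand in for the real socket and the DoOneLoop value.

```perl
use strict;
use warnings;
use IO::Select;

# A pipe with nothing written to it: the read end never becomes ready.
pipe(my $reader, my $writer) or die "pipe failed: $!";

my $rb = IO::Select->new($reader);

# Hypothetical stand-in for the $ssh->{DoOneLoop} value used as a timeout.
my $timeout = 0.2;

# IO::Select->select(READ, WRITE, EXCEPTION, TIMEOUT) returns an empty list
# on timeout; with an undef timeout it blocks until a handle is ready.
my @ready = IO::Select->select($rb, undef, undef, $timeout);
my $timed_out = @ready ? 0 : 1;
print $timed_out ? "timed out\n" : "handle ready\n";

# Once data is available, the same call returns without waiting out the timeout.
syswrite($writer, "x");
my ($rready) = IO::Select->select($rb, undef, undef, $timeout);
my $ready_count = $rready ? scalar @$rready : 0;
print "ready handles: $ready_count\n";
```

With an undef timeout the first call would never return, which matches the hang described above.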
I'm pretty sure this shouldn't make it into an official patch; I'm just documenting this for anyone else bitten by this. ;)
-jgilbert.
Folks-
Not only is this bug appearing on many different platforms, but the fix I proposed earlier (and the subsequent jgilbert fix) doesn't work on some of them with v1.30 (ppc-linux in particular).
I've written a Perl script that triggers the bug and identifies just how many characters your system needs to trigger it. Can you try running it to see if you have the bug, and post your results here? I've also sent this to Dave Robins to see if he can figure this bug out once and for all.
My output is below. I'm attaching the perl test script.
Thanks -Craig
Net::SSH::Perl bug test...
host=nwsgpb user=watchmrk remoteCmd=cat
xferChars=932779|-Error, down to 466390 chars
Received disconnect message: Corrupted MAC on input.
xferChars=466390|-Error, down to 233195 chars
Received disconnect message: Corrupted MAC on input.
xferChars=233195|-Error, down to 116598 chars
Received disconnect message: Corrupted MAC on input.
xferChars=116598|-Error, down to 58299 chars
Received disconnect message: Corrupted MAC on input.
xferChars=58299|-Error, down to 29150 chars
alarm handler 10 second timeout
xferChars=29150|-Error, down to 14575 chars
alarm handler 10 second timeout
xferChars=14575|-Error, down to 7288 chars
alarm handler 10 second timeout
xferChars=7288|-OK, back up to 10931 chars
xferChars=10931|-OK, back up to 12753 chars
xferChars=12753|-Error, down to 11842 chars
alarm handler 10 second timeout
xferChars=11842|-Error, down to 11387 chars
alarm handler 10 second timeout
xferChars=11387|-OK, back up to 11614 chars
xferChars=11614|-Error, down to 11501 chars
alarm handler 10 second timeout
xferChars=11501|-OK, back up to 11557 chars
xferChars=11557|-Error, down to 11529 chars
Received disconnect message: Corrupted MAC on input.
xferChars=11529|-OK, back up to 11543 chars
xferChars=11543|-Error, down to 11536 chars
Received disconnect message: Corrupted MAC on input.
xferChars=11536|-Error, down to 11533 chars
alarm handler 10 second timeout
xferChars=11533|-OK, back up to 11534 chars
xferChars=11534|-OK, back up to 11535 chars
xferChars=11535|-OK, back up to 11535 chars
Report for remote host: nwsgpb...
xferChars=11535 works, xferChars=11536 hangs on timeout=10.
exiting...
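The search pattern in that output is essentially a bisection between a size that works and a size that fails. A rough sketch of the idea follows; it is not the attached script. transfer_ok() is a hypothetical stub that pretends anything above 11535 characters fails, where a real probe would run the remote command under an alarm() timeout.

```perl
use strict;
use warnings;

# Hypothetical probe: true if a transfer of $n characters succeeds.
# This stub just hard-codes a cut-off for the demonstration.
sub transfer_ok {
    my ($n) = @_;
    return $n <= 11_535;
}

# Bisect between a size assumed to work ($lo) and one assumed to fail ($hi).
sub find_threshold {
    my ($lo, $hi, $probe) = @_;
    while ($hi - $lo > 1) {
        my $mid = $lo + int(($hi - $lo) / 2);
        if ($probe->($mid)) { $lo = $mid } else { $hi = $mid }
    }
    return ($lo, $hi);
}

my ($works, $fails) = find_threshold(0, 932_779, \&transfer_ok);
print "xferChars=$works works, xferChars=$fails fails\n";
```

Bisection assumes that every size below the threshold works and every size above it fails. The output above shows two different failure modes (timeout and corrupted MAC), so on a real host the result may only be approximate.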
This ticket needs to be re-opened, as I've encountered the bug again on version 1.34. I've verified both the bug and the DoOneLoop fix under Windows (XP), Mac OS X, and Solaris. However, the included detection script (as-is) does not catch the bug anymore. If you modify the high value to 98304 or below, it will catch the bug when you connect to a Solaris box running SSH-2.0-Sun_SSH_1.1.3 (and possibly others).
This ticket was imported from rt.cpan.org 7910
This is probably not a bug but a problem with my older version of OpenSSH on the client machine, but:
will cause the process to hang when dumping more than say 6K of data.
The following code sample from pscp fixes the problem (file contents in $c):
Debug output including version numbers: