Open GoogleCodeExporter opened 9 years ago
Your problem seems to be the same one reported in this discussion[1].
There is a regression in sshd[2] that causes the ssh thread to get stuck,
and when all the stream-events threads are stuck, the stream-events
command no longer receives events.
This problem is supposed to be fixed in sshd 0.13.0. If you build the latest
stable-2.9 branch, the problem should be fixed since it includes sshd 0.13.0[3].
This will eventually be released as 2.9.2.
Let us know if this fixes your issue.
[1]https://groups.google.com/forum/#!searchin/repo-discuss/stream$20events/repo-discuss/4va1DH520to
[2]https://issues.apache.org/jira/browse/SSHD-348
[3]https://gerrit-review.googlesource.com/#/c/61353/
Original comment by huga...@gmail.com
on 12 Nov 2014 at 2:12
Thanks for the details (very helpful).
I have built a custom version of Gerrit based on v2.9.1 plus the following
patches,
SSHD: Update to 0.13.0
Bump SSHD Mina version to 2.0.8
Bump Bouncycastle version to 1.51
Update EncryptedContactStore to not use deprecated/removed methods
Running validation tests now.
Original comment by burmawal...@gmail.com
on 12 Nov 2014 at 5:24
Any updates? Did it fix your problem?
Original comment by huga...@gmail.com
on 14 Nov 2014 at 2:28
Yusuf, any update on this?
Original comment by david.pu...@sonymobile.com
on 19 Nov 2014 at 2:57
Hi David,
Apologies for the delayed response, but I wanted to give myself some time to
test these patches before declaring the issue fixed. After running the patched
version for almost a week, I couldn't repro the bug, but then again there was
no systematic way to repro the bug on v2.9.1 either.
Can you recommend some test cases that I should run in order to validate these
patches?
Do we have a release date for v2.9.2 where these patches will be officially
released?
Br,
Yusuf
Original comment by burmawal...@gmail.com
on 19 Nov 2014 at 8:53
We don't experience the error, so we also don't know any way to reproduce it.
2.9.2 is pending verification that the SSHD update fixes the issue, and there
is also a fix for the primary key order that needs to be included. Hoping it
will be within the next couple of weeks. 2.10 should follow not long after.
Original comment by david.pu...@sonymobile.com
on 19 Nov 2014 at 8:56
We upgraded yesterday from 2.9.1 to 2.9.2 to get this issue solved, but it
still seems to be there.
We restarted Gerrit today at 2:35pm and the ssh connections have been piling up
since then. Now, at 5:11pm, "lsof -n -i :29418|grep -c ESTABLISHED" tells me
there are 77 connections in the ESTABLISHED state. They all connect to our
Jenkins servers.
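To see which hosts hold the connections, rather than only the total, the lsof
output can be grouped by remote address. The following is a minimal Python
sketch, not something from this thread: the function name, the sample
addresses, and the sample PID are made up for illustration, and it assumes the
usual "local->remote (ESTABLISHED)" layout that "lsof -n -i" prints for TCP
sockets.

```python
from collections import Counter

# Made-up sample in the layout printed by "lsof -n -i :29418".
SAMPLE = """\
COMMAND  PID   USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
java    4242 gerrit  101u  IPv6 111111      0t0  TCP *:29418 (LISTEN)
java    4242 gerrit  102u  IPv6 111112      0t0  TCP 10.0.0.1:29418->10.0.0.5:40001 (ESTABLISHED)
java    4242 gerrit  103u  IPv6 111113      0t0  TCP 10.0.0.1:29418->10.0.0.5:40002 (ESTABLISHED)
java    4242 gerrit  104u  IPv6 111114      0t0  TCP 10.0.0.1:29418->10.0.0.6:40003 (ESTABLISHED)
"""


def count_established_by_peer(lsof_output):
    """Count ESTABLISHED connections per remote host in lsof text output."""
    peers = Counter()
    for line in lsof_output.splitlines():
        if "(ESTABLISHED)" not in line:
            continue
        for field in line.split():
            if "->" in field:
                # The field looks like local_host:port->remote_host:port.
                remote = field.split("->", 1)[1]
                # Drop the port; rsplit keeps bracketed IPv6 hosts intact.
                host = remote.rsplit(":", 1)[0]
                peers[host] += 1
                break
    return peers


if __name__ == "__main__":
    for host, count in count_established_by_peer(SAMPLE).most_common():
        print(host, count)
```

Comparing the per-host counts with the number of Jenkins instances on each
host shows whether any single client holds more connections than expected.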
In the error_log I only see 25 messages like this since the restart:
[2014-11-29 16:58:41,689] WARN com.google.gerrit.sshd.GerritServerSession : Exception caught
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:197)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
at org.apache.mina.transport.socket.nio.NioProcessor.read(NioProcessor.java:302)
at org.apache.mina.transport.socket.nio.NioProcessor.read(NioProcessor.java:45)
at org.apache.mina.core.polling.AbstractPollingIoProcessor.read(AbstractPollingIoProcessor.java:694)
at org.apache.mina.core.polling.AbstractPollingIoProcessor.process(AbstractPollingIoProcessor.java:668)
at org.apache.mina.core.polling.AbstractPollingIoProcessor.process(AbstractPollingIoProcessor.java:657)
at org.apache.mina.core.polling.AbstractPollingIoProcessor.access$600(AbstractPollingIoProcessor.java:67)
at org.apache.mina.core.polling.AbstractPollingIoProcessor$Processor.run(AbstractPollingIoProcessor.java:1121)
at org.apache.mina.util.NamePreservingRunnable.run(NamePreservingRunnable.java:64)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
But these messages always occurred with this frequency, even before we upgraded
from 2.8.4 to 2.9.1 last Wednesday, on 2014-11-26.
I'm attaching the file jstack.13374.gz produced with "jstack -F <pid>" to get
thread dumps from Gerrit. I couldn't find any stack trace similar to the one
described on https://issues.apache.org/jira/browse/SSHD-348 though. Is there a
better way to get thread dumps? (I don't speak Java, sigh.)
PS: For the record, now, at 5:27pm, there are 83 connections in the ESTABLISHED
state.
Original comment by gustavo@gnustavo.com
on 29 Nov 2014 at 7:27
Attachments:
The issue is/was stuck threads: the ssh library waits for a disconnected
client to empty its buffer. This depleted the stream-events thread pool,
which led to stream events stopping.
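The depletion mechanism can be illustrated with a small Python analogy. This
is a sketch under stated assumptions and is not Gerrit or sshd code: the
function names, the pool size of 2, and the use of an Event to stand in for
"the client empties its buffer" are all hypothetical. Each worker blocks on a
write that never completes, so a task queued afterwards is never picked up
until the workers are released.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, TimeoutError


def demo_pool_depletion(pool_size=2, wait_seconds=0.2):
    """Return (starved, delivered) for a pool whose workers all block."""
    # Stands in for a disconnected client finally emptying its buffer.
    client_drained = threading.Event()

    def stuck_writer():
        # Blocks like a write to a peer that no longer reads.
        client_drained.wait()
        return "unblocked"

    def deliver_event():
        return "event delivered"

    pool = ThreadPoolExecutor(max_workers=pool_size)
    try:
        # Occupy every worker with a blocked write.
        for _ in range(pool_size):
            pool.submit(stuck_writer)
        # This task is queued, but no worker is free to run it.
        pending = pool.submit(deliver_event)
        try:
            pending.result(timeout=wait_seconds)
            starved = False
        except TimeoutError:
            starved = True
        # Releasing the stuck workers lets the queued task run.
        client_drained.set()
        delivered = pending.result(timeout=5)
    finally:
        client_drained.set()
        pool.shutdown(wait=True)
    return starved, delivered


if __name__ == "__main__":
    print(demo_pool_depletion())
```

With every worker blocked, the queued task times out until the workers are
released, which mirrors how events stop flowing once all stream-events
threads are stuck.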
What you are describing could be normal behaviour. Jenkins Gerrit-Trigger (GT)
uses the stream-events command, which only ends when you stop GT, so it is
normal to have a lot of established ssh connections, one per Jenkins that uses
GT. If you have more connections than Jenkins instances with GT, there could
be a problem on the GT side. I remember we had some connection issues at one
point with GT when using its connection watchdog feature, but we fixed them
and all the fixes are included in 2.12.0.
Original comment by huga...@gmail.com
on 1 Dec 2014 at 1:56
I'm discussing this on the mailing list. It seems that the problem is on the
Jenkins side.
https://groups.google.com/d/msg/repo-discuss/NuFti4SVNQM/WAqBaV0bGFAJ
Original comment by gustavo@gnustavo.com
on 1 Dec 2014 at 3:54
Original issue reported on code.google.com by
burmawal...@gmail.com
on 12 Nov 2014 at 8:42