CandyShop / gerrit

Automatically exported from code.google.com/p/gerrit

Stream events abruptly stops with Gerrit 2.9.1 #3013

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
************************************************************
***** NOTE: THIS BUG TRACKER IS FOR GERRIT CODE REVIEW *****
***** DO NOT SUBMIT BUGS FOR CHROME, ANDROID, INTERNAL *****
***** ISSUES WITH YOUR COMPANY'S GERRIT SETUP, ETC.    *****
***** THOSE ISSUES BELONG IN DIFFERENT ISSUE TRACKERS! *****
************************************************************

Affected Version:

What steps will reproduce the problem?
1. The stream-events command abruptly stops delivering events (a minimal way to observe this is sketched below).
2. Restarting Gerrit resolves the problem, but the event stream stops again within a few hours of the restart.
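
A minimal way to observe the hang, assuming a standard setup (host and user below are placeholders):

    # Consume the event stream over ssh; output should continue indefinitely.
    # When the bug triggers, the stream simply goes silent while the TCP
    # connection stays ESTABLISHED.
    ssh -p 29418 user@gerrit.example.com gerrit stream-events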

What is the expected output? What do you see instead?
The stream-events command should keep sending notifications without interruption.

Please provide any additional information below.
Every time stream-events stops responding and we stop Gerrit, we see the
stack trace below in the log:

WARN  com.google.gerrit.server.git.WorkQueue$Lifecycle : Failed to stop
com.google.inject.ProvisionException: Guice provision errors:

1) Error in custom provider, com.google.inject.OutOfScopeException: Not in 
command/request
  while locating com.google.gerrit.sshd.SshScope$ContextProvider
  while locating com.google.gerrit.sshd.SshScope$Context

1 error
        at com.google.inject.internal.InjectorImpl$3.get(InjectorImpl.java:1014)
        at com.google.gerrit.sshd.SshLog.onExecute(SshLog.java:154)
        at com.google.gerrit.sshd.CommandFactoryProvider$Trampoline.log(CommandFactoryProvider.java:212)
        at com.google.gerrit.sshd.CommandFactoryProvider$Trampoline.access$800(CommandFactoryProvider.java:104)
        at com.google.gerrit.sshd.CommandFactoryProvider$Trampoline$2.onExit(CommandFactoryProvider.java:184)
        at com.google.gerrit.sshd.BaseCommand.onExit(BaseCommand.java:305)
        at com.google.gerrit.sshd.commands.StreamEvents.onExit(StreamEvents.java:138)
        at com.google.gerrit.sshd.commands.StreamEvents$3.cancel(StreamEvents.java:87)
        at com.google.gerrit.server.git.WorkQueue$Task.cancel(WorkQueue.java:321)
        at java.util.concurrent.ScheduledThreadPoolExecutor.onShutdown(ScheduledThreadPoolExecutor.java:366)
        at java.util.concurrent.ThreadPoolExecutor.shutdown(ThreadPoolExecutor.java:1393)
        at java.util.concurrent.ScheduledThreadPoolExecutor.shutdown(ScheduledThreadPoolExecutor.java:759)
        at com.google.gerrit.server.git.WorkQueue.stop(WorkQueue.java:154)
        at com.google.gerrit.server.git.WorkQueue.access$000(WorkQueue.java:51)
        at com.google.gerrit.server.git.WorkQueue$Lifecycle.stop(WorkQueue.java:66)
        at com.google.gerrit.lifecycle.LifecycleManager.stop(LifecycleManager.java:88)
        at com.google.gerrit.pgm.Daemon$2.run(Daemon.java:209)
        at com.google.gerrit.pgm.util.RuntimeShutdown$ShutdownCallback.run(RuntimeShutdown.java:86)
Caused by: com.google.inject.OutOfScopeException: Not in command/request
        at com.google.gerrit.sshd.SshScope.requireContext(SshScope.java:153)
        at com.google.gerrit.sshd.SshScope.access$300(SshScope.java:38)
        at com.google.gerrit.sshd.SshScope$ContextProvider.get(SshScope.java:118)
        at com.google.gerrit.sshd.SshScope$ContextProvider.get(SshScope.java:115)
        at com.google.inject.internal.ProviderInternalFactory.provision(ProviderInternalFactory.java:86)
        at com.google.inject.internal.BoundProviderFactory.provision(BoundProviderFactory.java:73)
        at com.google.inject.internal.ProviderInternalFactory.circularGet(ProviderInternalFactory.java:66)
        at com.google.inject.internal.BoundProviderFactory.get(BoundProviderFactory.java:63)
        at com.google.inject.internal.InjectorImpl$3$1.call(InjectorImpl.java:1005)
        at com.google.inject.internal.InjectorImpl.callInContext(InjectorImpl.java:1051)
        at com.google.inject.internal.InjectorImpl$3.get(InjectorImpl.java:1001)

Original issue reported on code.google.com by burmawal...@gmail.com on 12 Nov 2014 at 8:42

GoogleCodeExporter commented 9 years ago
Your problem seems to be the same as the one reported in this discussion[1].

There is a regression in sshd[2] that causes ssh threads to get stuck; once
all the stream-events threads are stuck, the stream-events command no longer
receives events.

This problem is supposed to be fixed in sshd 0.13.0. If you build the latest
stable-2.9 branch, the problem should be gone, since it includes sshd 0.13.0[3].

This will eventually be released as 2.9.2.

Let us know if this fixes your issue.

[1]https://groups.google.com/forum/#!searchin/repo-discuss/stream$20events/repo-discuss/4va1DH520to
[2]https://issues.apache.org/jira/browse/SSHD-348
[3]https://gerrit-review.googlesource.com/#/c/61353/
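
To confirm which version a rebuilt server is actually running, the standard
ssh command can be used (host and user below are placeholders):

    # Prints the running version string, e.g. "gerrit version 2.9.1".
    ssh -p 29418 user@gerrit.example.com gerrit version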

Original comment by huga...@gmail.com on 12 Nov 2014 at 2:12

GoogleCodeExporter commented 9 years ago
Thanks for the details (very helpful).

I have built a custom version of Gerrit based on v2.9.1 plus the following
patches:
- SSHD: Update to 0.13.0
- Bump SSHD Mina version to 2.0.8
- Bump Bouncycastle version to 1.51
- Update EncryptedContactStore to not use deprecated/removed methods

Running validation tests now.
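
For the record, the build was roughly as follows. This is only a sketch: the
change ref and patchset number are placeholders (change 61353 is the one cited
above), and it assumes the Buck-based build the 2.9 series used:

    # Clone Gerrit with its plugin submodules and start from the v2.9.1 tag.
    git clone --recursive https://gerrit.googlesource.com/gerrit
    cd gerrit
    git checkout v2.9.1

    # Fetch and cherry-pick each patch from gerrit-review; the patchset
    # suffix (/1) is a placeholder.
    git fetch https://gerrit-review.googlesource.com/gerrit refs/changes/53/61353/1
    git cherry-pick FETCH_HEAD

    # Build the war with Buck (the build tool for the 2.9 series).
    buck build gerrit    # output: buck-out/gen/gerrit.war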

Original comment by burmawal...@gmail.com on 12 Nov 2014 at 5:24

GoogleCodeExporter commented 9 years ago
Any updates? Did it fix your problem?

Original comment by huga...@gmail.com on 14 Nov 2014 at 2:28

GoogleCodeExporter commented 9 years ago
Yusuf, any update on this?

Original comment by david.pu...@sonymobile.com on 19 Nov 2014 at 2:57

GoogleCodeExporter commented 9 years ago
Hi David,

Apologies for the delayed response, but I wanted to give myself some time to
test these patches before declaring the issue fixed. After running the patched
version for almost a week, I couldn't reproduce the bug; then again, there was
no systematic way to reproduce it on v2.9.1 either.

Can you recommend some test cases that I should run in order to validate these 
patches?

Do we have a release date for v2.9.2 where these patches will be officially 
released?

Br,
Yusuf

Original comment by burmawal...@gmail.com on 19 Nov 2014 at 8:53

GoogleCodeExporter commented 9 years ago
We don't experience the error, so we also don't know of any way to reproduce it.

2.9.2 is pending verification that the SSHD update fixes the issue, and there
is also a fix for the primary key order that needs to be included. Hoping it
will be out within the next couple of weeks. 2.10 should follow not long after.

Original comment by david.pu...@sonymobile.com on 19 Nov 2014 at 8:56

GoogleCodeExporter commented 9 years ago
We upgraded yesterday from 2.9.1 to 2.9.2 to get this issue solved, but it
still seems to be there.

We restarted Gerrit today at 2:35pm and the ssh connections have been piling
up since then. Now, at 5:11pm, "lsof -n -i :29418 | grep -c ESTABLISHED" tells
me there are 77 connections in the ESTABLISHED state. They all come from our
Jenkins servers.
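
To quantify the pile-up over time, a small loop around the same lsof check can
log the count (a sketch; the interval and log path are arbitrary):

    # Record a timestamped count of ESTABLISHED connections on the SSH port
    # once a minute, to see whether it keeps growing.
    while true; do
        echo "$(date '+%F %T') $(lsof -n -i :29418 | grep -c ESTABLISHED)"
        sleep 60
    done >> /tmp/gerrit-ssh-connections.log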

In the error_log I only see 25 messages like this since the restart:

[2014-11-29 16:58:41,689] WARN  com.google.gerrit.sshd.GerritServerSession : 
Exception caught
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
    at sun.nio.ch.IOUtil.read(IOUtil.java:197)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
    at org.apache.mina.transport.socket.nio.NioProcessor.read(NioProcessor.java:302)
    at org.apache.mina.transport.socket.nio.NioProcessor.read(NioProcessor.java:45)
    at org.apache.mina.core.polling.AbstractPollingIoProcessor.read(AbstractPollingIoProcessor.java:694)
    at org.apache.mina.core.polling.AbstractPollingIoProcessor.process(AbstractPollingIoProcessor.java:668)
    at org.apache.mina.core.polling.AbstractPollingIoProcessor.process(AbstractPollingIoProcessor.java:657)
    at org.apache.mina.core.polling.AbstractPollingIoProcessor.access$600(AbstractPollingIoProcessor.java:67)
    at org.apache.mina.core.polling.AbstractPollingIoProcessor$Processor.run(AbstractPollingIoProcessor.java:1121)
    at org.apache.mina.util.NamePreservingRunnable.run(NamePreservingRunnable.java:64)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)

But these messages always occurred with this frequency, even before we upgraded
from 2.8.4 to 2.9.1 last Wednesday, on 2014-11-26.

I'm attaching the file jstack.13374.gz produced with "jstack -F <pid>" to get 
thread dumps from Gerrit. I couldn't find any stack trace similar to the one 
described on https://issues.apache.org/jira/browse/SSHD-348 though. Is there a 
better way to get thread dumps? (I don't speak Java, sigh.)
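
On the thread-dump question, a couple of alternatives, assuming you can signal
the Gerrit process (<pid> as above):

    # A regular (non-forced) jstack attach; -F uses a different mechanism
    # and can yield less detail, so try without it first.
    jstack <pid> > threads.txt

    # Or send SIGQUIT: the JVM prints a full thread dump to its stderr,
    # which for a daemonized Gerrit typically lands in the error log.
    kill -3 <pid>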

PS: For the record, now, at 5:27pm, there are 83 connections in the ESTABLISHED 
state.

Original comment by gustavo@gnustavo.com on 29 Nov 2014 at 7:27

GoogleCodeExporter commented 9 years ago
The issue is/was stuck threads, caused by the ssh library waiting for a
disconnected client to empty its buffer. This depleted the stream-events
thread pool, which led to stream-events stopping.

What you are describing could be normal behaviour. The Jenkins Gerrit Trigger
plugin uses the stream-events command, which only ends when you stop GT, so it
is normal to have a lot of established ssh connections, one per Jenkins
instance that uses GT.

If you have more connections than Jenkins instances with GT, there could be a
problem on the GT side. I remember we had some connection issues at one point
with GT when using its connection watchdog feature, but we fixed them, and all
the fixes are included in 2.12.0.
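
To compare the connection count against the number of Jenkins instances, the
established connections can be grouped by remote host (a sketch based on
lsof's NAME column; IPv6 addresses would need different parsing):

    # Count ESTABLISHED connections on the SSH port per remote address; with
    # GT you would expect roughly one stream-events connection per Jenkins.
    lsof -n -i :29418 | grep ESTABLISHED | awk -F'->' '{print $2}' \
        | cut -d: -f1 | sort | uniq -c | sort -rn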

Original comment by huga...@gmail.com on 1 Dec 2014 at 1:56

GoogleCodeExporter commented 9 years ago
I'm discussing this on the mailing list. It seems the problem is on the
Jenkins side.

https://groups.google.com/d/msg/repo-discuss/NuFti4SVNQM/WAqBaV0bGFAJ

Original comment by gustavo@gnustavo.com on 1 Dec 2014 at 3:54