@peter-perot, we have three issues here:
.NET Runtime Platform stalled
- Orleans uses a timer to detect whether the underlying .NET framework is stalled; a delay of 1 second is considered a stall, which sometimes just means that the GC is running. In your case I'm not sure, as the warning says "We are now using total of 17MB memory. gc=273, 56, 4" (@sergeybykov might know more). And as you have noted, you haven't experienced any poor user experience. Are you running the silos on VMs? Sometimes the clock on a VM runs super-fast or super-slow. Try to look at the logs of a silo that gets no traffic. If the .NET Runtime is actually stalled that frequently, some other evil is going on which probably doesn't stem from Orleans.
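For illustration only (this is not Orleans's actual watchdog code), the detection idea boils down to a timer that expects to fire at a fixed period and flags any tick that arrives much later than scheduled:

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Illustrative stall detector: a 1-second timer measures the gap between
// consecutive ticks; a gap well beyond the period means the process was
// stalled (GC pause, CPU starvation, or a drifting VM clock).
class StallWatchdog
{
    static readonly TimeSpan Period = TimeSpan.FromSeconds(1);
    static readonly Stopwatch Clock = Stopwatch.StartNew();
    static TimeSpan lastTick = Clock.Elapsed;

    static void Main()
    {
        using (var timer = new Timer(CheckTick, null, Period, Period))
        {
            Thread.Sleep(Timeout.Infinite);
        }
    }

    static void CheckTick(object state)
    {
        TimeSpan now = Clock.Elapsed;
        TimeSpan gap = now - lastTick;
        lastTick = now;
        if (gap > Period + Period) // fired far later than scheduled
            Console.WriteLine($"Platform stalled for {gap.TotalMilliseconds:F0} ms");
    }
}
```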
ConnectionLossException
- An error that occurs when the ZK server fails to respond in a timely manner. In Orleans, I've hardcoded the timeout to 2 seconds. I could have made it configurable, but I've found that it's never the problem, just a symptom of a real one, i.e. network issues or ZK server configuration issues. I suggest setting
maxClientCnxns=0
in the zoo.cfg file of all ZK servers, which disables this limit. Before doing that, you can consult the ZK server logs at the times Orleans got the exception. If it's a server issue, you'd probably find some helpful correlated error there.

The org.apache exception in IIS
- For this one, I suggest setting
ZooKeeper.LogToFile = false;
before starting the silo. Or adding the necessary credentials for the app pool to create the file. Cool bug :smile:

Keep me posted.
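As a minimal sketch of where that line goes, assuming a self-hosted silo on the Orleans 1.x SiloHost API (the silo name is a placeholder; adapt to your own host setup):

```csharp
using System;
using org.apache.zookeeper;   // ZooKeeperNetEx client
using Orleans.Runtime.Host;   // Orleans 1.x silo host

class Program
{
    static void Main()
    {
        // Disable the ZK client's trace file before the silo starts,
        // so an app pool without write permissions never tries to create it.
        ZooKeeper.LogToFile = false;

        var silo = new SiloHost("silo1"); // placeholder silo name
        silo.InitializeOrleansSilo();
        silo.StartOrleansSilo();

        Console.WriteLine("Silo started. Press Enter to stop.");
        Console.ReadLine();
        silo.StopOrleansSilo();
    }
}
```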
Hi @shayhatsor, thanks for the fast reply. Yes, the whole thing is running on a VM, so I will ignore the stall messages for now.
Concerning ZooKeeper, I set maxClientCnxns=0 in zoo.cfg and restarted ZooKeeper, but it didn't help. Over the weekend several connection-loss exceptions have been logged by Orleans. Unfortunately I cannot see what ZooKeeper logged at the same time, since ZooKeeper logs approximately 50 messages per minute (DEBUG, INFO, ...) and the log has been automatically truncated. I'm no ZooKeeper expert, since I only use it for the Orleans membership protocol.
Our setup is one server (VM) running one ZooKeeper instance, 5 silos for our QA/testing environment, and 5 silos for our production environment, i.e. 10 silos talking to ZooKeeper (with 2 different deployment IDs of course: one silo farm for QA, one for production).
So I don't know where the connectivity problem with ZooKeeper comes from, given that it's all on the same VM.
> Concerning ZooKeeper, I set maxClientCnxns=0 in zoo.cfg and restarted ZooKeeper, but it didn't help
Yep, in your setup it wouldn't. You have 10 silos talking to ZK, and the default is 10 concurrent connections per IP, so you can remove that setting.
> Unfortunately I cannot see what ZooKeeper logged at the same time, since ZooKeeper logs approximately 50 messages per minute (DEBUG, INFO, ...) and the log has been automatically truncated
You can set the log to show only warnings and above; it's log4j, see the ZooKeeper Administrator's Guide and the sketch below. But I think I know what the problem is, so read on before making the change.
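A sketch of the relevant log4j.properties settings on the ZK server, assuming the stock rolling-file appender (file path and size are placeholders):

```
# Raise the threshold so only WARN and above are written.
zookeeper.root.logger=WARN, ROLLINGFILE
log4j.rootLogger=${zookeeper.root.logger}

log4j.appender.ROLLINGFILE=org.apache.log4j.RollingFileAppender
log4j.appender.ROLLINGFILE.Threshold=WARN
log4j.appender.ROLLINGFILE.File=/var/log/zookeeper/zookeeper.log
log4j.appender.ROLLINGFILE.MaxFileSize=10MB
log4j.appender.ROLLINGFILE.layout=org.apache.log4j.PatternLayout
log4j.appender.ROLLINGFILE.layout.ConversionPattern=%d{ISO8601} - %-5p [%t:%C{1}@%L] - %m%n
```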
> Our setup is one server
That's a high-risk setup: if the machine goes down, production goes down. For production, one silo per machine is advisable; I've had good experience with Nomad for Orleans cluster deployment. For testing/QA the setup is much less critical: you can run a single silo with in-memory membership, a single silo with persistent membership, or use the same setup as production. There's a performance penalty to having 5 silos of the same cluster on the same machine, and almost nothing is gained: even if only one silo process crashes and not the others, a faulty silo that drains resources will still affect the other silos on the machine.
> running one ZooKeeper instance
That's probably the root of this problem. I advise that you move the ZK instance to a different machine; it shouldn't be contending for resources with the silos. Also, the minimum recommended ZK cluster size for production is 5 nodes, on 5 different machines (or VMs with constant resource allocation), e.g. as sketched below. I think the connections between the silos and ZK simply timed out under high load. Also, I've fixed the weird IIS bug you mentioned, just update the ZooKeeperNetEx package.
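A sketch of the replicated zoo.cfg for such a 5-node ensemble (hostnames, ports, and paths are placeholders; each node also needs a matching myid file in dataDir):

```
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888
server.4=zk4.example.com:2888:3888
server.5=zk5.example.com:2888:3888
```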
@peter-perot, I almost forgot: a running Orleans cluster doesn't require 100% responsiveness from the persistent membership storage; Orleans has robust mechanisms for retrying membership operations while keeping the service available. Having said that, I believe that if you apply the suggestions from my previous comment, you'll get a much more coherent picture of what's going on.
@shayhatsor, thanks for your advice. I will configure our servers accordingly and think the connection losses will then be gone. As you said, it's no problem for Orleans if responsiveness is not 100%. And thanks for the quick fix of the NullReferenceException.
I think I have understood the membership protocol and its backing data structure by now, but one thing seems a little weird to me: is it right that Orleans never garbage-collects "dead" entries? When I start a silo, stop it gracefully (with CTRL+C), and restart it, I see a "dead" entry for the previous generation of that silo. When I repeat this several times, I see further "dead" entries for the generations that have been shut down.
My question is: when are those dead entries garbage-collected, i.e. removed from the table/database/KV-store? As far as I can see, no storage provider (e.g. Consul or ZooKeeper) garbage-collects on its own, and the interface for membership control does not contain any clean-up methods (except the DeleteMembershipTableEntries method, which seems never to be called, since it would wipe out all data for a specific deployment ID). Maybe an Orleans quirk?
@peter-perot, you're welcome. About your question: it's an issue that has come up several times in the past. I remember that when I implemented the ZK membership provider I also noticed there's no garbage collection. At first it did seem like a quirk, but in time I figured out that the dead entries are a log of cluster stability. Considering that the table grows very slowly, cleaning it up would actually be a micro-optimization.
@shayhatsor: Yes, it gives evidence about cluster stability. If it gets too large, something must be awfully wrong. And in that case one can clean up the tables manually (e.g. as sketched below).
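For the ZooKeeper provider, such a manual cleanup could look roughly like this with the stock ZK CLI; the deployment ID and child node name below are hypothetical, so inspect your own tree before deleting anything:

```
$ bin/zkCli.sh -server zk1.example.com:2181
# List the membership entries under the (hypothetical) deployment ID.
ls /my-deployment-id
# Remove one dead entry by its child node name (hypothetical).
delete /my-deployment-id/192.168.0.10:11111@155234567
```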
Closing this issue now. Thanks a lot!
I have a strange issue with the ZooKeeper membership provider. We have different production deployments on different servers for different clients, but the issue is always the same. At some point in time (here the trouble starts at 19:31 GMT, when exceptions are caught), the silos start logging the following:
On the client side (a WebAPI running with Katana/Owin) the following is logged:
So far I don't know if the "stalled" warnings have anything to do with this issue. I have configured garbage collection according to the manuals:
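A typical server-GC configuration of this kind, as a sketch (the actual file used here may differ):

```xml
<!-- Sketch: server + non-concurrent GC in the host's App.config,
     per the usual Orleans recommendation; not necessarily the
     exact settings used in this deployment. -->
<configuration>
  <runtime>
    <gcServer enabled="true"/>
    <gcConcurrent enabled="false"/>
  </runtime>
</configuration>
```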
Concerning ZooKeeper, it has been deployed out of the box with no further configuration (only the log path has been adjusted).
It seems that these exceptions have no real impact on the stability experienced by the user unless we host our WebAPI in IIS. In that case the AppPool shuts down after some hours, and the event viewer shows a weird exception from the org.apache namespace:
Any idea what is going on here?