Closed. SebastianStehle closed this issue 6 years ago.
That seems odd. Do you have a memory dump from the server? If the logs can't give us any good insight, maybe a memory dump will, because this looks like something slowly going bad in the server.
Or do you monitor the memory and CPU use on the server? That might be an easier starting point than a memory dump.
Sorry, it is a Kubernetes cluster. No idea how to get a memory dump from the process. I monitor the memory and CPU usage, but there is nothing special.
There was also a short discussion on gitter and @DarkCow said:
I have noticed increasing timeouts on orleans-2.0.0. I have deployed orleans-2.0.0-ci.20180417.4, which includes the replacement threading model by @dVakulen, and that seems to solve the problem.
As far as I can tell, what I am seeing is that a thread will deadlock/stall; it is hard to find which Orleans function is stalling. But the new threading model seems to be a work-stealing queue. #3792 Yes, it has been in production since 4/19. This is the first time I am seeing other people having performance problems after hours of runtime...
Yeah, I went through the pipeline; everything in Orleans is really fast, but something else is stalling the execution. Are you seeing warning logs like Task ... took elapsed time 0:00:00.4702115 for execution, which is longer than 00:00:00.2000000. Running on thread System.Threading.Thread and then Response did not arrive on time in 00:00:30 for message?
I don't see these warning messages for the first 12+ hours of runtime, then they start cropping up. orleans-2.0.0-ci.20180417.4 appears to have solved it and/or masked it...
I am seeing a similar issue. I have a 17-VM cluster running on Windows Server 2012 R2 and .NET Core 2.0. I am using the standard code-gen serializer, ADO.NET for clustering/storage/reminders, and SimpleMessageStreaming for streams.
The issue: After 12+ hours of runtime, Orleans begins to have problems with grain calls stalling. These grain calls are fine at T < 12 hours, but stall at T > 12 hours. Some calls are fast, but they are interleaved with slow grain calls.
Task ... took elapsed time 0:00:00.4702115 for execution, which is longer than 00:00:00.2000000. Running on thread System.Threading.Thread
Response did not arrive on time in 00:00:30 for message ...
Things I've investigated
Return Task.FromResult( State );
Things I've tried
orleans-2.0.0-ci.20180417.4, which is post #3792
We get similar issues to those @DarkCow described, especially timeouts that don't get resolved within 30s. One major thing we observed: when warnings of Queue wait time of xxx start adding up, things begin slowing down and timing out.
Our setup
Docker image microsoft/dotnet:2.0-runtime, which is Linux
Observations/info
When the Queue wait time of ... log warning gets triggered, we generally need to reboot or things keep slowing down
PubSubRendezvousGrain - Timeout exception log
PubSubRendezvousGrain - Queue wait time warning log
Machine network monitoring
Just to bring this in here as suggested in gitter, we have been experiencing these timeouts for a while now: https://github.com/dotnet/orleans/issues/4060
To recap:
8 VMs
1 silo/1 client on each
Client pulls from MSMQ, hardens to file, pushes to grain in silo and awaits processing, deletes file
Clients randomly 'stop' (anywhere from 1-10 days after startup) being able to talk to silos and throw Response did not arrive on time errors
Clients don't all stop at the same time, but when the first one stops, all the others usually stop within 12 hours
This sometimes happens sporadically from within Orleans itself (silo to silo); we can see this when running as a console app.
One of our infrastructure guys mentioned it may be network-related, in that something isn't being cleaned up/released properly. Whether that's ports or something else, I'm not sure.
Is there something we can do to provide you more data?
I'm catching up with this thread.
Just to make sure: what level of logging are you running Orleans at? We generally assume that it's Info, and that Debug is only used during development/testing and for investigations.
I can tell you that all our load tests currently run on full .NET Framework, not on .NET Core. We are going to get to running on .NET Core, but it'll take us some time. That's one variable I see that seems to be consistently different from our test environment.
If you could share memory dumps from silos in the "slow" state, that might provide additional insight.
Until a few days ago I was using Info level, but then I changed it to Warning, because Info does not provide that much helpful information and generates a lot of noise.
@SebastianStehle @DarkCow @stephenlautier do you have log files which you're able to share with us? Above you mentioned that 12h is the point at which this happens, @DarkCow - is it suspiciously close to the 12 hour mark or is that a rough estimate?
A memory dump from a slow/stalled process may also prove useful.
I have now been running orleans-2.0.0-ci.20180417.4 (which is post #3792) for ~2 weeks, and the problem is solved or masked. But I will revert the staging environment back to the worst-case scenario, which was Orleans 2.0.0 and .NET Core 2.0.
Will be back with data, hopefully by Thursday.
@ReubenBond There is no exact correlation with time. It just averages out that every ~12 hours I had to restart my system.
@sergeybykov I have noticed that the problem is solved/masked on .NET Framework compared to .NET Core.
We'll prioritize load testing on .NET Core higher to try to repro this in house. In the meantime, any relevant logs and memory dumps would be very helpful.
The longer the running time, the higher this value gets. When it reaches 30 or so, this problem occurs. The initial value is only about 0.3.
I experienced the same; especially methods like GetTotalActivationCount() are super slow.
I wonder if this is dashboard related? Are people experiencing this without the dashboard enabled?
@richorama we don't have the dashboard installed, so it's not related to that.
@sergeybykov @ReubenBond How is the internal testing of this problem going? Our production system is affected and the situation is urgent. Is there any plan to solve this problem?
@aqa510415008 if the situation is urgent, would you consider trying to use a nightly build from https://dotnet.myget.org/gallery/orleans-ci? @DarkCow has reported that the issue does not appear to affect the nightly builds.
If you are able to help by providing logs or a memory dump then that may help us resolve this sooner.
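For anyone unsure how to consume the nightly feed, a minimal sketch follows. The package name, version, and feed URL are assumptions based on the gallery link above (the version string is the one mentioned earlier in this thread); check the gallery page for the actual latest CI build.

```shell
# Hypothetical sketch: installing a nightly Orleans CI build from MyGet.
# The version below is a placeholder taken from earlier in this thread;
# pick the newest one listed on the gallery page.
dotnet add package Microsoft.Orleans.Server \
    --version 2.0.0-ci.20180417.4 \
    --source https://dotnet.myget.org/F/orleans-ci/api/v3/index.json
```

Alternatively, the feed URL can be added as a package source in NuGet.config so that it applies to the whole solution.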
I ran the TestHost project in the OrleansDashboard project overnight. It's an Orleans 2.0.0, 3 silo local cluster with memory providers, running on an i7 quad core with 16GB RAM.
At the start of the run you can see calls to the management grain take <1ms.
After ~11 hours, the management grain is averaging 24ms per request.
It's interesting to note that other grains are not showing signs of running slowly.
Please ignore TestGrain; it has a Task.Sleep to slow it down.
If someone can guide us on how to get the memory dump that you'd like to have, we can help with that.
Again, we are using Docker (Linux image).
Unsure how to do a memdump on Linux, but you ought to be able to execute a command on a running docker container to dump dotnet, which may require installing something first. AFAIK the dotnet images run on Alpine Linux, but I'm not familiar with that.
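As a rough sketch, one way to grab a dump from a Linux container is via the createdump utility that ships alongside the .NET Core shared runtime. The container name, PID, runtime version, and paths below are all placeholders and will vary by image; this is an assumption about the layout, not a verified recipe.

```shell
# Hypothetical sketch: dumping a dotnet process inside a running container.
# "silo" is a placeholder container name; adjust paths/versions to your image.

# 1. Find the PID of the dotnet process inside the container.
docker exec silo ps aux | grep dotnet

# 2. Run createdump, which ships in the shared runtime folder
#    (it writes /tmp/coredump.<pid> by default).
docker exec silo /usr/share/dotnet/shared/Microsoft.NETCore.App/2.0.7/createdump 1

# 3. Copy the dump out of the container for analysis.
docker cp silo:/tmp/coredump.1 ./silo-coredump
```

If createdump is not present in the image, installing gdb inside the container and using gcore against the PID is a possible fallback.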
Status update: I was unable to reproduce this using the dashboard testhost app running a single silo. I'm trying again with a 2-silo app without the dashboard + a simple load generator. I'm running both a .NET Core and .NET 4.6.2 build to verify that this doesn't repro against Framework. I'm also taking memory dumps and logging statistics. Hopefully we can at least repro this tonight. I have one memory dump from @DarkCow which I've been taking a look at, but so far I don't see anything too suspicious: the threads seem to be functioning. I'll keep digging.
24 hours in using the new test app, I still see no change in latency in either .NET Core or .NET Framework. If there is some random event which causes this issue, then it hasn't been encountered on my host yet. More likely, there is some difference in environments (maybe providers, including telemetry, etc) which explains why I have not been able to reproduce the issue yet. Any thoughts on what those differences might be?
For now, the work-around would be to run the latest nightly, which apparently does not exhibit this problem.
The test app in its entirety is here: https://gist.github.com/ReubenBond/20aad47ce739c5d2a07f12a1a3d8c168 - maybe someone can spot what's missing. The fact that the silos/client are hosted in the same AppDomain might be an issue, but it seems unlikely (still communicating using TCP, but over localhost)
@ReubenBond How can I give you a memory dump file?
@aqa510415008 a-rebond@microsoft.com
Any update on this issue? We encountered the same symptoms after upgrading to Orleans 2 (on Service Fabric) and had to roll back to Orleans 1.5 due to slowness that increases over time.
@ReubenBond The memory dump file has been sent to you. Did you find any problem?
Since upgrading to Orleans 2 we have had general slowness in writing grain state. We tried to pin down the issue and measured the time it took to write JSONs of about 1 MB each to Azure Blob Storage; the mean went up from about 100ms to 400ms at peak times. We checked whether the data in blob storage had gotten any larger, but it was about the same size (1 MB). We suspected the upgrade to Orleans 2 was the cause, and when we downgraded back to Orleans 1 the write time returned to normal (this was the only change we made), so if the cause were something else, the downgrade should not have affected it.
This is the graph of the writeStateAsync() time from one of our grains "expert-grain":
We noticed that after restarting the nodes the write time goes down for a few hours before going up again (we used a script to restart the nodes every 4 hours for that reason).
Any suggestions to how we can investigate this further?
@guysegal & others, I haven't been able to reproduce this. Could you please describe your applications, focusing on which features of Orleans your apps are using, eg: storage + which provider, streams + which provider, clustering + which provider, where the clients are hosted (eg, alongside ASP.NET Core?), or any other details which might help me to reproduce this.
In the meantime, the apparent workaround is to run the latest nightly build, which reportedly does not exhibit this problem. This issue has not been exposed during load tests on .NET Core yet.
You can capture a dump using Visual Studio (in Debug menu, Attach to Process, Break All, Save Dump As) or MemoScope.NET (free), or dotMemory (trial/paid), or Task Manager (right-click, Create dump file). I'm not sure how you capture a dump from within a docker container.
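Another command-line option on Windows, for anyone who prefers not to attach a debugger, is Sysinternals procdump. The PID below is a placeholder for the silo process:

```shell
# Hypothetical sketch: full memory dump of a silo process with Sysinternals
# procdump. 1234 is a placeholder PID; -ma requests a full dump.
procdump -ma 1234 silo.dmp
```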
From the two sets of dumps which I have received, I see Task<Task> instances from the dashboard. @aqa510415008, could you try running without the dashboard and produce another memory dump from a slowed silo? On your silo, I see at least 30 threads blocked in dashboard code, as well as 72 threads blocked trying to write to the console via ConsoleLogger. 64 of those threads are logging due to a request timeout. I also see lock contention on Catalog.activations from two threads, one waiting on this lock and another in the for-loop inside. The thread in the for-loop is contending with an activation lock in ActivationData.GetRequestCount, but that lock is taken by ActivationData.EnqueueMessage, which is waiting on ConsoleLogger contention. I can't see a deadlock here, but I see a lot of contention.
As we know, not everyone experiencing this has the Dashboard installed, so we believe that is not the issue.
Again, any more information would be useful.
@Gabrn, my suspicion was that something in Azure Storage was causing increasing delays, but these two memory dumps that I have aren't using Azure Storage.
Are you able to attach a profiler to one of your slow silos to see where they are spending time? Can you create a minimal sample which demonstrates the slowdown you're experiencing? That would help me to diagnose it locally.
@aqa510415008 Can you share the memdump with the Dashboard team too? (I help @richorama maintain it.)
We made a lot of improvements to the Dashboard. I would test the new version 2.0.5.
@SebastianStehle Yes, but I don't have a dashboard team email
@ReubenBond I optimized the code:
1. Added [AlwaysInterleave] to frequently called methods.
2. Upgraded the Orleans Dashboard to version 2.0.3.
The situation is much better now; there are no more timeouts, although after running for a busy week the slowdown still grows slowly. I suspect that the blocking in the Orleans Dashboard may have been caused by the #4178 problem.
We are seeing similar issues. The only thing that's changed recently is my code is now stable enough that it runs for longer periods of time without having to be restarted 😬 I'll post as much information as I can.
Environment: CentOS 7 Orleans 2.0.0 .NET Core 2.0 Not using the Orleans Dashboard
Silo was started at 2018-05-24 07:37:41.245 GMT, this issue is showing errors starting today (May 30, 2018) at 14:42 GMT
Note that I picked RegisterProducer mostly at random, though it shows that this issue impacts (my) user code and Orleans code alike. I suspect it's either a memory pressure issue (unlikely) or a threading issue, which matches more of what's been reported above.
The first spike occurs at 14:42. Here is what was going on in the logs at that time (I was only logging at INFO, unfortunately):
2018-05-30 14:42:19.6069|INFO|Orleans.LifecycleSubject|Starting lifecycle stage 1000 took 10 Milliseconds
2018-05-30 14:42:19.6069|INFO|Orleans.LifecycleSubject|Starting lifecycle stage 2000 took 0 Milliseconds
2018-05-30 14:42:19.7144|INFO|Orleans.LifecycleSubject|Starting lifecycle stage 1000 took 10 Milliseconds
2018-05-30 14:42:19.7144|INFO|Orleans.LifecycleSubject|Starting lifecycle stage 2000 took 0 Milliseconds
2018-05-30 14:42:19.7144|INFO|Orleans.LifecycleSubject|Starting lifecycle stage 1000 took 10 Milliseconds
2018-05-30 14:42:19.7144|INFO|Orleans.LifecycleSubject|Starting lifecycle stage 2000 took 0 Milliseconds
2018-05-30 14:42:19.7253|INFO|Orleans.LifecycleSubject|Starting lifecycle stage 2000 took 166 Milliseconds
2018-05-30 14:42:34.5032|INFO|Orleans.Runtime.Dispatcher|Intermediate NonExistentActivation for message Request S127.0.0.1:11111:264843461*grn/C57BE023/23f8a586@d4b6ca8c->S127.0.0.1:11111:264843461*grn/716E8E94/99c798a3+SMSProvider_player@a57e64b8 #767033: , with Exception Orleans.Runtime.Catalog+NonExistentActivationException: Non-existent activation: [ActivationAddress: S127.0.0.1:11111:264843461*grn/716E8E94/99c798a3+SMSProvider_player@a57e64b8, Full GrainId: [GrainId: *grn/716E8E94/17d96bb84c0400087bc072c199c798a306000000716e8e94+SMSProvider_player-0x6CA98AA3, IdCategory: KeyExtGrain, BaseTypeCode: 1903070868 (x716E8E94), PrimaryKey: 4c040008-6bb8-17d9-a398-c799c172c07b (x{0x4c040008,0x6bb8,0x17d9,{0xa3,0x98,0xc7,0x99,0xc1,0x72,0xc0,0x7b}}), UniformHashCode: 1823050403 (0x6CA98AA3), KeyExtension: SMSProvider_player], Full ActivationId: @4e2b474d003b04696baf0563a57e64b80000000000000000], grain type: Orleans.Streams.PubSubRendezvousGrain.
at Orleans.Runtime.Catalog.GetOrCreateActivation(ActivationAddress address, Boolean newPlacement, String grainType, String genericArguments, Dictionary`2 requestContextData, Task& activatedPromise)
at Orleans.Runtime.Dispatcher.ReceiveMessage(Message message)
2018-05-30 14:42:34.5114|INFO|Orleans.Runtime.Dispatcher|Trying to forward after Non-existent activation, ForwardCount = 0. Message Request S127.0.0.1:11111:264843461*grn/C57BE023/23f8a586@d4b6ca8c->S127.0.0.1:11111:264843461*grn/716E8E94/99c798a3+SMSProvider_player@a57e64b8 #767033: .
2018-05-30 14:42:47.9953|INFO|Orleans.RuntimeSiloLogStatistics|Statistics: ^^^
App.Requests.Latency.Average.Millis=390.000
App.Requests.LatencyHistogram.Millis=[0.0512:0.1023]=1, [0.1024:0.2047]=4053, [0.2048:0.4095]=265245, [0.4096:0.8191]=30916, [0.8192:1.6383]=10944, [1.6384:3.2767]=26036, [3.2768:6.5535]=21542, [6.5536:13.1071]=95050, [13.1072:26.2143]=87603, [26.2144:52.4287]=93339, [52.4288:104.8575]=45083, [104.8576:209.7151]=20255, [209.7152:419.4303]=53587, [419.4304:838.8607]=2861, [838.8608:1677.7215]=1167, [1677.7216:3355.4431]=65, [3355.4432:6710.8863]=433, [6710.8864:13421.7727]=18, [13421.7728:26843.5455]=14, [26843.5456:53687.0911]=8837,
App.Requests.TimedOut.Current=8781, Delta=9
App.Requests.Total.Requests.Current=767049, Delta=563
Catalog.Activation.Collection.NumberOfCollections.Current=7201, Delta=4
Catalog.Activation.ConcurrentRegistrationAttempts.Current=23
Catalog.Activation.Created.Current=60317, Delta=75
Catalog.Activation.CurrentCount=161
Catalog.Activation.Destroyed.Current=60156, Delta=60
Catalog.Activation.FailedToActivate.Current=1
Catalog.Activation.NonExistentActivations.Current=299, Delta=1
Catalog.Activation.Shutdown.ViaCollection.Current=58772, Delta=59
Catalog.Activation.Shutdown.ViaDeactivateOnIdle.Current=1360, Delta=1
ConsistentRing.AverageRangePercentage=100.000
ConsistentRing.MyRange.RingDistance=x100000000
ConsistentRing.MyRange.RingPercentage=100.000
ConsistentRing.Ring=[S127.0.0.1:11111:264843461 -> <MultiRange: Size=x100000000, %Ring=100.000%>]
ConsistentRing.RingSize=1
Directory.CacheSize=0
Directory.Lookups.Cache.HitRatio=0, Delta=0
Directory.Lookups.Cache.Issued.Current=0
Directory.Lookups.Cache.Successes.Current=0
Directory.Lookups.Full.Issued.Current=68663, Delta=81
Directory.Lookups.Local.Issued.Current=288788, Delta=330
Directory.Lookups.Local.Successes.Current=220125, Delta=249
Directory.Lookups.LocalDirectory.Issued.Current=357451, Delta=411
Directory.Lookups.LocalDirectory.Successes.Current=220132, Delta=249
Directory.Lookups.Remote.Received.Current=0
Directory.Lookups.Remote.Sent.Current=0
Directory.PartitionSize=162
Directory.Registrations.Issued.Current=1814, Delta=1
Directory.Registrations.Local.Current=1814, Delta=1
Directory.Registrations.Remote.Received.Current=0
Directory.Registrations.Remote.Sent.Current=0
Directory.Registrations.SingleAct.Issued.Current=60317, Delta=75
Directory.Registrations.SingleAct.Local.Current=60317, Delta=75
Directory.Registrations.SingleAct.Remote.Received.Current=0
Directory.Registrations.SingleAct.Remote.Sent.Current=0
Directory.Ring=[S127.0.0.1:11111:264843461/8A327320]
Directory.Ring.MyPortion.AverageRingPercentage=100.000
Directory.Ring.MyPortion.RingDistance=0
Directory.Ring.MyPortion.RingPercentage=0.000
Directory.Ring.MyPredecessors=[]
Directory.Ring.MySuccessors=[]
Directory.Ring.RingSize=1
Directory.UnRegistrations.Issued.Current=300, Delta=1
Directory.UnRegistrations.Local.Current=60432, Delta=61
Directory.UnRegistrations.Remote.Received.Current=0
Directory.UnRegistrations.Remote.Sent.Current=0
Directory.UnRegistrationsMany.Issued.Current=15217, Delta=11
Directory.UnRegistrationsMany.Remote.Received.Current=0
Directory.UnRegistrationsMany.Remote.Sent.Current=0
Directory.Validations.Cache.Received.Current=0
Directory.Validations.Cache.Sent.Current=0
Dispatcher.NewPlacement.Current=68656, Delta=81
Gateway.ConnectedClients.Current=1
Gateway.LoadShedding.Current=0
Gateway.Received.Current=50048, Delta=50
Gateway.Sent.Current=50047, Delta=50
Grain.Atlas.Grains.Collectors.CollectionGrain.Current=15
Grain.Atlas.Grains.Collectors.CollectionManagerGrain.Current=2
Grain.Atlas.Grains.Features.PlatformSearchGrain.Current=2
Grain.Atlas.Grains.Features.TitleSearchGrain.Current=2
Grain.Atlas.Grains.OracleGrain.Current=0
Grain.Atlas.Grains.PersonaGrain.Current=5
Grain.Atlas.Grains.PlatformClientGrain.Current=3
Grain.Atlas.Grains.PlayerGrain.Current=66
Grain.Atlas.Grains.TitleAliasGrain.Current=5
Grain.Atlas.Grains.TitleClientGrain.Current=2
Grain.Atlas.Grains.UserGrain.Current=0
Grain.Orleans.Runtime.Versions.VersionStoreGrain.Current=0
Grain.Orleans.Storage.MemoryStorageGrain.Current=10
Grain.Orleans.Streams.PubSubRendezvousGrain.Current=49
Membership.ActiveCluster=[S127.0.0.1:11111:264843461]
Membership.ActiveClusterSize=1
MessageCenter.ReceiveQueueLength=0
MessageCenter.SendQueueLength=0
Messaging.Acceptor.AllocatedSocketEventArgs.Current=1
Messaging.Acceptor.CheckedInSocketEventArgs.Current=0
Messaging.Acceptor.CheckedOutSocketEventArgs.Current=1
Messaging.Acceptor.InUseSocketEventArgs=1
Messaging.Expired.AtDispatch.Current=262, Delta=1
Messaging.Expired.AtInvoke.Current=373, Delta=3
Messaging.Expired.AtReceive.Current=0
Messaging.Expired.AtRespond.Current=49
Messaging.Expired.AtSend.Current=8098, Delta=5
Messaging.Processing.ActivationData.All=0
Messaging.Processing.Dispatcher.Processed.Errors.Direction.OneWay.Current=0
Messaging.Processing.Dispatcher.Processed.Errors.Direction.Request.Current=540, Delta=2
Messaging.Processing.Dispatcher.Processed.Errors.Direction.Response.Current=21
Messaging.Processing.Dispatcher.Processed.Ok.Direction.OneWay.Current=0
Messaging.Processing.Dispatcher.Processed.Ok.Direction.Request.Current=270516, Delta=314
Messaging.Processing.Dispatcher.Processed.Ok.Direction.Response.Current=229087, Delta=266
Messaging.Processing.Dispatcher.Processed.ReRoute.Direction.OneWay.Current=0
Messaging.Processing.Dispatcher.Processed.ReRoute.Direction.Request.Current=9955, Delta=10
Messaging.Processing.Dispatcher.Processed.ReRoute.Direction.Response.Current=0
Messaging.Processing.Dispatcher.Processed.Total.Current=510119, Delta=592
Messaging.Processing.Dispatcher.Received.Direction.OneWay.Current=0
Messaging.Processing.Dispatcher.Received.Direction.Request.Current=280713, Delta=325
Messaging.Processing.Dispatcher.Received.Direction.Response.Current=229108, Delta=266
Messaging.Processing.Dispatcher.Received.OnActivationContext.Current=377700, Delta=428
Messaging.Processing.Dispatcher.Received.OnNullContext.Current=132121, Delta=163
Messaging.Processing.Dispatcher.Received.Total.Current=509821, Delta=591
Messaging.Processing.IGC.Forwarded.Current=9955, Delta=10
Messaging.Processing.IGC.ReRoute.Current=0
Messaging.Processing.IGC.Resent.Current=0
Messaging.Processing.IMA.Enqueued.ToActivationContex.Current=377700, Delta=428
Messaging.Processing.IMA.Enqueued.ToNullContex.Current=132121, Delta=163
Messaging.Processing.IMA.Enqueued.ToSystemTargetContex.Current=1067422, Delta=581
Messaging.Processing.IMA.Received.Current=1577243, Delta=1172
Messaging.Received.BatchSize.PerSocketDirection.GatewayToClient=0.000
Messaging.Received.BatchSize.PerSocketDirection.SiloToSilo=0.000
Messaging.Received.BatchSizeBytesHistogram.Bytes.PerSocketDirection.GatewayToClient=
Messaging.Received.BatchSizeBytesHistogram.Bytes.PerSocketDirection.SiloToSilo=
Messaging.Received.Bytes.Header.Current=6687540, Delta=6600
Messaging.Received.Bytes.Total.Current=8196889, Delta=8183
Messaging.Received.Direction.OneWay.Current=0
Messaging.Received.Direction.Request.Current=50048, Delta=50
Messaging.Received.Direction.Response.Current=0
Messaging.Received.Messages.From.Null.Current=50048, Delta=50
Messaging.Received.Messages.Total.Current=50048, Delta=50
Messaging.Received.MessageSizeHistogram.Bytes=[128:255]=50048,
Messaging.Rejected.Request.Current=1
Messaging.Rerouted.OneWay.Current=0
Messaging.Rerouted.Request.Current=50048, Delta=50
Messaging.Rerouted.Response.Current=0
Messaging.Sent.BatchSize.PerSocketDirection.GatewayToClient=0.000
Messaging.Sent.BatchSize.PerSocketDirection.SiloToSilo=0.000
Messaging.Sent.BatchSizeBytesHistogram.Bytes.PerSocketDirection.GatewayToClient=
Messaging.Sent.BatchSizeBytesHistogram.Bytes.PerSocketDirection.SiloToSilo=
Messaging.Sent.Bytes.Header.Current=7835662, Delta=7860
Messaging.Sent.Bytes.Total.Current=107711976, Delta=89189
Messaging.Sent.Direction.OneWay.Current=0
Messaging.Sent.Direction.Request.Current=0
Messaging.Sent.Direction.Response.Current=50047, Delta=50
Messaging.Sent.LocalMessages.Current=1577243, Delta=1172
Messaging.Sent.Messages.To.Null.Current=50047, Delta=50
Messaging.Sent.Messages.Total.Current=50047, Delta=50
Messaging.Sent.MessageSizeHistogram.Bytes=[128:255]=27873, [256:511]=2763, [512:1023]=440, [1024:2047]=7958, [4096:8191]=9064, [16384:32767]=1949,
Networking.Sockets.GatewayToClient.Duplex.Closed.Current=0
Networking.Sockets.GatewayToClient.Duplex.Opened.Current=1
Networking.Sockets.Silo.Receiving.Closed.Current=0
Networking.Sockets.Silo.Receiving.Opened.Current=0
Networking.Sockets.Silo.Sending.Closed.Current=0
Networking.Sockets.Silo.Sending.Opened.Current=0
Queues.QueueSize.Instantaneous.Scheduler.LevelOne=0
Reminders.AverageTardiness.Seconds=0 Secs
Reminders.NumberOfActiveReminders=0
Reminders.TicksDelivered.Current=0
Runtime.IsOverloaded=False
Runtime.Threads.AsynchAgent.Runtime.GrainDirectory.AdaptiveDirectoryCacheMaintainer`1.Current=1
Runtime.Threads.AsynchAgent.Runtime.GrainDirectory.GlobalSingleInstanceActivationMaintainer.Current=0
Runtime.Threads.AsynchAgent.Runtime.Messaging.GatewayAcceptor.Current=0
Runtime.Threads.AsynchAgent.Runtime.Messaging.GatewayClientCleanupAgent.Current=1
Runtime.Threads.AsynchAgent.Runtime.Messaging.GatewaySender.Current=1
Runtime.Threads.AsynchAgent.Runtime.Messaging.IncomingMessageAcceptor.Current=0
Runtime.Threads.AsynchAgent.Runtime.Messaging.IncomingMessageAgent.Current=3
Runtime.Threads.AsynchAgent.Runtime.Messaging.SiloMessageSender.Current=2
Runtime.Threads.AsynchAgent.Runtime.Scheduler.WorkerPoolThread.Current=9
Runtime.Threads.AsynchAgent.Runtime.Watchdog.Current=1
Runtime.Threads.AsynchAgent.TotalThreadsCreated.Current=21
Scheduler.NumLongQueueWaitTimes.Current=80
Scheduler.NumLongRunningTurns.Current=3
Scheduler.WorkItemGroupCount=175
Serialization.BufferPool.AllocatedBuffers.Current=250
Serialization.BufferPool.BuffersInPool=218
Serialization.BufferPool.CheckedInBuffers.Current=120000, Delta=115
Serialization.BufferPool.CheckedOutBuffers.Current=120032, Delta=115
Serialization.BufferPool.DroppedBuffers.Current=0
Serialization.BufferPool.DroppedTooLargeBuffers.Current=0
Serialization.BufferPool.InUse.AllocatedAndNotInPool_Buffers=32
Serialization.BufferPool.InUse.CheckedOutAndNotCheckedIn_Buffers=32
Silo.StartTime=2018-05-24 07:37:41.245 GMT
Storage.Activate.Errors.Current=0
Storage.Activate.Total.Current=0
Storage.Azure.Table.ServerBusy.Current=0
Storage.Clear.Errors.Current=0
Storage.Clear.Latency=0.0199104 Secs
Storage.Clear.Total.Current=1360, Delta=1
Storage.Read.Errors.Current=0
Storage.Read.Latency=0.0118666 Secs
Storage.Read.Total.Current=36787, Delta=47
Storage.Write.Errors.Current=1
Storage.Write.Latency=0.0199485 Secs
Storage.Write.Total.Current=53727, Delta=53
Streams.PubSub.Consumers.Added.Current=21484, Delta=28
Streams.PubSub.Consumers.Removed.Current=0
Streams.PubSub.Consumers.Total.Current=21484, Delta=28
Streams.PubSub.Producers.Added.Current=12121, Delta=14
Streams.PubSub.Producers.Removed.Current=3389, Delta=2
Streams.PubSub.Producers.Total.Current=8732, Delta=12
SystemTarget..Current=4
Transactions.Abort.Total.Current=0
Transactions.Commit.Completed.Current=0
Transactions.Commit.InDoubt.Current=0
Transactions.Commit.QueueLength.Current=0
Transactions.Commit.Request.Current=0
Transactions.Start.Completed.Current=0
Transactions.Start.QueueLength.Current=0
Transactions.Start.Request.Current=0
Watchdog.NumHealthChecks.Current=1812, Delta=1
I'm logging response times of every single grain method, so I can provide a graph for any Orleans grain method upon request.
@ReubenBond We're experiencing the issue quite badly too.
Could you please describe your applications, focusing on which features of Orleans your apps are using, eg: storage + which provider, streams + which provider, clustering + which provider, where the clients are hosted (eg, alongside ASP.NET Core?), or any other details which might help me to reproduce this.
From the two sets of dumps which I have received, I see:
Both co-host with ASP.NET Core/Kestrel (maybe only because of Orleans Dashboard?)
We're not using ASP.NET Core or Kestrel.
Both use AdoNet for something (one for clustering, persistence, & reminders; the other for reminders only)
We don't use ADO.NET for anything in Orleans (e.g. grain state, reminders, etc.). However, we do use it (via Dapper) in some custom grains to fetch/store some DB values.
Both have Orleans Dashboard
We don't have it.
my suspicion was that something in Azure Storage was causing increasing delays, but these two memory dumps that I have aren't using Azure Storage.
I had this suspicion too, but recently also found it wrong. Initially each time I saw a slowdown, it was reported by a grain that writes to its state and also saw "Slow access to Azure Table" log messages. However, recently I noticed these slowdowns are happening in grains that don't use any state at all.
My best guess, which also correlates with some of your findings, is Thread/Task pressure due to some internal Orleans bug. We aren't under any considerable load, and most interestingly, we had been running in this configuration for a few months before the issue first happened. We did publish code changes, though, but I cannot really correlate the issue with them; the first time the issue happened was 2-4 weeks after the previous release.
The release was to make several grains reentrant and to ensure no API/DB calls are made on the Orleans's thread pool.
Subjectively, the frequency of the issue recurring increases over time, from once every 1-2 weeks to several times a week. For instance, the last time I was fixing it, it happened 1 day after the previous time.
Usually the issue happens on a single silo. Once it has happened, that silo never goes back to normal until it is restarted. It also affects all grains, and likely streaming as well, running on that silo.
Our delays are quite aggressive - "Queue wait time ..." messages report delays on the order of 30-120 seconds.
@ilyalukyanov
We run Orleans 2.0.0-beta3
Can you try it with the latest release (2.0.3)? beta3 is fairly old, and we have taken quite a number of changes since then.
Even better, can you try a recent CI build from https://dotnet.myget.org/gallery/orleans-ci with the new scheduler rewritten by @dVakulen?
We are in the process of planning for 2.1, and this data would be crucial.
@sergeybykov yeah, upgrade to 2.0.3 is on the radar. I'll share updates when it's done.
I don't think I'll be able to try ci builds as we've not been able to reproduce the issue in test environments, so the version change would need to go to production.
Do you roughly know the timeframe for 2.1 at this stage? I appreciate it's not going to be accurate.
Even better, can you try a recent CI build from https://dotnet.myget.org/gallery/orleans-ci with the new scheduler rewritten by @dVakulen? We are in the process of planning for 2.1, and this data would be crucial.
We're also looking forward to this being released :)
@sergeybykov Is the new scheduler included in 2.0.3? Because I have a similar issue with the management grain (not necessarily; it may be some other grain) taking several seconds, then callbacks get delayed for 30 seconds and everything slows to a halt.
No, you have to use the nightly builds.
Did it solve your problem?
Yes, it is much better with the nightly build.
The silo is not going down anymore?
So far it is stable, but I also made performance improvements in my app and in the dashboard, so it is not easy to say what had the biggest impact.
What version of the dashboard are you using now?
@sergeybykov @DarkCow This is something I am struggling with, and your description is exact: "These grain calls are fine T<12 hours, but stall T>12 hours. Some calls are fast, but they are interleaved with slow grain calls.
Task ... took elapsed time 0:00:00.4702115 for execution, which is longer than 00:00:00.2000000. Running on thread System.Threading.Thread Response did not arrive on time in 00:00:30 for message ..."
But sometimes it goes for days, and at some point this pattern starts and I get this: "Response did not arrive on time in 00:00:30 for message".
I am using Orleans 1.4.2, which is fairly old. Do you think upgrading to 2.0.3 would help, or should I get a nightly build to get the new scheduler?
@mohamedhammad I'd suggest to start with 2.0.3, and consider trying nightly if you need to.
took elapsed time 0:00:00.4702115 for execution, which is longer than 00:00:00.2000000.
This is usually a sign of a blocking call (or calls): IO, call to web services, etc. Make sure they are all awaited properly. If grain threads get blocked by such code, timeouts inevitably happen soon after.
@sergeybykov (I reviewed the code for blocking; it is all awaited.) I will try 2.0.3. I think it's better not to use .NET Core until it is fully tested. Should I use .NET Framework?
We are in the process of moving our nightly tests to .NET Core. A number of users seem to run totally fine on it. But it's your choice in the end, of course.
Hi,
it is a little bit vague, but I have the impression that Orleans gets slower over time.
My specs are:
I have a very simple grain, that just returns the snapshot (my own state system):
https://github.com/Squidex/squidex/blob/master/src/Squidex.Domain.Apps.Entities/Schemas/SchemaGrain.cs#L305
I use my own serializer with JSON.NET, and my benchmarks show that the serialization usually takes 2-3 ms. I also tested it with states 100x larger than expected, and the performance is great.
I added some profiling to the client side, and I have observed that the call takes 4-5 ms right after I deploy, and several days later up to 200 ms or even more (the max I have seen is 10 seconds).
I also checked my MongoDB logs where I save all slow queries and there is nothing related.
I am a little bit lost.