Orleans gets slower over time

SebastianStehle commented 6 years ago

Hi,

it is a little bit vague, but I have the impression that Orleans gets slower over time.

My specs are:

.NET Core 2.0
Running on Kubernetes in GCloud, 2 Nodes/Silos
Co-Hosted with ASP.NET Core

I have a very simple grain, that just returns the snapshot (my own state system):

https://github.com/Squidex/squidex/blob/master/src/Squidex.Domain.Apps.Entities/Schemas/SchemaGrain.cs#L305

I use my own serializer with JSON.NET and my benchmarks show that the serialization takes 2-3 ms usually. I also tested it with 100x larger states that expected and the performance is great.

I added some profiling to the client side and I have experienced that the call takes 4-5 ms after I deploy it and several days later up to 200ms or even more (the max I have seen is 10seconds).

I also checked my MongoDB logs where I save all slow queries and there is nothing related.

I am a little bit lost.

xiazen commented 6 years ago

that seems odd. do you have memory dump from the server? if logs cannot give us anything with good insight, maybe memory dump will, because this looks like something slowly grow bad in the server.

Or do you monitor the memory and cpu use on the server? that might be an easier starting point than memory dump

SebastianStehle commented 6 years ago

Sorry, it is kubernetes cluster. No idea how to get a memory dump from the process. I observe the monitor and cpu usage, but there is nothing special.

SebastianStehle commented 6 years ago

There was also a short discussion on gitter and @DarkCow said:

I have noticed increasing timeout's orleans-2.0.0, I have deployed orleans-2.0.0-ci.20180417.4 which includes the replacement threading model by @dVakulen which seems to solve the problem.

As far as I can tell, what I am seeing is that a thread will deadlock/stall, it is hard to find which Orleans function is stalling. But the new threading model seems to be a work stealing queue. #3792 Yes, been in production since 4/19. This is the first I am seeing other people having performance problems after hours of runtime...

Yeah, I went through the pipeline, everything in Orleans is really fast, but there is something else stalling the execution. Are you seeing warning logs about Task ... took elapsed time 0:00:00.4702115 for execution, which is longer than 00:00:00.2000000. Running on thread System.Threading.Thread and then Response did not arrive on time in 00:00:30 for message

I don't see these warning messages for first 12+ hours of runtime, then they start cropping up. orleans-2.0.0-ci.20180417.4 has appeared to of solved it and/or masked it...

DarkCow commented 6 years ago

I am seeing a similar issue. I have a 17VM cluster, running on Microsoft server 2012r2, net core 2.0, I am using the standard code-gen serializer, ADO.NET for clustering/storage/reminders. and SimpleMessageStreaming for streams.

The issue After 12+ hours of runtime, Orleans begins to have problems with grain calls stalling. These grain calls are fine T<12 hours, but stall T>12 hours. Some calls are fast, but they are interleaved with slow grain calls.

Task ... took elapsed time 0:00:00.4702115 for execution, which is longer than 00:00:00.2000000. Running on thread System.Threading.Thread
Response did not arrive on time in 00:00:30 for message ...

Things I've investigated

Grain calls are 100% async await, there is no blocking IO calls.
Trace grain calls to specific grain code. There is no correlation between timeout warnings and code complexity. Some calls are Return Task.FromResult( State );
RAM usage, CPU usage and there are no leaks and/or CPU spikes.
A timeout warning does not correlate to 100% CPU usage.
Investigate if my code is causing the stall, as far as I can tell there is no correlation between a stall and the grain call. It feels like work is queued to be run, but the TaskScheduler/Thread is blocked
Run through the Orleans calls on the stack, everything runs/returns very fast.

Things I've tried

Increase core/ram provisions for VMs,
- Didn't help because I am <10% CPU <50% RAM
Changing to .NET Framework 4.7.1
- This helps, so does this mean that .NET Core handles threads/locks differently?
- If Framework is handling blocked/stalled thread differently, it might explain it.
Upgrading to orleans-2.0.0-ci.20180417.4 which is post #3792
- This seems to of fixed/masked the problem. The insight I gathered is that it is a work stealing pool.
- Work stealing leads me to think the blocked/stalled thread is handled better

stephenlautier commented 6 years ago

We do get similar issues as @DarkCow had explained as well. Especially timeouts which doesnt get resolved in 30sec, one major thing we had observed is when there are warnings of Queue wait time of xxx start adding up, things start slowing down and timing out.

Our setup

AWS EC2 instances (8x)
3 clusters for each environment (dev, staging, prod)
Dynamo Db for persistence state
Persistent Streams are on DynamoDb as well (we are changing this to SmS on next deploy to test it out)
We have 4 apps using orleans
4-8 silo instances (not all apps are the same)
Currently using Docker Swarm (kubi soon)
Docker base image for silo is microsoft/dotnet:2.0-runtime which is Linux
Running on .NET Core 2
Orleans version 2.0.0-rc2

Observations/info

Our CPU usage are quite low <15%
Our network in/out seems a bit high (and most of the time on a particular node is ~2.5x higher, and 2 seems to have no load at all)
We have 4 apps, most of them get the same issue (especially 2 of them which have higher load)
All environments get the same problems, fortunately prod seems to be more stable - instance types are bigger 1 level bigger
We dont have too many consumers yet (so requests arent high)
We do need to reboot at least once weekly especially on 2 specific apps (on prod)
We have alerts for Queue wait time of log warning, and when that gets triggered, generally we need to reboot or things get slowing down
Timeouts doesnt seem to happen only in our grains but also on PubSubRendezvousGrain

PubSubRendezvousGrain - Timeout exception log

PubSubRendezvousGrain - Queue wait time warning log

Machine network monitoring

SebFerraro commented 6 years ago

Just to bring this in here as suggested in gitter, we have been experiencing these timeouts for a while now: https://github.com/dotnet/orleans/issues/4060

To recap:

8 VMs 1 silo/1 client on each Client pulls from MSMQ, hardens to file, pushes to grain in silo and awaits processing, deletes file Clients randomly 'stop' (anywhere from 1-10 days after startup) being able to talk to silos and throw Response did not arrive on time errors Clients don't all stop at the same time but when the first one stops, all others usually stop within 12 hours This sometimes happens sporadically from within orleans itself (silo to silo), we can see this when running as a console app.

One of our infrastructure guys mentioned it may be network related in that something isn't being cleaned up/released properly. Whether that's ports or what I'm not sure.

SebastianStehle commented 6 years ago

Is there something we can do to provide you more data?

sergeybykov commented 6 years ago

I'm catching up with this thread.

Just to make sure. What level of logging are you running Orleans at. We kind of assume that it's normally Info and Debug is only used during development/testing and only for investigations.

I can tell that all our load tests currently run on full .NET Framework, not on .NET Core. We are going to get to running on .NET Core, but it'll take us some time. That's one variable I see that seems to be consistently different from our test environment.

If you could share memory dumps from silos in the "slow" state, that might provide additional insight.

SebastianStehle commented 6 years ago

Until a few days ago I was using Info level, but then I changed it to Warning, because Info does not provide that much helpful information and generates a lot of noise.

ReubenBond commented 6 years ago

@SebastianStehle @DarkCow @stephenlautier do you have log files which you're able to share with us? Above you mentioned that 12h is the point at which this happens, @DarkCow - is it suspiciously close to the 12 hour mark or is that a rough estimate?

A memory dump from a slow/stalled process may also prove useful.

DarkCow commented 6 years ago

I have now been running orleans-2.0.0-ci.20180417.4 which is post #3792 for ~2 weeks and the problem is solved or masked. But I will revert the staging environment back to the worst case scenario which was Orleans 2.0.0 && .NET core 2.0 Will be back with data hopefully by Thursday.

@ReubenBond There is no correlation between time. Just averages out to every 12 hours I had to restart my system.

@sergeybykov I have noticed that the problem is solved/masked in .net framework over .net core.

sergeybykov commented 6 years ago

We'll prioritize load testing on .NET Core higher to try to repro this in house. In the meantime, any relevant logs and memory dumps would be very helpful.

lfzm commented 6 years ago

The longer the running time, the higher this value will be. When it reaches 30 or so, this problem will occur. The initial time is only about 0.3. w pl h 2f o_0h g 1 fq

SebastianStehle commented 6 years ago

I experienced the same, especially methods like GetTotalActivationCount() are super slow.

richorama commented 6 years ago

I wonder if this is dashboard related? Are people experiencing this without the dashboard enabled?

stephenlautier commented 6 years ago

@richorama we dont have the dashboard installed, so its not related

lfzm commented 6 years ago

@sergeybykov @ReubenBond How is this problem internal test? Our system is so urgent on the Internet now. Is there any plan to solve this problem?

ReubenBond commented 6 years ago

@aqa510415008 if the situation is urgent, would you consider trying to use a nightly build from https://dotnet.myget.org/gallery/orleans-ci? @DarkCow has reported that the issue does not appear to affect the nightly builds.

If you are able to help by providing logs or a memory dump then that may help us resolve this sooner.

richorama commented 6 years ago

I ran the TestHost project in the OrleansDashboard project overnight. It's an Orleans 2.0.0, 3 silo local cluster with memory providers, running on an i7 quad core with 16GB RAM.

At the start of the run you can see calls to the management grain take <1ms.

After ~11 hours, the management grain is averaging 24ms per request.

It's interesting to note that other grains are not showing signs of running slowly.

Please ignore TestGrain, it has a Task.Sleep to slow it down.

stephenlautier commented 6 years ago

if someone can guide us on how we can get the memory dump that you'd like to have, we can help with that.

Again we are using docker (linux image)

ReubenBond commented 6 years ago

Unsure how to do a memdump on Linux, but you ought to be able to execute a command on a running docker container to dump dotnet, which may require installing something first. AFAIK the dotnet images run on Alpine Linux, but I'm not familiar with that.

Status update: I was unable to reproduce this using the dashboard testhost app running a single silo. I'm trying again with a 2-silo app without the dashboard + a simple load generator. I'm running both a .NET Core and .NET 4.6.2 build to verify that this doesn't repro against Framework. I'm also taking memory dumps and logging statistics. Hopefully we can at least repro this tonight. I have one memory dump from @DarkCow which I've been taking a look at, but so far I don't see anything too suspicious: the threads seem to be functioning. I'll keep digging.

ReubenBond commented 6 years ago

24 hours in using the new test app, I still see no change in latency in either .NET Core or .NET Framework. If there is some random event which causes this issue, then it hasn't been encountered on my host yet. More likely, there is some difference in environments (maybe providers, including telemetry, etc) which explains why I have not been able to reproduce the issue yet. Any thoughts on what those differences might be?

For now, the work-around would be to run the latest nightly, which apparently does not exhibit this problem.

The test app in its entirety is here: https://gist.github.com/ReubenBond/20aad47ce739c5d2a07f12a1a3d8c168 - maybe someone can spot what's missing. The fact that the silos/client are hosted in the same AppDomain might be an issue, but it seems unlikely (still communicating using TCP, but over localhost)

lfzm commented 6 years ago

@ReubenBond How to give you a memory dump file?

jonathansant commented 6 years ago

https://github.com/dotnet/orleans/issues/4283

ReubenBond commented 6 years ago

@aqa510415008 a-rebond@microsoft.com

guysegal commented 6 years ago

Any update on this issue? We encountered the same sympthoms after upgrading to Oreleans 2 (on service fabric) and had to rollback to Orleans 1.5 due to slowness that increases over time

lfzm commented 6 years ago

@ReubenBond The memory dump file has been sent to you. Is there any problem?

Gabrn commented 6 years ago

Since upgrading to Orleans2 we have had general slowness in writing grain state. We tried to pin down the issue and we gauged the time it took for us to write jsons of about 1mb each to azure blob storage and the mean went up from about 100ms to 400ms in peak times. We looked in the blob storage to see if the data got any larger but it was about the same size (1mb). We suspected the upgrade to orleans2 to be the cause and when we downgraded back to orleans1 the write time got back to normal (this was the only change we made) so if the cause was something else the downgrade should not have effected this.

This is the graph of the writeStateAsync() time from one of our grains "expert-grain": screen shot 2018-05-20 at 15 11 21

We noticed that after restarting the nodes the write time goes down for a few hours before going up again (we used a script to restart the nodes every 4 hours for that reason).

Any suggestions to how we can investigate this further?

ReubenBond commented 6 years ago

@guysegal & others, I haven't been able to reproduce this. Could you please describe your applications, focusing on which features of Orleans your apps are using, eg: storage + which provider, streams + which provider, clustering + which provider, where the clients are hosted (eg, alongside ASP.NET Core?), or any other details which might help me to reproduce this.

In the meantime, the apparent workarounds are:

Use nightly builds: https://dotnet.myget.org/gallery/orleans-ci (new scheduler)
Use .NET Framework instead of .NET Core

This issue has not been exposed during load tests on .NET Core yet.

You can capture a dump using Visual Studio (in Debug menu, Attach to Process, Break All, Save Dump As) or MemoScope.NET (free), or dotMemory (trial/paid), or Task Manager (right-click, Create dump file). I'm not sure how you capture a dump from within a docker container.

From the two sets of dumps which I have received, I see:

Both co-host with ASP.NET Core/Kestrel (maybe only because of Orleans Dashboard?)
Both use AdoNet for something (one for clustering, persistence, & reminders, the other for reminders only
Both have Orleans Dashboard
@aqa510415008, one of your silos has around 71K Task<Task> instances from the dashboard. @aqa510415008 could you try running without the dashboard and produce another memory dump from a slowed silo? On your silo, I see at least 30 threads blocked in dashboard code as well as 72 threads blocked trying to write to the console via ConsoleLogger. 64 of those threads are logging due to a request timeout. I also see lock contention on Catalog.activations from two threads, one waiting on this lock and another in the for-loop inside. The thread in the for-loop is contending with an activation lock in ActivationData.GetRequestCount, but that lock is taken by ActivationData.EnqueueMessage, which is waiting on ConsoleLogger contention. I can't see a deadlock here, but I see a lot of contention.
@DarkCow I see 30 threads blocked in dashboard code in one of your memdumps, too. The other dumps look fine - just fairly idle silos.

As we know, not everyone experiencing this has the Dashboard installed, so we believe that is not the issue.

Again, any more information would be useful.

ReubenBond commented 6 years ago

@Gabrn, my suspicion was that something in Azure Storage was causing increasing delays, but these two memory dumps that I have aren't using Azure Storage.

Are you able to attach a profiler to one of your slow silos to see where they are spending time? Can you create a minimal sample which demonstrates the slowdown you're experiencing? That would help me to diagnose it locally.

SebastianStehle commented 6 years ago

@aqa510415008 Can you share the memdump with the Dashboard team too? (I support @richorama to maintain it)

SebastianStehle commented 6 years ago

We made a lot of improvements to the Dashboard. I would test the new version 2.0.5.

lfzm commented 6 years ago

@SebastianStehle Yes, but I don't have a dashboard team email

lfzm commented 6 years ago

@ReubenBond I optimized the code, 1, the call is frequently added method [AlwaysInterleave] 2, Orleans dashboard upgrade 2.0.3 version

The situation is much better now no longer in the case of overtime, the phenomenon of running a busy week is growing slowly. I suspect that the blockage of the Orleans dashboard may have been caused by the #4178 problem.

lwansbrough commented 6 years ago

We are seeing similar issues. The only thing that's changed recently is my code is now stable enough that it runs for longer periods of time without having to be restarted 😬 I'll post as much information as I can.

Environment: CentOS 7 Orleans 2.0.0 .NET Core 2.0 Not using the Orleans Dashboard

Silo was started at 2018-05-24 07:37:41.245 GMT, this issue is showing errors starting today (May 30, 2018) at 14:42 GMT

Note that I picked RegisterProducer mostly randomly, though it shows that this issue is impacting (my) user code, and Orleans code alike. Suspect it's either a memory pressure issue (unlikely) or a thread issue which describes more of what's been reported above.

The first spike occurs at 14:42. Here is what was going on in the logs at that time (I was only logging at INFO unfortunately)

2018-05-30 14:42:19.6069|INFO|Orleans.LifecycleSubject|Starting lifecycle stage 1000 took 10 Milliseconds
2018-05-30 14:42:19.6069|INFO|Orleans.LifecycleSubject|Starting lifecycle stage 2000 took 0 Milliseconds
2018-05-30 14:42:19.7144|INFO|Orleans.LifecycleSubject|Starting lifecycle stage 1000 took 10 Milliseconds
2018-05-30 14:42:19.7144|INFO|Orleans.LifecycleSubject|Starting lifecycle stage 2000 took 0 Milliseconds
2018-05-30 14:42:19.7144|INFO|Orleans.LifecycleSubject|Starting lifecycle stage 1000 took 10 Milliseconds
2018-05-30 14:42:19.7144|INFO|Orleans.LifecycleSubject|Starting lifecycle stage 2000 took 0 Milliseconds
2018-05-30 14:42:19.7253|INFO|Orleans.LifecycleSubject|Starting lifecycle stage 2000 took 166 Milliseconds
2018-05-30 14:42:34.5032|INFO|Orleans.Runtime.Dispatcher|Intermediate NonExistentActivation for message Request S127.0.0.1:11111:264843461*grn/C57BE023/23f8a586@d4b6ca8c->S127.0.0.1:11111:264843461*grn/716E8E94/99c798a3+SMSProvider_player@a57e64b8 #767033: , with Exception Orleans.Runtime.Catalog+NonExistentActivationException: Non-existent activation: [ActivationAddress: S127.0.0.1:11111:264843461*grn/716E8E94/99c798a3+SMSProvider_player@a57e64b8, Full GrainId: [GrainId: *grn/716E8E94/17d96bb84c0400087bc072c199c798a306000000716e8e94+SMSProvider_player-0x6CA98AA3, IdCategory: KeyExtGrain, BaseTypeCode: 1903070868 (x716E8E94), PrimaryKey: 4c040008-6bb8-17d9-a398-c799c172c07b (x{0x4c040008,0x6bb8,0x17d9,{0xa3,0x98,0xc7,0x99,0xc1,0x72,0xc0,0x7b}}), UniformHashCode: 1823050403 (0x6CA98AA3), KeyExtension: SMSProvider_player], Full ActivationId: @4e2b474d003b04696baf0563a57e64b80000000000000000], grain type: Orleans.Streams.PubSubRendezvousGrain.
   at Orleans.Runtime.Catalog.GetOrCreateActivation(ActivationAddress address, Boolean newPlacement, String grainType, String genericArguments, Dictionary`2 requestContextData, Task& activatedPromise)
   at Orleans.Runtime.Dispatcher.ReceiveMessage(Message message)
2018-05-30 14:42:34.5114|INFO|Orleans.Runtime.Dispatcher|Trying to forward after Non-existent activation, ForwardCount = 0. Message Request S127.0.0.1:11111:264843461*grn/C57BE023/23f8a586@d4b6ca8c->S127.0.0.1:11111:264843461*grn/716E8E94/99c798a3+SMSProvider_player@a57e64b8 #767033: .

2018-05-30 14:42:47.9953|INFO|Orleans.RuntimeSiloLogStatistics|Statistics: ^^^
App.Requests.Latency.Average.Millis=390.000
App.Requests.LatencyHistogram.Millis=[0.0512:0.1023]=1, [0.1024:0.2047]=4053, [0.2048:0.4095]=265245, [0.4096:0.8191]=30916, [0.8192:1.6383]=10944, [1.6384:3.2767]=26036, [3.2768:6.5535]=21542, [6.5536:13.1071]=95050, [13.1072:26.2143]=87603, [26.2144:52.4287]=93339, [52.4288:104.8575]=45083, [104.8576:209.7151]=20255, [209.7152:419.4303]=53587, [419.4304:838.8607]=2861, [838.8608:1677.7215]=1167, [1677.7216:3355.4431]=65, [3355.4432:6710.8863]=433, [6710.8864:13421.7727]=18, [13421.7728:26843.5455]=14, [26843.5456:53687.0911]=8837,
App.Requests.TimedOut.Current=8781,      Delta=9
App.Requests.Total.Requests.Current=767049,      Delta=563
Catalog.Activation.Collection.NumberOfCollections.Current=7201,      Delta=4
Catalog.Activation.ConcurrentRegistrationAttempts.Current=23
Catalog.Activation.Created.Current=60317,      Delta=75
Catalog.Activation.CurrentCount=161
Catalog.Activation.Destroyed.Current=60156,      Delta=60
Catalog.Activation.FailedToActivate.Current=1
Catalog.Activation.NonExistentActivations.Current=299,      Delta=1
Catalog.Activation.Shutdown.ViaCollection.Current=58772,      Delta=59
Catalog.Activation.Shutdown.ViaDeactivateOnIdle.Current=1360,      Delta=1
ConsistentRing.AverageRangePercentage=100.000
ConsistentRing.MyRange.RingDistance=x100000000
ConsistentRing.MyRange.RingPercentage=100.000
ConsistentRing.Ring=[S127.0.0.1:11111:264843461 -> <MultiRange: Size=x100000000, %Ring=100.000%>]
ConsistentRing.RingSize=1
Directory.CacheSize=0
Directory.Lookups.Cache.HitRatio=0, Delta=0
Directory.Lookups.Cache.Issued.Current=0
Directory.Lookups.Cache.Successes.Current=0
Directory.Lookups.Full.Issued.Current=68663,      Delta=81
Directory.Lookups.Local.Issued.Current=288788,      Delta=330
Directory.Lookups.Local.Successes.Current=220125,      Delta=249
Directory.Lookups.LocalDirectory.Issued.Current=357451,      Delta=411
Directory.Lookups.LocalDirectory.Successes.Current=220132,      Delta=249
Directory.Lookups.Remote.Received.Current=0
Directory.Lookups.Remote.Sent.Current=0
Directory.PartitionSize=162
Directory.Registrations.Issued.Current=1814,      Delta=1
Directory.Registrations.Local.Current=1814,      Delta=1
Directory.Registrations.Remote.Received.Current=0
Directory.Registrations.Remote.Sent.Current=0
Directory.Registrations.SingleAct.Issued.Current=60317,      Delta=75
Directory.Registrations.SingleAct.Local.Current=60317,      Delta=75
Directory.Registrations.SingleAct.Remote.Received.Current=0
Directory.Registrations.SingleAct.Remote.Sent.Current=0
Directory.Ring=[S127.0.0.1:11111:264843461/8A327320]
Directory.Ring.MyPortion.AverageRingPercentage=100.000
Directory.Ring.MyPortion.RingDistance=0
Directory.Ring.MyPortion.RingPercentage=0.000
Directory.Ring.MyPredecessors=[]
Directory.Ring.MySuccessors=[]
Directory.Ring.RingSize=1
Directory.UnRegistrations.Issued.Current=300,      Delta=1
Directory.UnRegistrations.Local.Current=60432,      Delta=61
Directory.UnRegistrations.Remote.Received.Current=0
Directory.UnRegistrations.Remote.Sent.Current=0
Directory.UnRegistrationsMany.Issued.Current=15217,      Delta=11
Directory.UnRegistrationsMany.Remote.Received.Current=0
Directory.UnRegistrationsMany.Remote.Sent.Current=0
Directory.Validations.Cache.Received.Current=0
Directory.Validations.Cache.Sent.Current=0
Dispatcher.NewPlacement.Current=68656,      Delta=81
Gateway.ConnectedClients.Current=1
Gateway.LoadShedding.Current=0
Gateway.Received.Current=50048,      Delta=50
Gateway.Sent.Current=50047,      Delta=50
Grain.Atlas.Grains.Collectors.CollectionGrain.Current=15
Grain.Atlas.Grains.Collectors.CollectionManagerGrain.Current=2
Grain.Atlas.Grains.Features.PlatformSearchGrain.Current=2
Grain.Atlas.Grains.Features.TitleSearchGrain.Current=2
Grain.Atlas.Grains.OracleGrain.Current=0
Grain.Atlas.Grains.PersonaGrain.Current=5
Grain.Atlas.Grains.PlatformClientGrain.Current=3
Grain.Atlas.Grains.PlayerGrain.Current=66
Grain.Atlas.Grains.TitleAliasGrain.Current=5
Grain.Atlas.Grains.TitleClientGrain.Current=2
Grain.Atlas.Grains.UserGrain.Current=0
Grain.Orleans.Runtime.Versions.VersionStoreGrain.Current=0
Grain.Orleans.Storage.MemoryStorageGrain.Current=10
Grain.Orleans.Streams.PubSubRendezvousGrain.Current=49
Membership.ActiveCluster=[S127.0.0.1:11111:264843461]
Membership.ActiveClusterSize=1
MessageCenter.ReceiveQueueLength=0
MessageCenter.SendQueueLength=0
Messaging.Acceptor.AllocatedSocketEventArgs.Current=1
Messaging.Acceptor.CheckedInSocketEventArgs.Current=0
Messaging.Acceptor.CheckedOutSocketEventArgs.Current=1
Messaging.Acceptor.InUseSocketEventArgs=1
Messaging.Expired.AtDispatch.Current=262,      Delta=1
Messaging.Expired.AtInvoke.Current=373,      Delta=3
Messaging.Expired.AtReceive.Current=0
Messaging.Expired.AtRespond.Current=49
Messaging.Expired.AtSend.Current=8098,      Delta=5
Messaging.Processing.ActivationData.All=0
Messaging.Processing.Dispatcher.Processed.Errors.Direction.OneWay.Current=0
Messaging.Processing.Dispatcher.Processed.Errors.Direction.Request.Current=540,      Delta=2
Messaging.Processing.Dispatcher.Processed.Errors.Direction.Response.Current=21
Messaging.Processing.Dispatcher.Processed.Ok.Direction.OneWay.Current=0
Messaging.Processing.Dispatcher.Processed.Ok.Direction.Request.Current=270516,      Delta=314
Messaging.Processing.Dispatcher.Processed.Ok.Direction.Response.Current=229087,      Delta=266
Messaging.Processing.Dispatcher.Processed.ReRoute.Direction.OneWay.Current=0
Messaging.Processing.Dispatcher.Processed.ReRoute.Direction.Request.Current=9955,      Delta=10
Messaging.Processing.Dispatcher.Processed.ReRoute.Direction.Response.Current=0
Messaging.Processing.Dispatcher.Processed.Total.Current=510119,      Delta=592
Messaging.Processing.Dispatcher.Received.Direction.OneWay.Current=0
Messaging.Processing.Dispatcher.Received.Direction.Request.Current=280713,      Delta=325
Messaging.Processing.Dispatcher.Received.Direction.Response.Current=229108,      Delta=266
Messaging.Processing.Dispatcher.Received.OnActivationContext.Current=377700,      Delta=428
Messaging.Processing.Dispatcher.Received.OnNullContext.Current=132121,      Delta=163
Messaging.Processing.Dispatcher.Received.Total.Current=509821,      Delta=591
Messaging.Processing.IGC.Forwarded.Current=9955,      Delta=10
Messaging.Processing.IGC.ReRoute.Current=0
Messaging.Processing.IGC.Resent.Current=0
Messaging.Processing.IMA.Enqueued.ToActivationContex.Current=377700,      Delta=428
Messaging.Processing.IMA.Enqueued.ToNullContex.Current=132121,      Delta=163
Messaging.Processing.IMA.Enqueued.ToSystemTargetContex.Current=1067422,      Delta=581
Messaging.Processing.IMA.Received.Current=1577243,      Delta=1172
Messaging.Received.BatchSize.PerSocketDirection.GatewayToClient=0.000
Messaging.Received.BatchSize.PerSocketDirection.SiloToSilo=0.000
Messaging.Received.BatchSizeBytesHistogram.Bytes.PerSocketDirection.GatewayToClient=
Messaging.Received.BatchSizeBytesHistogram.Bytes.PerSocketDirection.SiloToSilo=
Messaging.Received.Bytes.Header.Current=6687540,      Delta=6600
Messaging.Received.Bytes.Total.Current=8196889,      Delta=8183
Messaging.Received.Direction.OneWay.Current=0
Messaging.Received.Direction.Request.Current=50048,      Delta=50
Messaging.Received.Direction.Response.Current=0
Messaging.Received.Messages.From.Null.Current=50048,      Delta=50
Messaging.Received.Messages.Total.Current=50048,      Delta=50
Messaging.Received.MessageSizeHistogram.Bytes=[128:255]=50048,
Messaging.Rejected.Request.Current=1
Messaging.Rerouted.OneWay.Current=0
Messaging.Rerouted.Request.Current=50048,      Delta=50
Messaging.Rerouted.Response.Current=0
Messaging.Sent.BatchSize.PerSocketDirection.GatewayToClient=0.000
Messaging.Sent.BatchSize.PerSocketDirection.SiloToSilo=0.000
Messaging.Sent.BatchSizeBytesHistogram.Bytes.PerSocketDirection.GatewayToClient=
Messaging.Sent.BatchSizeBytesHistogram.Bytes.PerSocketDirection.SiloToSilo=
Messaging.Sent.Bytes.Header.Current=7835662,      Delta=7860
Messaging.Sent.Bytes.Total.Current=107711976,      Delta=89189
Messaging.Sent.Direction.OneWay.Current=0
Messaging.Sent.Direction.Request.Current=0
Messaging.Sent.Direction.Response.Current=50047,      Delta=50
Messaging.Sent.LocalMessages.Current=1577243,      Delta=1172
Messaging.Sent.Messages.To.Null.Current=50047,      Delta=50
Messaging.Sent.Messages.Total.Current=50047,      Delta=50
Messaging.Sent.MessageSizeHistogram.Bytes=[128:255]=27873, [256:511]=2763, [512:1023]=440, [1024:2047]=7958, [4096:8191]=9064, [16384:32767]=1949,
Networking.Sockets.GatewayToClient.Duplex.Closed.Current=0
Networking.Sockets.GatewayToClient.Duplex.Opened.Current=1
Networking.Sockets.Silo.Receiving.Closed.Current=0
Networking.Sockets.Silo.Receiving.Opened.Current=0
Networking.Sockets.Silo.Sending.Closed.Current=0
Networking.Sockets.Silo.Sending.Opened.Current=0
Queues.QueueSize.Instantaneous.Scheduler.LevelOne=0
Reminders.AverageTardiness.Seconds=0 Secs
Reminders.NumberOfActiveReminders=0
Reminders.TicksDelivered.Current=0
Runtime.IsOverloaded=False
Runtime.Threads.AsynchAgent.Runtime.GrainDirectory.AdaptiveDirectoryCacheMaintainer`1.Current=1
Runtime.Threads.AsynchAgent.Runtime.GrainDirectory.GlobalSingleInstanceActivationMaintainer.Current=0
Runtime.Threads.AsynchAgent.Runtime.Messaging.GatewayAcceptor.Current=0
Runtime.Threads.AsynchAgent.Runtime.Messaging.GatewayClientCleanupAgent.Current=1
Runtime.Threads.AsynchAgent.Runtime.Messaging.GatewaySender.Current=1
Runtime.Threads.AsynchAgent.Runtime.Messaging.IncomingMessageAcceptor.Current=0
Runtime.Threads.AsynchAgent.Runtime.Messaging.IncomingMessageAgent.Current=3
Runtime.Threads.AsynchAgent.Runtime.Messaging.SiloMessageSender.Current=2
Runtime.Threads.AsynchAgent.Runtime.Scheduler.WorkerPoolThread.Current=9
Runtime.Threads.AsynchAgent.Runtime.Watchdog.Current=1
Runtime.Threads.AsynchAgent.TotalThreadsCreated.Current=21
Scheduler.NumLongQueueWaitTimes.Current=80
Scheduler.NumLongRunningTurns.Current=3
Scheduler.WorkItemGroupCount=175
Serialization.BufferPool.AllocatedBuffers.Current=250
Serialization.BufferPool.BuffersInPool=218
Serialization.BufferPool.CheckedInBuffers.Current=120000,      Delta=115
Serialization.BufferPool.CheckedOutBuffers.Current=120032,      Delta=115
Serialization.BufferPool.DroppedBuffers.Current=0
Serialization.BufferPool.DroppedTooLargeBuffers.Current=0
Serialization.BufferPool.InUse.AllocatedAndNotInPool_Buffers=32
Serialization.BufferPool.InUse.CheckedOutAndNotCheckedIn_Buffers=32
Silo.StartTime=2018-05-24 07:37:41.245 GMT
Storage.Activate.Errors.Current=0
Storage.Activate.Total.Current=0
Storage.Azure.Table.ServerBusy.Current=0
Storage.Clear.Errors.Current=0
Storage.Clear.Latency=0.0199104 Secs
Storage.Clear.Total.Current=1360,      Delta=1
Storage.Read.Errors.Current=0
Storage.Read.Latency=0.0118666 Secs
Storage.Read.Total.Current=36787,      Delta=47
Storage.Write.Errors.Current=1
Storage.Write.Latency=0.0199485 Secs
Storage.Write.Total.Current=53727,      Delta=53
Streams.PubSub.Consumers.Added.Current=21484,      Delta=28
Streams.PubSub.Consumers.Removed.Current=0
Streams.PubSub.Consumers.Total.Current=21484,      Delta=28
Streams.PubSub.Producers.Added.Current=12121,      Delta=14
Streams.PubSub.Producers.Removed.Current=3389,      Delta=2
Streams.PubSub.Producers.Total.Current=8732,      Delta=12
SystemTarget..Current=4
Transactions.Abort.Total.Current=0
Transactions.Commit.Completed.Current=0
Transactions.Commit.InDoubt.Current=0
Transactions.Commit.QueueLength.Current=0
Transactions.Commit.Request.Current=0
Transactions.Start.Completed.Current=0
Transactions.Start.QueueLength.Current=0
Transactions.Start.Request.Current=0
Watchdog.NumHealthChecks.Current=1812,      Delta=1

I'm logging response times of every single grain method, so I can provide a graph for any Orleans grain method upon request.

ilyalukyanov commented 6 years ago

@ReubenBond We're experiencing the issue quite badly too.

Could you please describe your applications, focusing on which features of Orleans your apps are using, eg: storage + which provider, streams + which provider, clustering + which provider, where the clients are hosted (eg, alongside ASP.NET Core?), or any other details which might help me to reproduce this.

We're using Azure Storage Provider: Table storage for grain persistence, reminders and membership and Blob storage for PubSub.
We're using streams with SMS provider. We use out-of-silo stream clients.
We're using reminders
We have 3 silos in the cluster. Both silos and all clients are dotnet core 2 linux applications. Hosted in Docker in Kubernetes.
We run Orleans 2.0.0-beta3

From the two sets of dumps which I have received, I see:

Both co-host with ASP.NET Core/Kestrel (maybe only because of Orleans Dashboard?)

We're not using ASP.NET Core or Kestrel.

Both use AdoNet for something (one for clustering, persistence, & reminders, the other for reminders only

We don't use ADO.NET for anything in Orleans (e.g. grain state, reminders, etc.). However, we do use it (via Dapper) in some custom grains to fetch/store some DB values.

Both have Orleans Dashboard

We don't have it.

my suspicion was that something in Azure Storage was causing increasing delays, but these two memory dumps that I have aren't using Azure Storage.

I had this suspicion too, but recently also found it wrong. Initially each time I saw a slowdown, it was reported by a grain that writes to its state and also saw "Slow access to Azure Table" log messages. However, recently I noticed these slowdowns are happening in grains that don't use any state at all.

My best guess, which also correlates with some of your findings, is Thread/Task pressure due to some internal Orleans bug. We aren't under any considerable load and most interestingly, we had been running in this configuration for a few months until the issue first happened. We did publish code changes tho, but I cannot really correlate the issue with them - the first time the issue happened was in 2-4 weeks since previous release.

The release was to make several grains reentrant and to ensure no API/DB calls are made on the Orleans's thread pool.

Subjectively, the frequency of issue repeating increases over time - from once 1-2 weeks to several times a week. For instance, the last time I was fixing it happened 1 day after the previous time.

Usually the issue happens on a single silo. Once happened, it never goes back to normal until that silo is restarted. It also affects all grains and likely streaming as well running on the silo.

Our delays are quite aggressive - "Queue wait time ..." messages report delays on the order of 30-120 seconds.

sergeybykov commented 6 years ago

@ilyalukyanov

We run Orleans 2.0.0-beta3

Can you try it with the latest release (2.0.3)? beta3 is fairly old, and we took quite a number of changes since it.

Even better, can you try a recent CI build from https://dotnet.myget.org/gallery/orleans-ci with the new scheduler rewritten by @dVakulen?

We are in the process of planning for 2.1, and this data would be crucial.

ilyalukyanov commented 6 years ago

@sergeybykov yeah, upgrade to 2.0.3 is on the radar. I'll share updates when it's done.

I don't think I'll be able to try ci builds as we've not been able to reproduce the issue in test environments, so the version change would need to go to production.

Do you roughly know the timeframe for 2.1 at this stage? I appreciate it's not going to be accurate.

shlomiw commented 6 years ago

Even better, can you try a recent CI build from https://dotnet.myget.org/gallery/orleans-ci with the new scheduler rewritten by @dVakulen? We are in the process of planning for 2.1, and this data would be crucial.

We're also looking forward for this to be released :)

mohamedhammad commented 6 years ago

@sergeybykov is the new scheduler included in 2.0.3 ? because i have a similar issue with management grain(not necessarily may be some other grain) taking several seconds,then callback get delayed for 30 seconds and slowing to a halt.

SebastianStehle commented 6 years ago

No, you have to use the nightly builds.

mohamedhammad commented 6 years ago

did it solve your problem ?

SebastianStehle commented 6 years ago

Yes, it is much better with the nightly build.

mohamedhammad commented 6 years ago

the silo is not going down anymore ?

SebastianStehle commented 6 years ago

So far it is stable, but I also made performance improvements in my app and in the dashboard. So it is not easy to say, what had the biggest impact.

mohamedhammad commented 6 years ago

what version of the dashboard you are using for now?

mohamedhammad commented 6 years ago

@sergeybykov @DarkCow this is something i am struggling with and your description is exact " These grain calls are fine T<12 hours, but stall T>12 hours. Some calls are fast, but they are interleaved with slow grain calls.

Task ... took elapsed time 0:00:00.4702115 for execution, which is longer than 00:00:00.2000000. Running on thread System.Threading.Thread Response did not arrive on time in 00:00:30 for message ..."

but sometimes it go for days and at some point this pattern starts and i got this "Response did not arrive on time in 00:00:30 for message "

i am using orleans 1.4.2,it is fairly old,do you think upgrade to 2.03 would help or i should get a night build to get the new scheduler ?

sergeybykov commented 6 years ago

@mohamedhammad I'd suggest to start with 2.0.3, and consider trying nightly if you need to.

took elapsed time 0:00:00.4702115 for execution, which is longer than 00:00:00.2000000.

This is usually a sign of a blocking call (or calls): IO, call to web services, etc. Make sure they are all awaited properly. If grain threads get blocked by such code, timeouts inevitably happen soon after.

mohamedhammad commented 6 years ago

@sergeybykov (reviewed code for blockage,it is all awaited) will try 2.0.3 i think it's better not to use .net core until it is fully tested. should i use dot net framework ?

sergeybykov commented 6 years ago

We are in the process of moving our nightly tests to .NET Core. A number of users seem to run totally fine on it. But it's your choice in the end, of course.

dotnet / orleans

Orleans gets slower over time #4505