A quick update from me with recent findings.
We've migrated all our components to 2.0.3 fairly easily, fixing a couple of minor issues along the way. We're planning to deploy the changes to production within a month, so my new findings below are still from 2.0.0-beta3.
Now that we've started gradually restarting the cluster every 24 hours, the issue has become less of a problem, but unfortunately it hasn't gone away completely, contrary to our expectations.
It seems like the issue happens for System Targets and Grains pretty much independently, i.e. any of the following states is possible:
Often we don't let the issue develop and restart the affected silo straight away, so it might still eventually affect both system targets and grains even if it didn't at the moment the issue started.
Also, I'm no longer sure that the amount of time a silo has been up is a prerequisite for the issue. It now seems more like some condition that simply becomes more likely the longer a silo runs. This is based on the following observation.
From ~12am to ~6am is almost dead time for our system, when customer-related traffic drops to nearly zero. At about 6am it starts ramping up gradually to a certain constant (not high) load by 8-9am. Restarts are scheduled for 2am. A couple of times we noticed that the issue was already starting on a silo at around 7-7:15am, i.e. a short time after being restarted and under insignificant load.
In the absence of customer-related load, the only things running are these two:
I doubt this information is very relevant since it's for an outdated version and is very high level, but I thought I would still drop it here in case it helps to shed some light on this vague issue.
@ilyalukyanov Among others in this thread, I'm having the same behaviour. I will try 2.0.3; if I still have the same issue, I will get a nightly build in order to get the new scheduler, which is supposed to help resolve that.
@ilyalukyanov @mohamedhammad Thank you both for investigating this! Your data for 2.0.3 and nightly CI builds is critical for us, as we still haven't been able to reproduce this issue in-house.
will definitely keep this thread updated.
Hi all. Writing as a non-expert dev, but with a fair bit of experience now with Orleans 2.0.x, I had been struggling with stalling task processing. To date most of my work has been local (on a fast laptop) and I have been interested in stress-testing performance before deploying to real hardware etc. I found, however, that under sustained stress (> 80% CPU) for minutes, the silo would stop processing requests (CPU usage drops, grain timeouts ensue; I should add, though, that the silo continues writing to the log - it doesn't hang). I varied the persistence (ADO.NET, Azure storage emulator, types of grains being persisted). There was apparently no pattern to it, and the only solution would be to restart the silo. Naturally, I have scoured my code for Wait()s etc.
Solution: change from .NET Core to Framework for the silo process. Problem gone (or at least, not witnessed since). I can create grains without interruption at sustained high stress. Is there a difference in the task scheduler between the two?
If I can help by providing more information, I am happy to do so. (Not sure where to begin here, I'm afraid...)
Using Orleans 2.0.3, with ADO.NET and SQL Server Express for persistence.
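To give an idea of the shape of the stress test, it's essentially many concurrent clients hammering persistent grains. A minimal sketch only - `IWorkerGrain` and `DoWorkAsync` are hypothetical placeholder names, `client` is assumed to be an already-connected `IClusterClient`, and the real test also exercises the storage providers mentioned above:

```csharp
// Sketch of a sustained-load driver: many workers, each repeatedly calling a grain.
// IWorkerGrain/DoWorkAsync are hypothetical; "client" is an already-connected IClusterClient.
using System;
using System.Linq;
using System.Threading.Tasks;
using Orleans;

public interface IWorkerGrain : IGrainWithGuidKey
{
    Task DoWorkAsync();
}

public static class StressDriver
{
    public static async Task RunAsync(IClusterClient client, int concurrency = 64, int callsPerWorker = 100_000)
    {
        var workers = Enumerable.Range(0, concurrency).Select(async _ =>
        {
            var grain = client.GetGrain<IWorkerGrain>(Guid.NewGuid());
            for (var n = 0; n < callsPerWorker; n++)
            {
                // Under sustained load the symptom described above is that these calls
                // eventually start timing out while the silo's CPU usage drops.
                await grain.DoWorkAsync();
            }
        });

        await Task.WhenAll(workers);
    }
}
```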
@jballin82 could you share your test project with me?
@ReubenBond @jballin82 Would you keep updating this thread with any new findings? This is very important to me, as I am upgrading to 2.0.3 and want to decide whether to use Core or Framework.
@ReubenBond I was afraid you'd ask that, but not immediately possible given some commercial constraints unfortunately. Any suggestions for a way forward? More diagnostic info perhaps?
@ReubenBond I have made an example which recreates the behaviour. Here are some screenshots of CPU activity:
While running, not yet stalled in Core: Note the erratic spikes.
Tasks stalled, in Core: CPU usage drops to near background level. Note the silo warnings about queue lengths...
Tasks starting, in Full: (it seems to ramp up steadily, fall, ramp again, and then stabilise at 100% after some minutes)
Tasks running in Full; occasional drops but rapid recovery to 100%:
I need to make the example self-contained in a VS solution for you but, once I do, how can I share this with you?
Jamie
Hugely appreciated, @jballin82. If you could email it to a-rebond@microsoft.com, then I'll take a look asap.
EDIT: accidentally closed - serves me right for commenting on mobile at 5am
@jballin82 Thanks for the important info to help demystify the issue!
Have you tried a nightly build from https://dotnet.myget.org/gallery/orleans-ci that has the new scheduler? Other people were saying that it solved the issue for them. It would be interesting to know if that's also true for your case.
I assume you were running both tests with exact same configuration: Server GC, etc. Can you confirm?
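If it helps to double-check, a tiny sketch (assuming nothing beyond the standard `GCSettings` API) for confirming which GC mode a given process is actually running with:

```csharp
// Minimal check of the effective GC configuration for a running process.
using System;
using System.Runtime;

public static class GcCheck
{
    public static void Print() =>
        Console.WriteLine($"Server GC: {GCSettings.IsServerGC}, latency mode: {GCSettings.LatencyMode}");
}
```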
@sergeybykov Yes, Sergey, indeed - same config.
I've just sent @ReubenBond the example.
I haven't tried the nightly builds - Orleans is cutting edge enough for me; I'm not quite ready to resort to such measures ;-)
Looking forward to hearing how you get on (but also hoping I haven't done something totally daft...)
I suppose looking at the sample in a profiler would really help here, but to offer some ideas and speculate a bit about the CPU consumption: the situation also resembles contention on some shared variable, perhaps a `ConcurrentDictionary` or a similar construct. I tried to check the latest merges, e.g.:
https://github.com/dotnet/orleans/pulls?utf8=%E2%9C%93&q=is%3Apr+is%3Amerged+ConcurrentDictionary
https://github.com/dotnet/corefx/pulls?utf8=%E2%9C%93&q=is%3Apr+is%3Amerged+concurrentdictionary
https://github.com/dotnet/corefx/pulls?utf8=%E2%9C%93&q=is%3Apr+is%3Amerged+iocp
And so forth. I see in the CoreFx repo there are threading fixes specifically for the SQL Server ADO.NET libraries and elsewhere too, but this problem seems to trouble others as well.
@sergeybykov When will the nightly build be published as a proper nuget package? I hope I'm understanding this right.
@jballin82 Can you make the example public?
Update time.
I'm looking at this today. Apologies for the delay, I had issues running the repro from @jballin82 and had no Internet access - fixed now (bought a cellular modem while I wait for Telstra to get their act together.)
The repro uses ASP.NET and I thought that might be somehow important. I see that ASP.NET does create a whole bunch of threads and seems to slow things down on .NET Core significantly... However, ADO.NET is not required to reproduce this and removing the ADO.NET storage provider (swap for in-mem) and running @jballin82's example repo in Release mode allows me to reproduce this more quickly - within about 2 minutes, consistently.
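For reference, the swap is just replacing the ADO.NET grain storage registration with the in-memory provider. A sketch of the relevant bit of silo configuration (assuming the repro's grains use a storage provider named "Default"; everything else stays as it is in the repro):

```csharp
// Sketch of the storage swap only; the rest of the silo configuration is unchanged.
// "Default" is assumed to be the provider name the repro's grains reference.
using Orleans.Hosting;

public static class SiloStorageConfig
{
    public static ISiloHostBuilder UseInMemoryStorage(ISiloHostBuilder builder) =>
        builder
            // ADO.NET grain storage registration removed - ADO.NET is not needed to reproduce the stall
            .AddMemoryGrainStorage("Default");
}
```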
I have not root-caused this yet. I see a very strong correlation between the failure and a (caught) `InvalidOperationException` from `BlockingCollection<T>.TryTakeFromAny(...)` (called from `WorkQueue.Get()`) or `BlockingCollection<T>.TryTake()` (called from `WorkQueue.GetSystem()`).
The 'interesting' thing is that the `InvalidOperationException` has the message "The underlying collection was modified from outside of the BlockingCollection<T>", but the underlying collection is `ConcurrentBag<T>` and it's constructed directly in the constructor call to the `BlockingCollection<T>`, i.e. `new BlockingCollection<IWorkItem>(new ConcurrentBag<IWorkItem>())`. In other words, we don't have any direct access to the underlying collection.
So I'm suspicious of changes to .NET Core's `ConcurrentBag<T>` implementation. I know a lot of performance work went into optimizing it for .NET Core. Of course, I haven't ruled out either that we're holding it wrong or that this is a red herring and the issue is in some totally separate location.
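For anyone skimming, the usage in question is the ordinary producer/consumer pattern over `BlockingCollection<T>`. A stripped-down sketch of its shape (not the actual Orleans `WorkQueue`, just an illustration of where the exception surfaces):

```csharp
// Rough shape of the pattern: two queues backed by ConcurrentBag<T>,
// consumed via TryTakeFromAny / TryTake. Not the real Orleans WorkQueue.
using System.Collections.Concurrent;

public interface IWorkItem { }

public sealed class WorkQueueSketch
{
    private readonly BlockingCollection<IWorkItem> mainQueue =
        new BlockingCollection<IWorkItem>(new ConcurrentBag<IWorkItem>());
    private readonly BlockingCollection<IWorkItem> systemQueue =
        new BlockingCollection<IWorkItem>(new ConcurrentBag<IWorkItem>());

    public IWorkItem Get(int timeoutMs)
    {
        // TryTakeFromAny is where the (caught) InvalidOperationException shows up,
        // even though nothing touches the ConcurrentBag directly.
        var queues = new[] { systemQueue, mainQueue };
        BlockingCollection<IWorkItem>.TryTakeFromAny(queues, out var item, timeoutMs);
        return item;
    }

    public IWorkItem GetSystem(int timeoutMs)
    {
        systemQueue.TryTake(out var item, timeoutMs);
        return item;
    }
}
```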
@jballin82 if you're happy for me to share a modified version of your repro, let me know.
EDIT: s/ConcurrentQueue/ConcurrentBag
xref https://github.com/dotnet/corefx/issues/30781 - I opened an issue in CoreFx for the BlockingCollection behavior.
Please do, @ReubenBond
Just in case someone wants to browse some changes: https://github.com/dotnet/corefx/pulls?utf8=%E2%9C%93&q=is%3Apr+is%3Amerged+concurrentbag, and the resource string for that particular error message is https://github.com/dotnet/corefx/blob/c4ebdbe5f5fbf53ec73b2c2e16f610387945151f/src/System.Collections.Concurrent/src/Resources/Strings.resx#L100 .
https://blogs.msdn.microsoft.com/dotnet/2017/06/07/performance-improvements-in-net-core/
"both ConcurrentQueue and ConcurrentBag were essentially completely rewritten for .NET Core 2.0, in PRs dotnet/corefx #14254 and dotnet/corefx #14126, respectively"
To be clear, though, I have not confirmed that this is the root cause - it's just strongly correlated. I will continue with the investigation tomorrow.
Repro: StallingExample.zip
Note that different silo projects run on different package versions and you might need to delete one or more to make it build.
I can confirm that replacing the underlying collection in `BlockingCollection<T>` with `ConcurrentQueue<T>` rectifies the issue in the repro. With that, I consider that corefx bug to be the root cause of this issue. I can also confirm that this bug causes items which were added to the queue to become lost from the perspective of `TryTake`, as demonstrated in this updated gist:
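Roughly, the pattern the gist exercises is concurrent `Add` and `TryTake` over a `ConcurrentBag<T>`-backed collection. The following is a reconstruction along those lines (a sketch built from the description above, not the gist itself; exact behavior depends on the runtime version):

```csharp
// Reconstruction of the kind of repro described, not the linked gist itself.
// On an affected .NET Core runtime, TryTake can throw the InvalidOperationException
// quoted above, or the taken count can fall short of the added count; with
// ConcurrentQueue<int> as the backing store the counts match.
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class LostItemsSketch
{
    public static void Main()
    {
        var collection = new BlockingCollection<int>(new ConcurrentBag<int>());
        const int producers = 4, consumers = 4, itemsPerProducer = 250_000;
        long taken = 0;

        var adders = Enumerable.Range(0, producers).Select(_ => Task.Run(() =>
        {
            for (var i = 0; i < itemsPerProducer; i++) collection.Add(i);
        })).ToArray();

        var takers = Enumerable.Range(0, consumers).Select(_ => Task.Run(() =>
        {
            while (!collection.IsCompleted)
            {
                try
                {
                    if (collection.TryTake(out _, 10)) Interlocked.Increment(ref taken);
                }
                catch (InvalidOperationException ex)
                {
                    // "The underlying collection was modified from outside of the BlockingCollection<T>"
                    Console.WriteLine(ex.Message);
                }
            }
        })).ToArray();

        Task.WaitAll(adders);
        collection.CompleteAdding();
        Task.WaitAll(takers);

        Console.WriteLine($"Added {producers * itemsPerProducer}, took {Interlocked.Read(ref taken)}");
    }
}
```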
The question now is: do we release 2.0.4 with this fix or do we aim for 2.1?
I do not expect much impact (outside of fixing this bug) when switching from `ConcurrentBag<T>` to `ConcurrentQueue<T>`, but there may be subtleties which are not yet apparent to me.
It would be great to have it in 2.0.4.
Maybe it would be better to copy the previous implementation of BlockingCollection into the Orleans source and use it for as long as the bug exists in .NET Core.
I've opened #4736 with a workaround. It uses a runtime check so that we can preserve the current behavior when running on .NET Framework.
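Roughly, the idea of the runtime check is the following (a sketch only, not the actual code in #4736): keep `ConcurrentBag<T>` on .NET Framework, where it is unaffected, and use `ConcurrentQueue<T>` on .NET Core until the corefx fix ships.

```csharp
// Sketch of the runtime-check idea; not the actual change in #4736.
using System;
using System.Collections.Concurrent;
using System.Runtime.InteropServices;

public static class WorkQueueBacking
{
    private static readonly bool IsNetFramework =
        RuntimeInformation.FrameworkDescription.StartsWith(".NET Framework", StringComparison.Ordinal);

    public static BlockingCollection<T> Create<T>() =>
        IsNetFramework
            ? new BlockingCollection<T>(new ConcurrentBag<T>())   // unaffected on .NET Framework
            : new BlockingCollection<T>(new ConcurrentQueue<T>()); // workaround on .NET Core
}
```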
Another update: I needed to make a few changes to the build scripts so that we can create the right 2.0.4 packages. I'll open a PR for those changes tomorrow and try to start the release process so that we can attempt to release this week or early next week.
Hey @ReubenBond have you guys managed to decide on 2.0.4? Even a rough ETA would help. If more development needs to be done and you need hands, I'd be happy to help.
We are hoping to release it this week or early next. I misunderstood my teammates' schedules with the last message. Apologies for that.
@ReubenBond any update on this? We really need this fix on production.
We have published v2.0.4 to nuget.org. It includes the workaround for the ConcurrentBag issue. Please try it and chime in to let us know if it resolves the issue for you.
A future .NET Core 2.1 servicing release will include the fix for ConcurrentBag, but currently the workaround is needed.
EDIT: apologies for the silence, @pfrendo
For my part, I can confirm that version 2.0.4 solves a huge performance issue I had. It was appearing in less than one minute under load. Now everything works fine. Thanks!
Ditto on the performance improvements. Our system was slowing down by about +5 seconds per day. Now it's running smoothly.
Is anyone experiencing memory pressure issues? We started seeing out of memory exceptions after this update (after ~2 days of uptime). Could be my code.
Closing this because the issue appears to have been successfully resolved by the workaround in 2.0.4 😁
Thank you, everyone, so much for your patience and assistance in tracking this down and getting it resolved!
@lwansbrough please feel free to open an issue regarding memory pressure issues
@ReubenBond I can confirm it's working well: 5-6+ days with the same deployment and it's still going strong. No memory leaks or the exceptions that @lwansbrough mentioned.
Working for us too! Thanks a lot for fixing!
Hi,
It is a little bit vague, but I have the impression that Orleans gets slower over time.
My specs are:
I have a very simple grain, that just returns the snapshot (my own state system):
https://github.com/Squidex/squidex/blob/master/src/Squidex.Domain.Apps.Entities/Schemas/SchemaGrain.cs#L305
I use my own serializer with JSON.NET, and my benchmarks show that the serialization usually takes 2-3 ms. I also tested it with states 100x larger than expected and the performance is great.
I added some profiling to the client side and I have observed that the call takes 4-5 ms right after I deploy, and several days later up to 200 ms or even more (the max I have seen is 10 seconds).
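The client-side profiling is nothing fancy, roughly along these lines (a sketch; `ISchemaGrain`/`GetStateAsync` are simplified stand-ins for the SchemaGrain linked above):

```csharp
// Simplified sketch of the client-side measurement; ISchemaGrain is a stand-in
// for the real grain interface linked above.
using System;
using System.Diagnostics;
using System.Threading.Tasks;
using Orleans;

public interface ISchemaGrain : IGrainWithStringKey
{
    Task<object> GetStateAsync();
}

public static class CallTiming
{
    public static async Task MeasureAsync(IClusterClient client, string schemaId)
    {
        var grain = client.GetGrain<ISchemaGrain>(schemaId);

        var sw = Stopwatch.StartNew();
        await grain.GetStateAsync();
        sw.Stop();

        // Right after deployment this is around 4-5 ms; after several days of
        // uptime it grows to 200 ms or more.
        Console.WriteLine($"Grain call took {sw.ElapsedMilliseconds} ms");
    }
}
```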
I also checked my MongoDB logs where I save all slow queries and there is nothing related.
I am a little bit lost.