A quick update from me with recent findings.
We've migrated all our components to 2.0.3 fairly easily, fixing a couple of minor issues along the way. We're planning to deploy the changes to production within a month, so my new findings below are still from 2.0.0-beta3.
Now that we've started gradually restarting the cluster every 24 hours, the issue has become less of a problem, but unfortunately it hasn't gone away completely, contrary to our expectations.
It seems like the issue happens for System Targets and Grains pretty much independently, i.e. any of the following states is possible:
Often we don't let the issue develop and restart the affected silo straight away, so it might still eventually affect both system targets and grains even if it didn't at the moment the issue started.
Also, I'm no longer sure that the amount of time a silo has been up is a prerequisite for the issue. It now seems more like some condition that simply becomes more likely the longer a silo runs. This is based on the following observation.
From ~12am to ~6am is almost dead time for our system, when customer-related traffic drops to nearly zero. At about 6am it starts ramping up gradually to a certain constant (not high) load by 8-9am. Restarts are scheduled for 2am. A couple of times we noticed that the issue was already starting on a silo at around 7-7:15am, i.e. a short time after being restarted and under insignificant load.
In the absence of customer-related load, the only things running are these two:
I doubt this information is very relevant since it's for an outdated version and is very high level, but I thought I would still drop it here in case it helps to shed some light on this vague issue.
@ilyalukyanov Among others in this thread, I'm having the same behaviour. I will try 2.0.3; if I still have the same issue, I will get a nightly build in order to get the new scheduler, which is supposed to help resolve that.
@ilyalukyanov @mohamedhammad Thank you both for investigating this! Your data for 2.0.3 and nightly CI builds is critical for us, as we still haven't been able to reproduce this issue in-house.
will definitely keep this thread updated.
Hi all. Writing as a non-expert dev, but with a fair bit of experience now with Orleans 2.0.x, I had been struggling with stalling task processing. To date most of my work has been local (on a fast laptop) and I have been interested in stress-testing performance before deploying to real hardware etc. I found, however, that under sustained stress (> 80% CPU) for minutes, the silo would stop processing requests (CPU usage drops, grain timeouts ensue; I should add, though, that the silo continues writing to the log - it doesn't hang). I varied the persistence (ADO.NET, Azure storage emulator, types of grains being persisted). There was apparently no pattern to it, and the only solution would be to restart the silo. Naturally, I have scoured my code for Wait()s etc.
Solution: change from .NET Core to Framework for the silo process. Problem gone (or at least, not witnessed since). I can create grains without interruption at sustained high stress. Is there a difference in the task scheduler between the two?
If I can help by providing more information, I am happy to do so. (Not sure where to begin here, I'm afraid...)
Using Orleans 2.0.3, with ADO.NET and SQL Server Express for persistence.
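To give an idea of the shape of the stress test, it's essentially many concurrent clients hammering persistent grains. A minimal sketch only - `IWorkerGrain` and `DoWorkAsync` are hypothetical placeholder names, `client` is assumed to be an already-connected `IClusterClient`, and the real test also exercises the storage providers mentioned above:

```csharp
// Sketch of a sustained-load driver: many workers, each repeatedly calling a grain.
// IWorkerGrain/DoWorkAsync are hypothetical; "client" is an already-connected IClusterClient.
using System;
using System.Linq;
using System.Threading.Tasks;
using Orleans;

public interface IWorkerGrain : IGrainWithGuidKey
{
    Task DoWorkAsync();
}

public static class StressDriver
{
    public static async Task RunAsync(IClusterClient client, int concurrency = 64, int callsPerWorker = 100_000)
    {
        var workers = Enumerable.Range(0, concurrency).Select(async _ =>
        {
            var grain = client.GetGrain<IWorkerGrain>(Guid.NewGuid());
            for (var n = 0; n < callsPerWorker; n++)
            {
                // Under sustained load the symptom described above is that these calls
                // eventually start timing out while the silo's CPU usage drops.
                await grain.DoWorkAsync();
            }
        });

        await Task.WhenAll(workers);
    }
}
```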
@jballin82 could you share your test project with me?
@ReubenBond @jballin82 Would you keep updating this thread with any new findings? This is very important to me, as I am upgrading to 2.0.3 and want to decide whether to use Core or Framework.
@ReubenBond I was afraid you'd ask that, but not immediately possible given some commercial constraints unfortunately. Any suggestions for a way forward? More diagnostic info perhaps?
@ReubenBond I have made an example which recreates the behaviour. Here are some screenshots of CPU activity:
While running, not yet stalled in Core: Note the erratic spikes.
Tasks stalled, in Core: CPU usage drops to near background level. Note the silo warnings about queue lengths...
Tasks starting, in Full: (it seems to ramp up steadily, fall, ramp again, and then stabilise at 100% after some minutes)
Tasks running in Full; occasional drops but rapid recovery to 100%:
I need to make the example self-contained in a VS solution for you but, once I do, how can I share this with you?
Jamie
Hugely appreciated, @jballin82. If you could email it to a-rebond@microsoft.com, then I'll take a look asap.
EDIT: accidentally closed - serves me right for commenting on mobile at 5am
@jballin82 Thanks for the important info to help demystify the issue!
Have you tried a nightly build from https://dotnet.myget.org/gallery/orleans-ci that has the new scheduler? Other people were saying that it solved the issue for them. It would be interesting to know if that's also true for your case.
I assume you were running both tests with exact same configuration: Server GC, etc. Can you confirm?
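If it helps to double-check, a tiny sketch (assuming nothing beyond the standard `GCSettings` API) for confirming which GC mode a given process is actually running with:

```csharp
// Minimal check of the effective GC configuration for a running process.
using System;
using System.Runtime;

public static class GcCheck
{
    public static void Print() =>
        Console.WriteLine($"Server GC: {GCSettings.IsServerGC}, latency mode: {GCSettings.LatencyMode}");
}
```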
@sergeybykov Yes, Sergey, indeed - same config.
I've just sent @ReubenBond the example.
I haven't tried the nightly builds - Orleans is cutting edge enough for me; I'm not quite ready to resort to such measures ;-)
Looking forward to hearing how you get on (but also hoping I haven't done something totally daft...)
I suppose looking at the sample in a profiler would really help here, but to offer some ideas and speculate a bit about the CPU consumption: the situation also resembles contention on some shared variable, perhaps a `ConcurrentDictionary` or a similar construct. I tried to check the latest merges, e.g.:
https://github.com/dotnet/orleans/pulls?utf8=%E2%9C%93&q=is%3Apr+is%3Amerged+ConcurrentDictionary
https://github.com/dotnet/corefx/pulls?utf8=%E2%9C%93&q=is%3Apr+is%3Amerged+concurrentdictionary
https://github.com/dotnet/corefx/pulls?utf8=%E2%9C%93&q=is%3Apr+is%3Amerged+iocp
And so forth. I see in the CoreFx repo there are threading fixes specifically for the SQL Server ADO.NET libraries and elsewhere too, but this problem seems to trouble others as well.
@sergeybykov When will the nightly build be published as a proper nuget package? I hope I'm understanding this right.
@jballin82 Can you make the example public?
Update time.
I'm looking at this today. Apologies for the delay, I had issues running the repro from @jballin82 and had no Internet access - fixed now (bought a cellular modem while I wait for Telstra to get their act together.)
The repro uses ASP.NET and I thought that might be somehow important. I see that ASP.NET does create a whole bunch of threads and seems to slow things down on .NET Core significantly... However, ADO.NET is not required to reproduce this and removing the ADO.NET storage provider (swap for in-mem) and running @jballin82's example repo in Release mode allows me to reproduce this more quickly - within about 2 minutes, consistently.
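For reference, the swap is just replacing the ADO.NET grain storage registration with the in-memory provider. A sketch of the relevant bit of silo configuration (assuming the repro's grains use a storage provider named "Default"; everything else stays as it is in the repro):

```csharp
// Sketch of the storage swap only; the rest of the silo configuration is unchanged.
// "Default" is assumed to be the provider name the repro's grains reference.
using Orleans.Hosting;

public static class SiloStorageConfig
{
    public static ISiloHostBuilder UseInMemoryStorage(ISiloHostBuilder builder) =>
        builder
            // ADO.NET grain storage registration removed - ADO.NET is not needed to reproduce the stall
            .AddMemoryGrainStorage("Default");
}
```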
I have not root-caused this yet. I see a very strong correlation between the failure and a (caught) `InvalidOperationException` from `BlockingCollection<T>.TryTakeFromAny(...)` (called from `WorkQueue.Get()`) or `BlockingCollection<T>.TryTake()` (called from `WorkQueue.GetSystem()`).
The 'interesting' thing is that the `InvalidOperationException` has the message "The underlying collection was modified from outside of the BlockingCollection<T>", but the underlying collection is `ConcurrentBag<T>` and it's constructed directly in the constructor call to the `BlockingCollection<T>`, i.e. `new BlockingCollection<IWorkItem>(new ConcurrentBag<IWorkItem>())`. In other words, we don't have any direct access to the underlying collection.
So I'm suspicious of changes to .NET Core's `ConcurrentBag<T>` implementation. I know a lot of performance work went into optimizing it for .NET Core. Of course, I haven't ruled out either that we're holding it wrong or that this is a red herring and the issue is in some totally separate location.
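For anyone skimming, the usage in question is the ordinary producer/consumer pattern over `BlockingCollection<T>`. A stripped-down sketch of its shape (not the actual Orleans `WorkQueue`, just an illustration of where the exception surfaces):

```csharp
// Rough shape of the pattern: two queues backed by ConcurrentBag<T>,
// consumed via TryTakeFromAny / TryTake. Not the real Orleans WorkQueue.
using System.Collections.Concurrent;

public interface IWorkItem { }

public sealed class WorkQueueSketch
{
    private readonly BlockingCollection<IWorkItem> mainQueue =
        new BlockingCollection<IWorkItem>(new ConcurrentBag<IWorkItem>());
    private readonly BlockingCollection<IWorkItem> systemQueue =
        new BlockingCollection<IWorkItem>(new ConcurrentBag<IWorkItem>());

    public IWorkItem Get(int timeoutMs)
    {
        // TryTakeFromAny is where the (caught) InvalidOperationException shows up,
        // even though nothing touches the ConcurrentBag directly.
        var queues = new[] { systemQueue, mainQueue };
        BlockingCollection<IWorkItem>.TryTakeFromAny(queues, out var item, timeoutMs);
        return item;
    }

    public IWorkItem GetSystem(int timeoutMs)
    {
        systemQueue.TryTake(out var item, timeoutMs);
        return item;
    }
}
```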
@jballin82 if you're happy for me to share a modified version of your repro, let me know.
EDIT: s/ConcurrentQueue/ConcurrentBag
xref https://github.com/dotnet/corefx/issues/30781 - I opened an issue in CoreFx for the BlockingCollection behavior.
Please do, @ReubenBond
Just in case someone wants to browse some changes: https://github.com/dotnet/corefx/pulls?utf8=%E2%9C%93&q=is%3Apr+is%3Amerged+concurrentbag, and the resource string for that particular error message is https://github.com/dotnet/corefx/blob/c4ebdbe5f5fbf53ec73b2c2e16f610387945151f/src/System.Collections.Concurrent/src/Resources/Strings.resx#L100 .
https://blogs.msdn.microsoft.com/dotnet/2017/06/07/performance-improvements-in-net-core/
"both ConcurrentQueue and ConcurrentBag were essentially completely rewritten for .NET Core 2.0, in PRs dotnet/corefx #14254 and dotnet/corefx #14126, respectively"
To be clear, though, I have not confirmed that this is the root cause - it's just strongly correlated. I will continue with the investigation tomorrow.
Repro: StallingExample.zip
Note that different silo projects run on different package versions and you might need to delete one or more to make it build.
I can confirm that replacing the underlying collection in `BlockingCollection<T>` with `ConcurrentQueue<T>` rectifies the issue in the repro. With that, I consider that corefx bug to be the root cause of this issue. I can also confirm that this bug causes items which were added to the queue to become lost from the perspective of `TryTake`, as demonstrated in this updated gist:
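Roughly, the pattern the gist exercises is concurrent `Add` and `TryTake` over a `ConcurrentBag<T>`-backed collection. The following is a reconstruction along those lines (a sketch built from the description above, not the gist itself; exact behavior depends on the runtime version):

```csharp
// Reconstruction of the kind of repro described, not the linked gist itself.
// On an affected .NET Core runtime, TryTake can throw the InvalidOperationException
// quoted above, or the taken count can fall short of the added count; with
// ConcurrentQueue<int> as the backing store the counts match.
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class LostItemsSketch
{
    public static void Main()
    {
        var collection = new BlockingCollection<int>(new ConcurrentBag<int>());
        const int producers = 4, consumers = 4, itemsPerProducer = 250_000;
        long taken = 0;

        var adders = Enumerable.Range(0, producers).Select(_ => Task.Run(() =>
        {
            for (var i = 0; i < itemsPerProducer; i++) collection.Add(i);
        })).ToArray();

        var takers = Enumerable.Range(0, consumers).Select(_ => Task.Run(() =>
        {
            while (!collection.IsCompleted)
            {
                try
                {
                    if (collection.TryTake(out _, 10)) Interlocked.Increment(ref taken);
                }
                catch (InvalidOperationException ex)
                {
                    // "The underlying collection was modified from outside of the BlockingCollection<T>"
                    Console.WriteLine(ex.Message);
                }
            }
        })).ToArray();

        Task.WaitAll(adders);
        collection.CompleteAdding();
        Task.WaitAll(takers);

        Console.WriteLine($"Added {producers * itemsPerProducer}, took {Interlocked.Read(ref taken)}");
    }
}
```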
The question now is: do we release 2.0.4 with this fix or do we aim for 2.1?
I do not expect much impact (outside of fixing this bug) when switching from `ConcurrentBag<T>` to `ConcurrentQueue<T>`, but there may be subtleties which are not yet apparent to me.
It would be great to have it in 2.0.4.
Maybe it would be better to copy the previous implementation of BlockingCollection into the Orleans source and use it for as long as the bug exists in .NET Core.
I've opened #4736 with a workaround. It uses a runtime check so that we can preserve the current behavior when running on .NET Framework.
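Roughly, the idea of the runtime check is the following (a sketch only, not the actual code in #4736): keep `ConcurrentBag<T>` on .NET Framework, where it is unaffected, and use `ConcurrentQueue<T>` on .NET Core until the corefx fix ships.

```csharp
// Sketch of the runtime-check idea; not the actual change in #4736.
using System;
using System.Collections.Concurrent;
using System.Runtime.InteropServices;

public static class WorkQueueBacking
{
    private static readonly bool IsNetFramework =
        RuntimeInformation.FrameworkDescription.StartsWith(".NET Framework", StringComparison.Ordinal);

    public static BlockingCollection<T> Create<T>() =>
        IsNetFramework
            ? new BlockingCollection<T>(new ConcurrentBag<T>())   // unaffected on .NET Framework
            : new BlockingCollection<T>(new ConcurrentQueue<T>()); // workaround on .NET Core
}
```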
Another update: I needed to make a few changes to the build scripts so that we can create the right 2.0.4 packages. I'll open a PR for those changes tomorrow and try to start the release process so that we can attempt to release this week or early next week.
Hey @ReubenBond have you guys managed to decide on 2.0.4? Even a rough ETA would help. If more development needs to be done and you need hands, I'd be happy to help.
We are hoping to release it this week or early next. I misunderstood my teammates' schedules with the last message. Apologies for that.
@ReubenBond any update on this? We really need this fix on production.
We have published v2.0.4 to nuget.org. It includes the workaround for the ConcurrentBag issue. Please try it and chime in to let us know if it resolves the issue for you.
A future .NET Core 2.1 servicing release will include the fix for ConcurrentBag, but currently the workaround is needed.
EDIT: apologies for the silence, @pfrendo
For my part, I can confirm that version 2.0.4 solves a huge performance issue I had. It was appearing in less than one minute under load. Now everything works fine. Thanks!
Ditto on the performance improvements. Our system was slowing down by about +5 seconds per day. Now it's running smoothly.
Is anyone experiencing memory pressure issues? We started seeing out of memory exceptions after this update (after ~2 days of uptime). Could be my code.
Closing this because the issue appears to have been successfully resolved by the workaround in 2.0.4 😁
Thank you, everyone, so much for your patience and assistance in tracking this down and getting it resolved!
@lwansbrough please feel free to open an issue regarding memory pressure issues
@ReubenBond I can confirm it's working well: 5-6+ days with the same deployment and it's still going strong. No memory leaks or the exceptions that @lwansbrough mentioned.
Working for us too! Thanks a lot for fixing!
Hi,
It is a little bit vague, but I have the impression that Orleans gets slower over time.
My specs are:
I have a very simple grain, that just returns the snapshot (my own state system):
https://github.com/Squidex/squidex/blob/master/src/Squidex.Domain.Apps.Entities/Schemas/SchemaGrain.cs#L305
I use my own serializer with JSON.NET, and my benchmarks show that the serialization usually takes 2-3 ms. I also tested it with states 100x larger than expected and the performance is great.
I added some profiling to the client side and I have observed that the call takes 4-5 ms right after I deploy, and several days later up to 200 ms or even more (the max I have seen is 10 seconds).
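The client-side profiling is nothing fancy, roughly along these lines (a sketch; `ISchemaGrain`/`GetStateAsync` are simplified stand-ins for the SchemaGrain linked above):

```csharp
// Simplified sketch of the client-side measurement; ISchemaGrain is a stand-in
// for the real grain interface linked above.
using System;
using System.Diagnostics;
using System.Threading.Tasks;
using Orleans;

public interface ISchemaGrain : IGrainWithStringKey
{
    Task<object> GetStateAsync();
}

public static class CallTiming
{
    public static async Task MeasureAsync(IClusterClient client, string schemaId)
    {
        var grain = client.GetGrain<ISchemaGrain>(schemaId);

        var sw = Stopwatch.StartNew();
        await grain.GetStateAsync();
        sw.Stop();

        // Right after deployment this is around 4-5 ms; after several days of
        // uptime it grows to 200 ms or more.
        Console.WriteLine($"Grain call took {sw.ElapsedMilliseconds} ms");
    }
}
```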
I also checked my MongoDB logs where I save all slow queries and there is nothing related.
I am a little bit lost.