dotnet / orleans

Cloud Native application framework for .NET
https://docs.microsoft.com/dotnet/orleans
MIT License

Managed memory leak of ActivationId #6929

Closed talweiss1982 closed 3 years ago

talweiss1982 commented 3 years ago

Hi guys, I have been investigating a memory dump of one of our services that has grown to 13 GB of memory. We are using Orleans 3.3.0 running on the net472 framework.

The bottom of the dump shows (windbg; columns are method table, object count, total size in bytes, class name):
00007ff909f4aef8 67509 450187368 System.Collections.Generic.HashSet`1+Slot[[Orleans.Runtime.ActivationId, Orleans.Core.Abstractions]][]
00007ff9095b6f20 21165530 507972720 Orleans.Runtime.ActivationId
00007ff9099e2618 21237239 509693736 System.WeakReference`1[[Orleans.Runtime.ActivationId, Orleans.Core.Abstractions]]
00007ff9099e2e38 21237238 1019387424 System.Collections.Concurrent.ConcurrentDictionary`2+Node[[Orleans.Runtime.UniqueKey, Orleans.Core.Abstractions],[System.WeakReference`1[[Orleans.Runtime.ActivationId, Orleans.Core.Abstractions]], mscorlib]]
00007ff9096539b0 21761379 1218637224 Orleans.Runtime.UniqueKey
000001c02523d1b0 13356465 7171352622 Free

Let's ignore the fragmentation (although it's probably related). We have 1.2 GB of UniqueKey, 1 GB of dictionary nodes, 0.5 GB of WeakReference and 0.5 GB of ActivationId.
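As a rough cross-check of the figures above, dividing each total size by its object count gives the average size of one instance. The sketch below uses only the numbers from the dump; the class and method names are made up for illustration.

```csharp
using System;

public static class DumpMath
{
    // Total bytes divided by object count gives the average size of one instance.
    public static long BytesPerObject(long totalBytes, long count)
    {
        return totalBytes / count;
    }

    public static void PrintReport()
    {
        long activationId = BytesPerObject(507972720, 21165530);    // 24
        long weakReference = BytesPerObject(509693736, 21237239);   // 24
        long dictionaryNode = BytesPerObject(1019387424, 21237238); // 48
        long uniqueKey = BytesPerObject(1218637224, 21761379);      // 56

        Console.WriteLine("ActivationId:   " + activationId);
        Console.WriteLine("WeakReference:  " + weakReference);
        Console.WriteLine("Dictionary node: " + dictionaryNode);
        Console.WriteLine("UniqueKey:      " + uniqueKey);

        // Sum of the four objects that exist once per interned id: 152 bytes.
        Console.WriteLine("Per interned id: " +
            (activationId + weakReference + dictionaryNode + uniqueKey));
    }
}
```

Under these numbers each retained id costs roughly 152 bytes across the four object types, before counting the HashSet slot that roots it.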

I have checked the GCRoot of some of those UniqueKey instances and here are the roots:

Root #1 ...
-> 000001c125a9ef40 Orleans.Runtime.ActivationDirectory
-> 000001c125a9ef80 System.Collections.Concurrent.ConcurrentDictionary`2[[Orleans.Runtime.ActivationId, Orleans.Core.Abstractions],[Orleans.Runtime.ActivationData, Orleans.Runtime]]
-> 000001c125bf80b8 System.Collections.Concurrent.ConcurrentDictionary`2+Tables[[Orleans.Runtime.ActivationId, Orleans.Core.Abstractions],[Orleans.Runtime.ActivationData, Orleans.Runtime]]
-> 000001c265be8078 System.Collections.Concurrent.ConcurrentDictionary`2+Node[[Orleans.Runtime.ActivationId, Orleans.Core.Abstractions],[Orleans.Runtime.ActivationData, Orleans.Runtime]][]
-> 000001c125bf75e0 System.Collections.Concurrent.ConcurrentDictionary`2+Node[[Orleans.Runtime.ActivationId, Orleans.Core.Abstractions],[Orleans.Runtime.ActivationData, Orleans.Runtime]]
-> 000001c0e5af57e8 Orleans.Runtime.ActivationData
-> 000001c0e5af5900 System.Collections.Generic.HashSet`1[[Orleans.Runtime.ActivationId, Orleans.Core.Abstractions]]
-> 000001c22cf46690 System.Collections.Generic.HashSet`1+Slot[[Orleans.Runtime.ActivationId, Orleans.Core.Abstractions]][]
-> 000001c02635d5f8 Orleans.Runtime.ActivationId
-> 000001c02635d5c0 Orleans.Runtime.UniqueKey

Root #2 000000a64a57ca70 00007ff9091ec0f3 Orleans.Interner`2[[System.__Canon, mscorlib],[System.__Canon, mscorlib]].FindOrCreate(System.__Canon, System.Func`2<System.__Canon,System.__Canon>) rsp+30: 000000a64a57caa0
-> 000001c125a935d8 Orleans.Interner`2[[Orleans.Runtime.UniqueKey, Orleans.Core.Abstractions],[Orleans.Runtime.ActivationId, Orleans.Core.Abstractions]]
-> 000001c125a93600 System.Collections.Concurrent.ConcurrentDictionary`2[[Orleans.Runtime.UniqueKey, Orleans.Core.Abstractions],[System.WeakReference`1[[Orleans.Runtime.ActivationId, Orleans.Core.Abstractions]], mscorlib]]
-> 000001c37ce19b08 System.Collections.Concurrent.ConcurrentDictionary`2+Tables[[Orleans.Runtime.UniqueKey, Orleans.Core.Abstractions],[System.WeakReference`1[[Orleans.Runtime.ActivationId, Orleans.Core.Abstractions]], mscorlib]]
-> 000001c000001020 System.Collections.Concurrent.ConcurrentDictionary`2+Node[[Orleans.Runtime.UniqueKey, Orleans.Core.Abstractions],[System.WeakReference`1[[Orleans.Runtime.ActivationId, Orleans.Core.Abstractions]], mscorlib]][]
-> 000001c306ecc9c0 System.Collections.Concurrent.ConcurrentDictionary`2+Node[[Orleans.Runtime.UniqueKey, Orleans.Core.Abstractions],[System.WeakReference`1[[Orleans.Runtime.ActivationId, Orleans.Core.Abstractions]], mscorlib]]
-> 000001c02635d5c0 Orleans.Runtime.UniqueKey

So root #2 originates from https://github.com/dotnet/orleans/blob/v3.3.0/src/Orleans.Core.Abstractions/IDs/ActivationId.cs#L16

I must say I think that using the Interner class to hold unique keys defeats the purpose of the class (to act as a GC-safe object pool), but let's not dwell on that. If the ActivationIds were not rooted, the interner should not have ballooned (in the dump it had 30 million dictionary slots).
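To illustrate why rooted values keep such a pool growing, here is a minimal sketch of a weak-reference interner. It is not the Orleans Interner implementation; `WeakInterner`, `SlotCount` and `RemoveDeadEntries` are hypothetical names, and the cleanup strategy is an assumption made for the example.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical, simplified interner; NOT the Orleans Interner<K, T> source.
public sealed class WeakInterner<TKey, TValue> where TValue : class
{
    private readonly ConcurrentDictionary<TKey, WeakReference<TValue>> cache =
        new ConcurrentDictionary<TKey, WeakReference<TValue>>();

    // Number of dictionary slots, whether or not their targets are still alive.
    public int SlotCount => cache.Count;

    public TValue FindOrCreate(TKey key, Func<TKey, TValue> factory)
    {
        while (true)
        {
            WeakReference<TValue> weak;
            if (cache.TryGetValue(key, out weak))
            {
                TValue existing;
                if (weak.TryGetTarget(out existing)) return existing;

                // The target was collected: try to swap in a fresh value.
                TValue replacement = factory(key);
                if (cache.TryUpdate(key, new WeakReference<TValue>(replacement), weak))
                    return replacement;
                continue; // lost a race with another thread, retry
            }

            TValue created = factory(key);
            if (cache.TryAdd(key, new WeakReference<TValue>(created))) return created;
        }
    }

    // Removes slots whose targets were collected. A slot whose target is still
    // strongly referenced elsewhere (for example from a HashSet) is never removed.
    public int RemoveDeadEntries()
    {
        int removed = 0;
        foreach (var pair in cache)
        {
            TValue ignored;
            WeakReference<TValue> dropped;
            if (!pair.Value.TryGetTarget(out ignored) && cache.TryRemove(pair.Key, out dropped))
                removed++;
        }
        return removed;
    }
}
```

In a design like this, the key, the dictionary node and the WeakReference all stay alive for as long as something else holds the value, which is consistent with the four object types dominating the dump.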

At first I couldn't locate the HashSet holding the activation IDs, as it seems you guys deleted the 3.3.0 branch and I was looking at master at a commit with the message "update changelog for 3.3.0".

I decompiled the assembly and found the renegade HashSet to be https://github.com/dotnet/orleans/blob/v3.3.0/src/Orleans.Runtime/Catalog/ActivationData.cs#L115

Notice that on https://github.com/dotnet/orleans/blob/v3.3.0/src/Orleans.Runtime/Catalog/ActivationData.cs#L305 you add the ActivationId of whoever sent this message to the above HashSet.

However, when a request is marked as handled, that ActivationId is never removed, see: https://github.com/dotnet/orleans/blob/v3.3.0/src/Orleans.Runtime/Catalog/ActivationData.cs#L319

This causes our memory usage to balloon until the process eventually crashes.

I think this happens when you have grains that are never deactivated because they are very active, but that are invoked by many different grains with shorter lifespans, each of which has a unique ActivationId. In our case there are 15 silos with 65k grains on each silo, so it is very easy to accumulate a large number of unique IDs in a relatively short time (within days).

The fix here is a one-liner. At https://github.com/dotnet/orleans/blob/v3.3.0/src/Orleans.Runtime/Catalog/ActivationData.cs#L320 add: RunningRequestsSenders.Remove(message.SendingActivation);
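The pattern can be reproduced in isolation with a small sketch. This is not the Orleans ActivationData source: `MessageSketch`, `ActivationSketch` and `LeakDemo` are hypothetical stand-ins, a Guid stands in for ActivationId, and only the add-on-start and remove-on-completion behaviour described above is modelled.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a message; only the sender's id matters here.
public sealed class MessageSketch
{
    public MessageSketch(Guid sendingActivation) { SendingActivation = sendingActivation; }
    public Guid SendingActivation { get; }
}

// Hypothetical stand-in for a long-lived activation that tracks request senders.
public sealed class ActivationSketch
{
    private readonly HashSet<Guid> runningRequestsSenders = new HashSet<Guid>();
    private readonly bool removeOnCompletion;

    public ActivationSketch(bool removeOnCompletion)
    {
        this.removeOnCompletion = removeOnCompletion;
    }

    public int TrackedSenders => runningRequestsSenders.Count;

    // Called when a request starts running: remember who sent it.
    public void RecordRunning(MessageSketch message)
    {
        runningRequestsSenders.Add(message.SendingActivation);
    }

    // Called when the request is done. Without the Remove call the set only grows.
    public void ResetRunning(MessageSketch message)
    {
        if (removeOnCompletion)
            runningRequestsSenders.Remove(message.SendingActivation);
    }
}

public static class LeakDemo
{
    // Simulates many short-lived callers, each with a unique id, calling one
    // long-lived activation, and returns how many sender ids remain tracked.
    public static int TrackedSendersAfter(int shortLivedCallers, bool removeOnCompletion)
    {
        var target = new ActivationSketch(removeOnCompletion);
        for (int i = 0; i < shortLivedCallers; i++)
        {
            var message = new MessageSketch(Guid.NewGuid());
            target.RecordRunning(message);
            target.ResetRunning(message);
        }
        return target.TrackedSenders;
    }
}
```

In this sketch, `LeakDemo.TrackedSendersAfter(1000, false)` leaves 1000 ids in the set, while passing `true` leaves none. Whether removing on completion is safe when one sender has several requests in flight is not covered here.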

I would have submitted a PR, but I'm not sure how to add this to a tag, and master doesn't have this code. We really need this fix in Orleans 3.3.x; our production is suffering from this.

ReubenBond commented 3 years ago

Hi Tal, I accidentally deleted your Gitter thread while trying to expand it - my apologies.

Thank you for the investigation! I've opened #6930 with the one-liner you suggested. We delete branches after making each release, since we tag each release. The 3.3.0 code is available under the v3.3.0 tag here: https://github.com/dotnet/orleans/tree/v3.3.0

Regarding the logistics of getting a fix into production, are you able to upgrade to 3.4.x?

talweiss1982 commented 3 years ago

Okay, thanks Reuben.

talweiss1982 commented 3 years ago

I saw your response regarding a private build. I think that it is possible for us, as we already do it with the OrleansDashboard (since it stopped supporting netstandard2.0). If you could send me the package version, I could try it out.

ReubenBond commented 3 years ago

I meant you cherry-picking the 3.4.1 commit on top of 3.3.0 and building it yourself, @talweiss1982. Is there a blocker that would prevent you from upgrading to 3.4.1? We plan to release that today or tomorrow.

talweiss1982 commented 3 years ago

If it is planned for release this week then I'll wait. I thought it might take a while until it is released.

ReubenBond commented 3 years ago

@talweiss1982 v3.4.1 is up on nuget.org now. Release notes here: https://github.com/dotnet/orleans/releases/tag/v3.4.1

I'll close this issue for now