Azure / azure-amqp

AMQP C# library

High object retention via timers in finalization queue #233

Closed: aarondandy closed this issue 1 year ago

aarondandy commented 1 year ago

While investigating OutOfMemoryExceptions in our production environment, I noticed a large number of CancellationTokenSource+CallbackNode chains being retained. Sometimes these chains are tens of thousands of instances long and retain gigabytes of memory in total. When I look at common retention paths in these memory dumps, the majority appear to originate from a TimerHolder in the finalization queue, with a retention path involving ReceivingAmqpLink+ReceiveAsyncResult. Our systems, especially those hosted in Azure Container Apps, are unable to work through this finalization queue and reclaim memory in time. I'm having some trouble following this codebase, so I don't know what the root cause could be, but I saw a commit from 2020 on the development master branch involving timer disposal which might be relevant. If that commit fixes a known Dispose issue in the current release, would it be possible to get it onto the current active hotfix branch as well?
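To illustrate the retention mechanism being described (a hypothetical sketch, not code from this repo): an undisposed Timer that ends up on the finalization queue is still reachable until its finalizer runs, and so is everything hanging off its callback state.

```csharp
using System;
using System.Threading;

// Hypothetical illustration only. A Timer that is never disposed is cleaned up by
// its TimerHolder finalizer; while that holder sits in the finalization queue it
// is still reachable, and so is the state object the timer was created with.
var bigState = new byte[100 * 1024 * 1024];   // stands in for a CallbackNode chain
Timer? timer = new Timer(_ => { }, bigState, Timeout.Infinite, Timeout.Infinite);

// Dropping the reference without calling timer.Dispose() leaves cleanup to the
// finalizer thread; until it gets around to it, bigState stays rooted.
timer = null;
GC.Collect();   // the timer's holder is now queued for finalization, still rooting bigState
```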

Relevant commit: https://github.com/Azure/azure-amqp/commit/9fad6969e8330ae698775e8c7147d29c02229562

Retention path: (screenshot from the memory dump; not captured in this text)

jsquire commented 1 year ago

//fyi: @xinchen10, @JoshLove-msft, @danielmarbach

danielmarbach commented 1 year ago

Probably from here?

https://github.com/Azure/azure-amqp/blob/c6242a5dad1a1638dfee53282e08c8440913e8f7/src/Fx/Singleton.cs#L55

A CancellationTokenSource created with a timeout needs to be disposed after use because it is associated with a timer. Happy to send a PR
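For context, a minimal sketch of the pattern being described (illustrative only, not the actual Singleton.cs code): a CancellationTokenSource constructed with a timeout allocates a Timer internally, so skipping Dispose leaves that timer to the finalizer.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Illustrative helper; method and parameter names are assumptions.
static async Task<T> WithTimeoutAsync<T>(Func<CancellationToken, Task<T>> operation, TimeSpan timeout)
{
    // Constructing the source with a timeout creates an internal Timer.
    // Without the `using`, that timer lingers until finalization.
    using var cts = new CancellationTokenSource(timeout);
    return await operation(cts.Token).ConfigureAwait(false);
}
```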

danielmarbach commented 1 year ago

https://github.com/Azure/azure-amqp/pull/234 would fix at least that usage.

aarondandy commented 1 year ago

Good find @danielmarbach. I'm not sure if that is the cause of my specific retention issue, but it might be 🤷‍♂️. It all depends on whether the ASB client uses that method, and I'm just too overburdened at the moment to investigate. It definitely looks like an issue though!

Regarding contributing a fix for the explicit timer dispose: I just don't have enough confidence that I wouldn't do harm by touching those code paths. I can't wrap my head around the object lifetimes in that area and wouldn't know how to properly test or validate any of it.

xinchen10 commented 1 year ago

@aarondandy how many ReceivingAmqpLink+ReceiveAsyncResult objects did you see? Also 10,000+? The receiving link always disposes the cancellation token registration when the async result is completed. If you saw roughly the same number of ReceiveAsyncResult objects as CallbackNode instances, it would mean that the AMQP receiving link does not dispose the registration in some cases. Otherwise it could be leaks from other places where registrations are created, especially if you use the same cancellation token source for all the calls.
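To make the last point concrete, a small sketch (not library code) of how undisposed registrations on a long-lived or shared source accumulate into exactly the kind of CallbackNode chain described above:

```csharp
using System.Threading;

// Every Register call adds a CallbackNode to the token's CancellationTokenSource.
var sharedCts = new CancellationTokenSource();

for (int i = 0; i < 10_000; i++)
{
    CancellationTokenRegistration registration = sharedCts.Token.Register(() => { });
    // registration.Dispose() here would detach the node; skipping it keeps all
    // 10,000 nodes attached to sharedCts until it is cancelled or collected.
}
```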

aarondandy commented 1 year ago

The crazy number of CallbackNodes could be some EF, Polly, or other shenanigans we are up to. The core issue for us in this context, though, is just the TimerHolder that is hanging on to all of them. I don't have the dumps anymore, but there were maybe only 100-1000 TimerHolder instances in the finalization queue. Also, possibly related, I think there are some framework-level optimizations around CallbackNodes which may be contributing to the chains growing so large, but 🤷‍♂️. Even if there were fewer CallbackNodes, I don't think our finalization queues would be processed fast enough to rely on finalization over explicit disposal.

Also, it's worth noting that I added a background thread to force the finalization queue to be processed, and that has greatly reduced the symptoms of this issue for us.
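The comment doesn't include the workaround code; a minimal sketch of one way such a background "finalizer pump" could look (the interval and naming are assumptions):

```csharp
using System;
using System.Threading;

var finalizerPump = new Thread(() =>
{
    while (true)
    {
        Thread.Sleep(TimeSpan.FromSeconds(30));
        GC.Collect();                     // move dead finalizable objects onto the finalizer's work queue
        GC.WaitForPendingFinalizers();    // block until the finalizer thread has drained it
    }
})
{
    IsBackground = true,
    Name = "finalizer-pump"
};
finalizerPump.Start();
```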

xinchen10 commented 1 year ago

The timers are disposed in the receiving link. They show up in the finalization queue because the timer type has a finalizer method, but the finalizer should be skipped once Dispose has been called. Without a dump it is going to be difficult to find the root cause, unless we can reproduce it by simulating what your code does.

aarondandy commented 1 year ago

> They show up in the finalization queue because the timer type has a finalizer method

Could be. I'm making the assumption that it uses GC.SuppressFinalize as part of the dispose pattern and so won't end up in the finalization queue when explicitly disposed, but 🤷‍♂️.
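For reference, the standard dispose pattern being assumed here (a generic sketch, not this library's exact types): an explicitly disposed instance calls GC.SuppressFinalize and never reaches the finalization queue; only instances that were never disposed end up there.

```csharp
using System;
using System.Threading;

public sealed class TimedResource : IDisposable
{
    private readonly Timer _timer = new Timer(_ => { }, null, Timeout.Infinite, Timeout.Infinite);
    private bool _disposed;

    public void Dispose()
    {
        if (_disposed) return;
        _disposed = true;
        _timer.Dispose();
        GC.SuppressFinalize(this);   // skip the finalizer for this instance
    }

    ~TimedResource()
    {
        // Safety net: runs only if Dispose was never called.
        _timer.Dispose();
    }
}
```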

> Without a dump

I'm not comfortable sharing my dumps, but I can undo my finalization hack, let it run for a few days, and we can hop on a call and pick through the pile together if you want. I haven't been able to make a minimal reproduction of this, as I don't know of a good way to prevent the finalization queue from being processed promptly on my local machine. I also just don't have time to dig into the internals of the library, sadly.

> how do you use the amqp library?

The Azure Service Bus library.

> what are the timeout values when you call ReceiveAsync?

I'm not really familiar with how that works, but if this is related to the older Close issue I would guess they use 60 seconds.

> do most calls return messages or nothing?

I think we have at least one message about every second, so it wouldn't be waiting too long. This is not a problem in our test environment, where most ReceiveAsync calls would not retrieve a message, but that also means it has plenty of time to clean up the finalization queue.

> do you create a new cancellation token?

EF, Polly, and some of my code may create some linked CancellationTokenSources, but all of them should lead back to the CancellationToken we get from ProcessMessageAsync in Azure Service Bus.
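To sketch how such app-level tokens chain back to the Service Bus token (an illustration under assumptions, not code from the issue; `DoWorkAsync` is hypothetical): a linked CancellationTokenSource registers a callback on each token it links, so until it is disposed that CallbackNode keeps the linked source, and whatever it references, reachable from the parent token.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

async Task HandleMessageAsync(CancellationToken serviceBusToken)   // token from ProcessMessageAsync
{
    using var linked = CancellationTokenSource.CreateLinkedTokenSource(serviceBusToken);
    linked.CancelAfter(TimeSpan.FromSeconds(30));

    await DoWorkAsync(linked.Token);
    // Disposing `linked` removes its registration from serviceBusToken; forgetting
    // to dispose it leaves that node attached for the parent token's lifetime.
}

Task DoWorkAsync(CancellationToken ct) => Task.Delay(100, ct);   // stand-in for EF/Polly/app work
```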

aarondandy commented 1 year ago

I have a new memory dump with over 700 MB retained by a Timer created by ReceivingAmqpLink+ReceiveAsyncResult. I can see that the TimerQueueTimer._canceled field is true. I can also see that the CancellationTokenSource+CallbackNode CallbackState held by the cancellationTokenRegistration no longer points to the ReceiveAsyncResult callback state, but to a completely different object instance's state. I think this is enough to indicate that both timer.Dispose() and cancellationTokenRegistration.Dispose() have been called here as expected.

So even though the root is the Timer created by ReceiveAsyncResult, the reason this CallbackNode chain is so large may be unrelated to the ASB or AMQP libraries after all. Looking at the remaining chain of CancellationTokenSource+Registrations, I wonder if this is actually due to Microsoft.Data.SqlClient failing to dispose of a CancellationTokenRegistration, so I guess I'll go poke around over there.

I'll hold onto this dump for a few days in case anybody out there is curious enough to dig through it with me.

danielmarbach commented 1 year ago

@aarondandy you might be affected by

https://github.com/dotnet/SqlClient/issues/1810

aarondandy commented 1 year ago

> @aarondandy you might be affected by
>
> dotnet/SqlClient#1810

Oh, I think this is exactly it! I feel like I came across this a month ago but wasn't smart enough to piece it together at the time. I'm not sure if this will work for me, but while I wait for that fix I wonder if wrapping the cancellationToken I get from ASB/AMQP in a new CancellationTokenSource would be enough to decouple the linked lists held by the ReceiveAsyncResult timer and the one used by SqlCommand, and let the GC take care of things quicker.

aarondandy commented 1 year ago

Looking at the code behind CancellationTokenSource, I think I'm just doomed and going to need to try their preview release or go through constant service restarts 😬. I'll give the preview a try for a bit and if that resolves the issue then I think it definitely proves my issues are unrelated to ASB or AMQP.

danielmarbach commented 1 year ago

Damn sorry to hear. I wish I could be of more help, but I'm just a community sidekick occasionally sneaking into some repos making some noise.

aarondandy commented 1 year ago

Microsoft.Data.SqlClient 5.1.0-preview2.22314.2 brought the memory usage down from gigabytes to megabytes 🎉. Thanks for the tip @danielmarbach. I'm closing this as I'm pretty sure it's unrelated to this library now, @xinchen10.