mderriey opened this issue 1 month ago
I'm currently experiencing major problems with this exact issue. We have two web servers handling about 500 requests a minute during the peak of the day. There are times when that can burst up to several thousand requests for 30 seconds to 2 minutes. We recently switched from IIS Windows Authentication to Entra-based authentication. Prior to that switch, these bursts of web requests were easily handled. Now, bursts of requests cause both web servers to stop responding for several minutes to a half hour or longer.
On the server itself, we can see the w3wp.exe thread count start climbing, approximately one thread per second, until it has spawned thousands of threads while CPU usage is barely affected. I've pulled several crash dumps, and just about every thread is stuck in Microsoft.Identity.Web.OpenIdConnectCachingSecurityTokenProvider.get_Issuer() or .get_SecurityKeys(), where there's a call to TryEnterReadLock().
If there are any recommendations for how to deal with this, I would greatly appreciate it.
There was an issue in Microsoft.IdentityModel 8.0.2 that was fixed in 8.1.1. Do you have a dependency on these packages?
All of the Microsoft.IdentityModel.x packages in my project are 8.1.2. I believe I read about the bug fix you're referring to and made sure to update, but this issue is still popping up. The crash dumps I pored over were taken today with these updates already in place.
I think both @alex-set and I have clearly identified that the issue lies in OpenIdConnectCachingSecurityTokenProvider.
On my end, we have numerous thread analyses from the Azure portal that point to excessive locking/synchronization.
As mentioned in my original message, it seems to be a snowball effect: once a few requests hit the situation where they're not processed quickly enough (and don't release the write lock quickly enough), more requests come in, and that aggravates the issue to the point where the app dies.
We had several production outages yesterday, so we're most likely going to try to implement a workaround, which I'll post here once it's merged, with the hope that:
I must admit that I'm very confused about the OpenIdConnectCachingSecurityTokenProvider class. Maybe I'm missing something obvious. As I understand it, it just has two public properties, which are necessary to implement IIssuerSecurityKeyProvider, with no other externally facing functionality. Every time either of those property getters is called, the following steps are taken:

1. RetrieveMetadata is called: it enters the write lock, synchronously waits for _configManager.GetConfigurationAsync (wrapped in Task.Run), stores the issuer and signing keys in private fields, and exits the write lock.
2. The getter then enters the read lock, reads the field that was just set, exits the read lock, and returns the value.
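For reference, a simplified sketch of the pattern being described (paraphrased and condensed from this discussion, not the verbatim library source):

```csharp
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.IdentityModel.Protocols;
using Microsoft.IdentityModel.Protocols.OpenIdConnect;
using Microsoft.IdentityModel.Tokens;

// Simplified sketch of the pattern being described; names and details are
// paraphrased from the discussion, not copied from the library source.
public class CachingSecurityTokenProviderSketch
{
    private readonly ReaderWriterLockSlim _syncLock = new ReaderWriterLockSlim();
    private readonly ConfigurationManager<OpenIdConnectConfiguration> _configManager;
    private string _issuer;
    private IEnumerable<SecurityKey> _keys;

    public CachingSecurityTokenProviderSketch(
        ConfigurationManager<OpenIdConnectConfiguration> configManager)
        => _configManager = configManager;

    public string Issuer
    {
        get
        {
            RetrieveMetadata();          // enters and exits the write lock
            _syncLock.EnterReadLock();   // then takes the read lock to read the field it just set
            try { return _issuer; }
            finally { _syncLock.ExitReadLock(); }
        }
    }

    public IEnumerable<SecurityKey> SecurityKeys
    {
        get
        {
            RetrieveMetadata();          // same dance again for the second property
            _syncLock.EnterReadLock();
            try { return _keys; }
            finally { _syncLock.ExitReadLock(); }
        }
    }

    private void RetrieveMetadata()
    {
        _syncLock.EnterWriteLock();      // every caller serializes here, even on cache hits
        try
        {
            // Sync-over-async: the calling thread blocks while a thread-pool
            // thread runs GetConfigurationAsync, even when the result is cached.
            OpenIdConnectConfiguration config = Task.Run(
                () => _configManager.GetConfigurationAsync(CancellationToken.None)).Result;
            _issuer = config.Issuer;
            _keys = config.SigningKeys;
        }
        finally
        {
            _syncLock.ExitWriteLock();
        }
    }
}
```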
Why does it need a reader/writer lock when it only uses the read lock immediately after exiting the write lock, to access fields that it just set? Couldn't it just remember the value that was set while it held the write lock and return that? In fact, why does it even need the private fields, considering that their values are refreshed every single time the getters are called? There is no way to access those fields without first refreshing them, so why not just fetch the requested value, return it, and leave the fields out of it entirely?
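Concretely, one reading of that suggestion looks like the sketch below (illustrative only): keep the blocking fetch, but return the value obtained under the write lock directly, with no separate read lock and no fields.

```csharp
using System.Threading;
using System.Threading.Tasks;
using Microsoft.IdentityModel.Protocols;
using Microsoft.IdentityModel.Protocols.OpenIdConnect;

// Sketch of the simplification suggested above: the value fetched while the
// write lock is held is returned directly instead of being stored in fields
// and re-read under a read lock.
public class SingleLockProviderSketch
{
    private readonly ReaderWriterLockSlim _syncLock = new ReaderWriterLockSlim();
    private readonly ConfigurationManager<OpenIdConnectConfiguration> _configManager;

    public SingleLockProviderSketch(ConfigurationManager<OpenIdConnectConfiguration> configManager)
        => _configManager = configManager;

    public string Issuer
    {
        get
        {
            _syncLock.EnterWriteLock();
            try
            {
                var config = Task.Run(
                    () => _configManager.GetConfigurationAsync(CancellationToken.None)).Result;
                return config.Issuer;   // no read lock, no field: just return what was fetched
            }
            finally
            {
                _syncLock.ExitWriteLock();
            }
        }
    }
}
```

This removes the read-after-write dance, although the write lock itself still serializes every caller.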
More importantly, why does it use a lock in a property getter to synchronously wait for an async method? Even if _configManager.GetConfigurationAsync returns cached information, it's still forced onto a different thread because it's invoked via Task.Run. If it takes a second to get a thread-pool thread to run the method, it doesn't matter that the method returns without doing any async work; it's still a bottleneck. If thirty other requests pile up behind that lock during that second, each of them is processed one at a time, with more requests queuing behind each of those until all the icebergs have melted and the sun envelops the earth.
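To make that bottleneck concrete, here is a small, self-contained console sketch (a hypothetical repro, not code from this issue) in which concurrent callers serialize behind a write lock that wraps sync-over-async work:

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class WriteLockBottleneckDemo
{
    static readonly ReaderWriterLockSlim SyncLock = new ReaderWriterLockSlim();

    // Mimics the shape of RetrieveMetadata: take the write lock, then block on
    // a Task.Run, even though the "work" stands in for a cheap, cached lookup.
    static string GetValue()
    {
        SyncLock.EnterWriteLock();
        try
        {
            return Task.Run(async () =>
            {
                await Task.Delay(10);   // placeholder for a cached GetConfigurationAsync call
                return "cached-config";
            }).Result;
        }
        finally
        {
            SyncLock.ExitWriteLock();
        }
    }

    static void Main()
    {
        var stopwatch = Stopwatch.StartNew();

        // 100 concurrent "requests": each one queues behind the single write lock,
        // so the batch takes at least ~100 x 10 ms (and typically longer once the
        // thread pool starts starving) instead of ~10 ms.
        Task.WaitAll(Enumerable.Range(0, 100)
            .Select(_ => Task.Run(GetValue))
            .ToArray());

        Console.WriteLine($"100 callers took {stopwatch.ElapsedMilliseconds} ms");
    }
}
```

Removing the lock, or letting callers share a single cached result instead of each paying the Task.Run hop, lets the callers overlap instead of running one at a time.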
Sorry for the over-analysis. Let me know if my assertions are off base.
Agreed.
Adding to what seems to be unnecessary synchronization, there is no timeout when trying to acquire these locks, so threads can (and do) wait indefinitely until they acquire the lock.
Threads can't enter the read lock while any thread is waiting to enter the write lock, so as more requests come in, the situation gets worse.
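For what it's worth, ReaderWriterLockSlim does expose timeout-based overloads. A minimal sketch of bounded acquisition (illustrative only, not a proposed fix):

```csharp
using System;
using System.Threading;

class BoundedLockSketch
{
    static readonly ReaderWriterLockSlim SyncLock = new ReaderWriterLockSlim();

    static string ReadWithTimeout(Func<string> readValue)
    {
        // TryEnterReadLock returns false instead of blocking forever
        // if the lock can't be acquired within the timeout.
        if (!SyncLock.TryEnterReadLock(TimeSpan.FromSeconds(5)))
        {
            throw new TimeoutException("Timed out waiting for the read lock.");
        }

        try
        {
            return readValue();
        }
        finally
        {
            SyncLock.ExitReadLock();
        }
    }
}
```

Whether failing after a timeout is the right behaviour is a separate question, but it at least bounds how long a request can sit waiting.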
Thank you @alex-set and @mderriey for sharing your analysis. This all makes sense to me. It's a bug that needs to be fixed.
Microsoft.Identity.Web Library
Microsoft.Identity.Web.OWIN
Microsoft.Identity.Web version
3.2.2
Web app
Not Applicable
Web API
Not Applicable
Token cache serialization
Not Applicable
Description
TL;DR: under enough concurrency, the locking behaviour of OpenIdConnectCachingSecurityTokenProvider creates a deadlock-like situation.

We've experienced cases where our app grinds to a halt when a certain concurrency threshold is reached. Snapshot analysis pointed to OpenIdConnectCachingSecurityTokenProvider, more specifically the Issuer and SecurityKeys properties.

Looking at the implementation, and reading the ReaderWriterLockSlim documentation, I think I can see how this happens. The a-ha moment for me was the section of that documentation describing when threads can enter read mode and write mode (I've left upgradeable mode aside, as this class doesn't use it).
Looking at the implementation, we can see that accessing Issuer or SecurityKeys will:

1. call RetrieveMetadata, which enters "write" mode to get the configuration from the underlying ConfigurationManager instance, then
2. enter "read" mode to read the field that was just set and return it.

Here's a mental model of what happens for a single request:

1. The Issuer property is accessed: it calls RetrieveMetadata, which enters "write" mode and fetches the configuration, then enters "read" mode to return the value.
2. The SecurityKeys property is accessed: it calls RetrieveMetadata, which again enters "write" mode and fetches the configuration, then enters "read" mode to return the value.

Extrapolating this to many concurrent requests, I think what happens is that concurrent requests become dependent on all other concurrent requests having entered and then exited "write" mode before they can enter "read" mode and continue processing the request. The bigger the "write" mode queue gets, the higher the chance of new requests arriving and making the problem worse.
The only way out of this situation is for no new requests to arrive, so the existing concurrent requests can enter and exit "write" mode one after the other, after which they will all be able to enter "read" mode.
However, the fact that we call RetrieveMetadata twice in a single request (.Issuer and .SecurityKeys, both accessed by JwtFormat) doesn't make it easy to get out of this situation. Even more so when ConfigurationManager caches the OIDC configuration for 12 hours by default, so the vast majority of these calls will be no-ops when it comes to the actual configuration used to validate tokens.
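For context, here is a small example (the metadata address is a placeholder) showing how ConfigurationManager is typically constructed and how repeated calls within the refresh window return the cached document:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.IdentityModel.Protocols;
using Microsoft.IdentityModel.Protocols.OpenIdConnect;

class ConfigurationManagerCachingExample
{
    static async Task Main()
    {
        // Placeholder metadata address; substitute the real tenant's endpoint.
        var configManager = new ConfigurationManager<OpenIdConnectConfiguration>(
            "https://login.microsoftonline.com/common/v2.0/.well-known/openid-configuration",
            new OpenIdConnectConfigurationRetriever());

        // First call downloads the metadata document.
        OpenIdConnectConfiguration first =
            await configManager.GetConfigurationAsync(CancellationToken.None);

        // Subsequent calls within the automatic refresh interval (12 hours by
        // default) return the cached configuration without a network call.
        OpenIdConnectConfiguration second =
            await configManager.GetConfigurationAsync(CancellationToken.None);

        Console.WriteLine(first.Issuer);
        Console.WriteLine(ReferenceEquals(first, second)); // typically True while cached
    }
}
```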
Another aggravating factor is that we use EnterReadLock and EnterWriteLock, which wait for as long as needed until they can enter the relevant mode. Note that TryEnterReadLock and TryEnterWriteLock exist, with timeout parameters, and their return values indicate whether entering the mode was successful.

Here are assumptions about Entra ID / B2C:
Using these assumptions, I think we can loosen the locking behaviour.
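As a rough illustration only (a minimal sketch assuming ConfigurationManager can be relied on for caching and for its own thread safety, and not necessarily the change that ends up being made), the provider could drop the lock and the cached fields entirely and go to the configuration manager on every access:

```csharp
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.IdentityModel.Protocols;
using Microsoft.IdentityModel.Protocols.OpenIdConnect;
using Microsoft.IdentityModel.Tokens;

// Sketch of a lock-free provider: every access goes straight to
// ConfigurationManager, which caches the metadata and refreshes it on its
// own schedule, so no ReaderWriterLockSlim is needed here.
public class LockFreeSecurityTokenProviderSketch
{
    private readonly ConfigurationManager<OpenIdConnectConfiguration> _configManager;

    public LockFreeSecurityTokenProviderSketch(
        ConfigurationManager<OpenIdConnectConfiguration> configManager)
        => _configManager = configManager;

    public string Issuer =>
        GetConfiguration().Issuer;

    public IEnumerable<SecurityKey> SecurityKeys =>
        GetConfiguration().SigningKeys;

    private OpenIdConnectConfiguration GetConfiguration() =>
        // The synchronous property contract still forces a blocking wait;
        // Task.Run is kept to avoid blocking on a captured SynchronizationContext,
        // but there is no longer any shared lock around it.
        Task.Run(() => _configManager.GetConfigurationAsync(CancellationToken.None)).Result;
}
```

The blocking wait remains because the provider only exposes synchronous properties, but without the shared lock the cached-configuration case no longer serializes every request.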
Any thoughts?
Reproduction steps
Error message
No response
Id Web logs
No response
Relevant code snippets
Regression
No response
Expected behavior
No locking that brings the app to a halt.