AzureAD / microsoft-identity-web

Helps create protected web apps and web APIs with the Microsoft identity platform and Azure AD B2C

Locking behavior too strict in `OpenIdConnectCachingSecurityTokenProvider`, struggles with enough concurrent requests #3078

Open mderriey opened 1 month ago

mderriey commented 1 month ago

Microsoft.Identity.Web Library: Microsoft.Identity.Web.OWIN

Microsoft.Identity.Web version: 3.2.2

Web app: Not Applicable

Web API: Not Applicable

Token cache serialization: Not Applicable

Description

TL;DR: under enough concurrency, the locking behaviour of OpenIdConnectCachingSecurityTokenProvider creates a deadlock-like situation.

[!NOTE] I've seen this class in enough Microsoft repositories to suspect that you don't really own it per se, but this repo seems to be the go-to when it comes to OIDC auth across all stacks, and it also has the most recent commit for this file, so you're the "lucky" ones to get this report.

We've experienced cases where our app grinds to a halt when a certain concurrency threshold is reached. Snapshot analysis pointed to OpenIdConnectCachingSecurityTokenProvider, more specifically the Issuer and SecurityKeys properties.

Looking at the implementation, and reading the ReaderWriterLockSlim documentation, I think I can see how this happens.

The a-ha moment for me was reading the following section from the documentation linked above (I've left out the upgradeable mode, as the class doesn't use it):

If a ReaderWriterLockSlim does not allow recursion, a thread that tries to enter the lock can block for several reasons:

  • A thread that tries to enter read mode blocks if there are threads waiting to enter write mode or if there is a single thread in write mode.
  • A thread that tries to enter write mode blocks if there is a thread in any of the three modes.
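
The first rule is the important one here: a *waiting* writer is enough to block new readers, even when nobody currently holds write mode. Here's a small standalone illustration of just that rule (my own repro sketch, not code from the library; the sleeps are only there to keep the timing simple):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class WaitingWriterBlocksReaders
{
    static readonly ReaderWriterLockSlim RwLock =
        new ReaderWriterLockSlim(LockRecursionPolicy.NoRecursion);

    static void Main()
    {
        // Reader R1 holds the lock in read mode for a while.
        var r1 = Task.Run(() =>
        {
            RwLock.EnterReadLock();
            Thread.Sleep(2000);
            RwLock.ExitReadLock();
        });
        Thread.Sleep(200); // generous delay so R1 is inside read mode

        // Writer W1 queues up for write mode and blocks until R1 exits.
        var w1 = Task.Run(() =>
        {
            RwLock.EnterWriteLock();
            RwLock.ExitWriteLock();
        });
        Thread.Sleep(200); // generous delay so W1 is now waiting for write mode

        // A brand-new reader blocks too: nobody is *in* write mode,
        // but W1 is *waiting* for it, and that alone blocks readers.
        bool entered = RwLock.TryEnterReadLock(500);
        Console.WriteLine($"New reader entered read mode: {entered}"); // False
        if (entered) RwLock.ExitReadLock();

        Task.WaitAll(r1, w1);
    }
}
```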

Looking at the implementation, we can see that accessing Issuer or SecurityKeys will do the following (a condensed code sketch follows the list):

  1. Call RetrieveMetadata which enters "write" mode to get the configuration from the underlying ConfigurationManager instances, then
  2. Enter "read" mode to return the value of the private field.

Here's a mental model of what happens for a single request:

  1. The Issuer property is accessed:
    1. This calls RetrieveMetadata, which enters "write" mode and fetches the configuration.
    2. It then enters read mode to return the value.
  2. The SecurityKeys property is accessed:
    1. This calls RetrieveMetadata, which enters "write" mode and fetches the configuration.
    2. It then enters read mode to return the value.

Extrapolating this to many concurrent requests, here's what I think happens:

  1. The first request comes in, and its thread T1 enters "write" mode.
  2. Other requests come in, and their threads block trying to enter "write" mode because of T1.
  3. T1 exits "write" mode, and cannot enter "read" mode because other threads are waiting to enter "write" mode.
  4. Another thread T2 enters "write" mode.
  5. More requests arrive, try to enter "write" mode, but can't, and get added to the "write" mode queue.

This means that each concurrent request depends on every other concurrent request having entered and exited write mode before it can enter read mode and continue processing. The bigger the "write" mode queue gets, the higher the chance that new requests arrive and make the problem worse.

The only way out of this situation is for no new requests to arrive, so the existing concurrent requests can enter and exit "write" mode one after the other, after which they will all be able to enter "read" mode.

However, the fact that we call RetrieveMetadata twice in a single request (.Issuer and .SecurityKeys, both accessed by JwtFormat) doesn't make it easy to get out of this situation. Even more so when ConfigurationManager caches the OIDC configuration for 12 hours by default, so the vast majority of these calls will be no-ops when it comes to the actual config used to validate tokens.

Another aggravating factor is that we use EnterReadLock and EnterWriteLock, which wait for as long as needed until they can enter the relevant mode. Note that TryEnterReadLock and TryEnterWriteLock exist, with timeout parameters, and their return values indicate whether entering the mode was successful.
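
For illustration, a bounded acquisition could look like this (a minimal sketch; the helper name and the 500 ms timeout are made up, not a proposed fix by themselves):

```csharp
using System;
using System.Threading;

static class BoundedLockExample
{
    // Unlike EnterWriteLock, TryEnterWriteLock lets the caller bound the wait
    // and fall back to stale-but-valid data instead of queuing forever.
    public static bool TryRefresh(ReaderWriterLockSlim syncLock, Action refresh)
    {
        if (!syncLock.TryEnterWriteLock(TimeSpan.FromMilliseconds(500)))
        {
            return false; // couldn't get the lock in time; keep what we already have
        }

        try
        {
            refresh();
            return true;
        }
        finally
        {
            syncLock.ExitWriteLock();
        }
    }
}
```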


Here are two assumptions about Entra ID / B2C:

  1. For a given tenant, the issuer never changes (I'm aware this is metadata endpoint-dependent, but the class can only target one endpoint).
  2. Signing keys will be advertised in the metadata document long before they're used for signing so all consumers have the time to know about them.

Using these assumptions, I think we can loosen the locking behaviour (see the sketch after this list):

  1. Try to acquire the write lock without blocking; if that fails, use the information already available.
  2. Do not use read locks, and just return the information already available.
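
For discussion's sake, here's a rough sketch of what that could look like, reusing the fields from the earlier paraphrased sketch (hypothetical and untested; it also glosses over the very first call, when nothing has been fetched yet):

```csharp
// Hypothetical loosened version (reuses _synclock, _configManager, _issuer and
// _keys from the earlier sketch). It leans on the two assumptions above.
public string Issuer
{
    get
    {
        RefreshIfUncontended();
        return Volatile.Read(ref _issuer); // no read lock at all
    }
}

private void RefreshIfUncontended()
{
    // Zero timeout: if another thread is already refreshing, don't queue behind it;
    // just serve the (possibly slightly stale) values we already have.
    if (!_synclock.TryEnterWriteLock(0))
    {
        return;
    }

    try
    {
        OpenIdConnectConfiguration config =
            Task.Run(() => _configManager.GetConfigurationAsync()).Result;
        Volatile.Write(ref _issuer, config.Issuer);
        _keys = config.SigningKeys;
    }
    finally
    {
        _synclock.ExitWriteLock();
    }
}
```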

Any thoughts?

Reproduction steps

  1. Set up an OWIN project with Microsoft.Identity.Web.
  2. Fire enough concurrent requests to trigger the blocking behaviour (200 was the tipping point on my local machine); a minimal load-generation sketch follows.
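
For what it's worth, here's a minimal way to generate that kind of burst (the endpoint URL and bearer token are placeholders, not from my actual repro):

```csharp
using System;
using System.Linq;
using System.Net;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;

class BurstLoad
{
    static async Task Main()
    {
        // Only matters for .NET Framework clients, which cap concurrent connections per host.
        ServicePointManager.DefaultConnectionLimit = 200;

        using (var client = new HttpClient())
        {
            client.DefaultRequestHeaders.Authorization =
                new AuthenticationHeaderValue("Bearer", "<access token>"); // placeholder

            // ~200 concurrent requests against a protected endpoint (placeholder URL).
            var requests = Enumerable.Range(0, 200)
                .Select(_ => client.GetAsync("https://localhost:44300/api/values"));

            var responses = await Task.WhenAll(requests);
            Console.WriteLine($"Completed {responses.Length} requests");
        }
    }
}
```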

Error message

No response

Id Web logs

No response

Relevant code snippets

N/A

Regression

No response

Expected behavior

No locking that brings the app to a halt.

alex-set commented 4 weeks ago

I'm currently experiencing major problems with this exact issue. We have two web servers handling about 500 requests a minute during the peak of the day. There are times when that can burst up to several thousand requests for 30 seconds to 2 minutes. We recently switched from IIS Windows Authentication to Entra-based authentication. Prior to that switch, these bursts of web requests were easily handled. Now, bursts of requests cause both web servers to stop responding for several minutes to a half hour or longer.

On the server itself, we can see w3wp.exe start climbing in thread count, approximately one per second, until it spawns thousands of threads while CPU isn't really affected. I've pulled several crash dumps and just about every thread is stuck on Microsoft.Identity.Web.OpenIdConnectCachingSecurityTokenProvider.get_Issuer() or .get_SecurityKeys() where there's a call to TryEnterReadLock().

If there are any recommendations for how to deal with this, I would greatly appreciate it.

jmprieur commented 4 weeks ago

There was an issue in Microsoft.IdentityModel 8.0.2 that got fixed in 8.1.1. Would you have a dependency on these?

alex-set commented 4 weeks ago

All of the Microsoft.IdentityModel.x packages in my project are 8.1.2. I believe I read about the bug fix you're referring to and made sure to update, but this issue is still popping up. The crash dumps I pored over were taken today with these updates already in place.

mderriey commented 4 weeks ago

There was an issue in Microsoft.IdentityModel 8.0.2 that got fixed in 8.1.1. Would you have a dependency on these?

I think both @alex-set and I have clearly identified that the issue lies in OpenIdConnectCachingSecurityTokenProvider.

On my end, we have numerous thread analyses from the Azure portal that point to the excessive locking/synchronization.

[screenshot: thread analysis from the Azure portal]

As mentioned in my original message, it seems to be a snowball effect: once a few requests aren't processed quickly enough (and don't release the write lock quickly enough), more requests come in, which aggravates the issue to the point where the app dies.

We've had several production outages yesterday, so we're most likely going to try and implement a workaround that I'll post here once merged, with the hope that:

  1. It can be assessed by the team maintaining this repo.
  2. It can help other people affected by this issue.

alex-set commented 4 weeks ago

I must admit that I'm very confused about the OpenIdConnectCachingSecurityTokenProvider class. Maybe I'm missing something obvious. As I understand it, it just has two public properties, which are necessary to implement IIssuerSecurityKeyProvider, with no other externally facing functionality. Every time either of those property getters is called, the following steps are taken:

  1. Call RetrieveMetadata:
    a. Enter the write lock
    b. Start the _configManager.GetConfigurationAsync method on a thread pool task (via Task.Run)
    c. Wait for the task to complete
    d. Set the values of two local fields
    e. Exit the write lock
  2. Enter the read lock
  3. Get one of the two local field values that were just set
  4. Exit the read lock
  5. Return the value

Why does it need a read/write lock when it only uses the read lock immediately after exiting the write lock, to access fields that it just set? Couldn't it just remember the value that was set while it held the lock and return that? In fact, why does it even need the local fields, considering that their values are refreshed every single time the getters are called? There is no mechanism to access those fields without first refreshing them, so why not just get the requested value, return it, and leave the local fields out of it?
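
In other words, I'd expect something like this to be equivalent (a hypothetical simplification of the flow described above, reusing the _configManager field from the earlier sketch; it deliberately keeps the Task.Run call so that only the locking changes):

```csharp
// Hypothetical simplification: no fields, no read/write lock -- just fetch the
// (usually cached) configuration and return the requested value.
public string Issuer
{
    get
    {
        OpenIdConnectConfiguration config =
            Task.Run(() => _configManager.GetConfigurationAsync()).Result;
        return config.Issuer;
    }
}

public IEnumerable<SecurityKey> SecurityKeys
{
    get
    {
        OpenIdConnectConfiguration config =
            Task.Run(() => _configManager.GetConfigurationAsync()).Result;
        return config.SigningKeys;
    }
}
```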

More importantly, why does it use a lock in a property getter to synchronously wait for an async method? Even if _configManager.GetConfigurationAsync returns cached information, it's still forced to run on a different thread because it's called using Task.Run. If it takes a second to acquire the thread to invoke the method, it won't matter that the method returns without doing any async work; it's still a bottleneck. If thirty other requests pile up behind that lock during that second, each of those is processed one at a time, with more requests queuing behind each of those until all the icebergs have melted and the sun envelops the earth.

Sorry for the over-analysis. Let me know if my assertions are off base.

mderriey commented 4 weeks ago

Agreed.

Adding to what seems to be unnecessary synchronization, there is no timeout when trying to acquire these locks, so threads could be (and are) waiting forever until they can acquire the lock.

Threads can't enter the read lock if at least one thread is waiting to enter the write lock, so as more requests come in, the situation keeps getting worse.

jmprieur commented 4 weeks ago

Thank you @alex-set and @mderriey for sharing your analysis. This all makes sense to me. It's a bug that needs to be fixed.