AzureAD / azure-activedirectory-identitymodel-extensions-for-dotnet

IdentityModel extensions for .Net
MIT License
1.05k stars 397 forks source link

Issue with _currentConfiguration being null after the first call #2050

Open dmitrymal opened 1 year ago

dmitrymal commented 1 year ago

Summary: We ran into an issue with our Azure Function that has a very high throughput. With high volume functions a lot of new function instances start and go. Sometime on the startup of a new instance, the very first call to the GetConfigurationAsync fails (e.g. OKTA server was unavailable). This causes all the subsequent calls to fail for the next 5 minutes.

Details: When this line var configuration = await _configRetriever.GetConfigurationAsync(MetadataAddress, _docRetriever, CancellationToken.None).ConfigureAwait(false); fails, the exception is caught and _syncAfter is set to now + 5 mins. However, _currentConfiguration is still null.

On the subsequent calls the condition if (_syncAfter <= now) is true and the code falls through to this line https://github.com/AzureAD/azure-activedirectory-identitymodel-extensions-for-dotnet/blob/614642a18baf304d67a4a6e711074efb71262c35/src/Microsoft.IdentityModel.Protocols/Configuration/ConfigurationManager.cs#L209 causing the another exception.

This keeps happing for 5 mins until _syncAfter becomes > than now In our case this behavior causes 10 - 50K exceptions in 5 min timeframe.

Proposed fix: add || _currentConfiguration == null condition to the following if statement: https://github.com/AzureAD/azure-activedirectory-identitymodel-extensions-for-dotnet/blob/614642a18baf304d67a4a6e711074efb71262c35/src/Microsoft.IdentityModel.Protocols/Configuration/ConfigurationManager.cs#L170 Something like this: if (_currentConfiguration == null || _syncAfter <= now)

Question: Before this is fixed, is there any workaround that you could suggest?

Disclaimer: All this analysis is done just by reviewing the code, so I might have overlooked something. Let me know if this sounds correct to you.

brentschmaltz commented 1 year ago

@dmitrymal we are modifying this so that until configuration is obtained, we will retry at a 30 second interval 4 times. Does this seem reasonable?

dmitrymal commented 1 year ago

@dmitrymal we are modifying this so that until configuration is obtained, we will retry at a 30 second interval 4 times. Does this seem reasonable?

It is your call, but personally, I would have left it up to the client to decide what retry strategy to use. '30 second interval 4 times' seems just a random decision and may not work for all clients. For instance, in our case with the rate of 10K requests per minute, we might end up with 40K queued requests waiting for a retry (which is probably not good).

After all, the issue is not about not being able to get the configuration or retry logic. The issue is in the code which does not check whether or not the _currentConfiguration is null.

In my opinion, the code should check if currentConfiguration == null and if it is, try to get it. The code should not simply throw an exception leaving the client without any option to fix the issue. This logical bug should be fixed regardless of retry logic. The easiest and quickest way to fix it is to have it something like this: if (_currentConfiguration == null || _syncAfter <= now)

I could create a PR with the proposed changes if that would help.

By the way, I see you labeled it as an 'Enhancement'. I would consider this as a Bug (not an Enhancement). Thoughts?

brentschmaltz commented 1 year ago

@dmitrymal A request has arrived and we need to validate the token. Metadata may be needed since TokenValidationParameters.ConfigurationManager is not null the user must have set it. If metadata cannot be obtained, we may not be able to validate the token, but we will try.

It seems correct that if ConfigurationManger never obtained metadata, we should try again next time through. A simple change would be to not set _syncAfter in the exception block here.

I need to circle back with the team to understand why continuous attempts without a throttle were not acceptable.

brentschmaltz commented 1 year ago

@dmitrymal we discussed this as a team and landed on exponential backoff.

dmitrymal commented 1 year ago

@dmitrymal we discussed this as a team and landed on exponential backoff.

The retry logic might help (in some cases). However, the client code should have control of the retry logic. Otherwise, it might lead to more issues. At the very least the client should be able to set/define retry parameters.

Anyway, all this (the retry discussion) is a separate concern and has nothing to do with the reported issue in the code. If _currentConfiguration ends up to be null even after retries, the client would still run into the same issue.

Any thoughts on fixing that?

brentschmaltz commented 1 year ago

@dmitrymal we have implemented a exponential increase in this PR https://github.com/AzureAD/azure-activedirectory-identitymodel-extensions-for-dotnet/pull/2052

We should leave this open to add the ability for users to set their own algorithm as you are correct in that we can assume to understand all the scenarios.