AzureAD / microsoft-authentication-library-for-dotnet

Microsoft Authentication Library (MSAL) for .NET
https://aka.ms/msal-net
MIT License

[Bug] MSAL Suggested cache key can create performance issues due to over-serialization #1948

Closed · RamjotSingh closed this 3 years ago

RamjotSingh commented 4 years ago

Which version of MSAL are you using? 4.16.1

Platform netcore

What authentication flow has the issue?

Other? - please describe;

Is this a new or existing app? The app is in production, and I have started using MSAL instead of ADAL

Issue: MSAL today, in its suggested cache key, makes the following suggestions.

However, there are a couple of issues I see with this.

Issue with App context cache

We recently onboarded MSAL (shifting away from ADAL) and during rollout we saw this trace: `(False) MSAL 4.16.1.0 MSAL.NetCore Microsoft Windows 10.0.14393 [07/18/2020 13:52:56] Serializing token cache with 4247 items.`

This is happening because all app-context tokens are pushed into one single cache entry, and since the tokens are tenantized, the moment a significant number of tenants hit the call the token cache just keeps on growing.
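For illustration, here is a minimal sketch (not our actual code) of how an app token cache is typically wired to an external store using MSAL's standard serialization hooks; the `cacheStore` dictionary, client id, and secret are placeholders standing in for whatever distributed cache is really used. Because the whole application cache is written under a single `SuggestedCacheKey`, the blob grows with every new tenant:

```csharp
using System.Collections.Concurrent;
using Microsoft.Identity.Client;

// Illustrative in-memory stand-in for a distributed cache (e.g. Redis).
var cacheStore = new ConcurrentDictionary<string, byte[]>();

IConfidentialClientApplication app = ConfidentialClientApplicationBuilder
    .Create("client-id")               // placeholder client id
    .WithClientSecret("client-secret") // placeholder secret
    .Build();

app.AppTokenCache.SetBeforeAccess(args =>
{
    // One suggested key for the whole app cache means one blob for all tenants.
    if (cacheStore.TryGetValue(args.SuggestedCacheKey, out byte[] blob))
    {
        args.TokenCache.DeserializeMsalV3(blob);
    }
});

app.AppTokenCache.SetAfterAccess(args =>
{
    if (args.HasStateChanged)
    {
        // The serialized payload contains every tenant's app token,
        // which is how a trace like "Serializing token cache with 4247 items" happens.
        cacheStore[args.SuggestedCacheKey] = args.TokenCache.SerializeMsalV3();
    }
});
```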

Issue with User context cache

For user-context caches, since the cache key for the OnBehalfOf exchange is the original token, and lots of callers make the call using unique clients (we have thousands of clients sending us potentially hundreds of unique tokens per user), the token for the same user gets cached over and over.
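To make the shape of the problem concrete, here is a minimal OBO sketch using the standard MSAL builder API; the scope, token, client id, and secret values are placeholders. Each distinct incoming assertion produces a cache entry tied to that assertion's hash, so the same user arriving through many calling clients multiplies the entries:

```csharp
using Microsoft.Identity.Client;

// Placeholder values for illustration only.
string[] scopes = { "https://graph.microsoft.com/.default" };
string incomingAccessToken = "<token received from the calling client>";

IConfidentialClientApplication app = ConfidentialClientApplicationBuilder
    .Create("client-id")
    .WithClientSecret("client-secret")
    .Build();

// The cache entry created here is associated with the hash of
// incomingAccessToken, so a different token for the same user
// (e.g. issued to a different calling client) creates another entry.
AuthenticationResult result = await app
    .AcquireTokenOnBehalfOf(scopes, new UserAssertion(incomingAccessToken))
    .ExecuteAsync();
```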

Solution which I propose

Based on the issues above, here is what I recommend for the token caches

Additionally, TokenCacheNotificationArgs should expose additional information such as resourceId, since for services that hit a lot of external services even a tenant-level cache might be too big and they might need to go more granular.
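A sketch of the kind of partitioning this asks for, assuming hypothetical TenantId and ResourceId values coming from TokenCacheNotificationArgs (only ClientId exists on the args today); the key format itself is just an illustration, not anything MSAL produces:

```csharp
using Microsoft.Identity.Client;

// tenantId and resourceId would come from the hypothetical
// TokenCacheNotificationArgs.TenantId / .ResourceId properties
// requested in this issue; they are passed in here as plain strings.
static string BuildGranularAppCacheKey(
    TokenCacheNotificationArgs args, string tenantId, string resourceId)
{
    // Illustrative key shape: one cache entry per client/tenant/resource
    // instead of one blob for the whole application.
    return $"{args.ClientId}_{tenantId}_{resourceId}_AppTokenCache";
}
```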

bgavrilMS commented 4 years ago

CC @jmprieur @jennyf19

The main driver for this feature is Microsoft.Identity.Web

jmprieur commented 4 years ago

Thanks @RamjotSingh, this is good feedback. @bgavrilMS @jennyf19 @henrik-me let's discuss if we want to take this quickly (before many people use it, as it's a breaking change). @RamjotSingh: for the OBO case we have to take the token hash for security reasons (CA claims might have changed; the home account id is not enough), but we could also add the client ID.

bgavrilMS commented 4 years ago

@jmprieur - would it not be enough to use the home account id as the cache key? MSAL would still do the right thing and try to acquire the OBO token with the initial token hash...

RamjotSingh commented 4 years ago

@bgavrilMS @jmprieur What would be the implications of using accountId as the cache key but still using token hashes as the token re-acquire pattern? Does MSAL store, alongside the serialized data, which token was used to generate a given cache entry? If so, then I wonder whether using the token hash is actually the better approach. The test bed for any key should be that it keeps a manageable number of tokens when any of the following is true (all of these assuming both distributed-cache and in-memory cache implementations)

These are things which come to my mind right now. Current cache keys don't hold water for a lot of these.

bgavrilMS commented 4 years ago

Yes, using home_account_id as the cache key is what I am also thinking. Moreover, the original assertion for OBO will expire every hour, so any cache entries tied to it will never be used again. This is not ok for token caches that do not support Time To Live configuration (SQL?).

MSAL stores the assertion hash along with each access token. When you call OBO with assertion hash A1, we load all the access tokens and filter down to only the ones associated with A1. OBO is one of those few methods that looks in the cache by itself. You do not need to call AcquireTokenSilent; in fact, this is not even recommended.

MSAL always knows how to query the cache, even if you give it a cache with 10,000 tokens. Access tokens are keyed by home_account_id, environment, client_id, tenant_id, scopes, and assertion hash. So the problem is only about suggesting a distributed cache key that balances the load well.
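As a rough illustration of those lookup dimensions, a sketch of the key tuple described above; this is not MSAL's internal type, just a way to visualize what an access-token entry is filtered by:

```csharp
// Illustrative only: the dimensions MSAL filters access tokens by,
// per the description above. Not a real MSAL type.
record AccessTokenLookupKey(
    string HomeAccountId,
    string Environment,
    string ClientId,
    string TenantId,
    string Scopes,
    string AssertionHash); // assertion hash applies to tokens obtained via OBO
```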

A cache value under 100 KB is a more than reasonable assumption. A token cache with 1 AT, 1 RT, 1 IdT and 1 Account (with a few scopes) is about 7-8 KB (the AT entry is about 4 KB). This is based on char counts, and 1 char != 1 byte, but we encode in UTF-8 and generally 1 char = 1 byte.
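As a rough back-of-the-envelope check against the trace quoted earlier in this thread (treating the ~4 KB per access-token entry figure above as an approximation), the 4247-item app token cache works out to roughly 4247 × ~4 KB ≈ 17 MB serialized and deserialized on every access to that single cache entry.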

bgavrilMS commented 4 years ago

By the way @RamjotSingh - we are trying to add a few operational data points to MSAL, such as an enum that tells you where the token comes from (cache or AAD). Do you have suggestions for other data points?

RamjotSingh commented 4 years ago

@bgavrilMS It might be worth exposing information from AAD that is useful for request tracking. MSAL allows setting correlationId headers, but if I remember correctly EvoSTS sends back a bunch of headers that are useful when contacting AAD support.

Another data point might be the time taken to acquire a token; since one call to MSAL can actually mean multiple calls to the underlying service (config discovery etc.), it might be worth adding data points for those as well.

From a token cache perspective, TokenCacheNotificationArgs should include TenantId and ResourceId.

jmprieur commented 4 years ago

Thanks @bgavrilMS and @RamjotSingh for these explanations.

Conclusion: proposing to go ahead with:

@jennyf19 do you see any objections?

@bgavrilMS: would it make sense to also expose TenantId and ResourceId in the args? I would suggest raising another issue for this, @RamjotSingh.

jmprieur commented 4 years ago

@RamjotSingh Given our investigation, I don't think that changing the SuggestedCacheKey is the problem, at least for the user token cache. Do we want to take the app-context change? Is there value?

For the user context, I recommend keeping what we have. I raised another issue for the performance problem you discovered (and the fact that the cache accessors are called twice): https://github.com/AzureAD/microsoft-authentication-library-for-dotnet/issues/1979

RamjotSingh commented 4 years ago

@jmprieur Yes, there is still value in the app-context change, because regardless of the other cache issues, putting all tokens in a single bucket for the app context will have issues no matter what we do. High-traffic services will be storing and pulling thousands of tokens at least, which in itself is super bad, MSAL client caching issue or no.

bgavrilMS commented 4 years ago

@jmprieur - a question around the app-context key, where you propose to use the tenant ID as part of the key.

It is possible (although strongly discouraged) to start off with the common authority for client creds. So before making the request, the cache key is "clientID_common_AppTokenCache".

After the response comes back from the server, we need to save the token somewhere. Now we have the actual tenant ID, but we cannot use it because the request tenant ID ("common") was the wrong one and the keys would no longer match.

My proposal is to always use the request tenant ID, as that is consistent and should not be "common".
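A minimal sketch of the two ways of specifying the tenant for client credentials that this contrasts, using the standard builder API with placeholder ids and secret. Starting from /common leaves the request-time key ambiguous, whereas pinning the tenant up front makes the request tenant ID a stable partition key (the "{clientId}_{tenantId}_AppTokenCache" shape below is only an illustrative extension of the key quoted above):

```csharp
using Microsoft.Identity.Client;

string tenantId = "contoso-tenant-id"; // placeholder
string[] scopes = { "https://graph.microsoft.com/.default" };

// Discouraged for client creds: the request-time authority is /common,
// so a cache key derived before the call cannot include the real tenant.
IConfidentialClientApplication commonApp = ConfidentialClientApplicationBuilder
    .Create("client-id")
    .WithClientSecret("client-secret")
    .WithAuthority("https://login.microsoftonline.com/common")
    .Build();

// Preferred: the tenant is known at request time, so a per-tenant key
// such as "{clientId}_{tenantId}_AppTokenCache" stays consistent.
IConfidentialClientApplication tenantedApp = ConfidentialClientApplicationBuilder
    .Create("client-id")
    .WithClientSecret("client-secret")
    .WithAuthority($"https://login.microsoftonline.com/{tenantId}")
    .Build();

AuthenticationResult result = await tenantedApp
    .AcquireTokenForClient(scopes)
    .ExecuteAsync();
```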

bgavrilMS commented 4 years ago

As discussed with @jmprieur, we decided not to change the cache keys.

bgavrilMS commented 3 years ago

We added the tenant ID to the suggested cache keys.