[Bug] cache.ExportReplace is called with different partition key in 1.0

weikanglim commented 1 year ago

Which version of MSAL Go are you using? Microsoft Authentication Library for Go 1.0.0-preview

Where is the issue?

Public client
- [ ] Device code flow
- [ ] Username/Password (ROPC grant)
- [x] Authorization code flow
Confidential client
- [ ] Authorization code flow
- [ ] Client credentials:
  - [ ] client secret
  - [ ] client certificate
Token cache serialization
- [ ] In-memory cache
Other (please describe)
- [x] cache.ExportReplace

Is this a new or an existing app? a. The app is in production and I have upgraded to a new version of Microsoft Authentication Library for Go.

What version of Go are you using (go version)?

$ go version
go version go1.20.4 darwin/arm64

What operating system and processor architecture are you using (go env)? Observed reproducing on all OSes and amd64 + arm64.

Repro

In production, it affects cache.ExportReplace implementation in Azure/azure-dev: https://github.com/Azure/azure-dev/blob/main/cli/azd/pkg/auth/cache.go

Minimal repro:

git clone https://github.com/weikanglim/msal-export-replace.git
Skip this step to test observe MSAL 1.0 behavior on the default branch main. Run git checkout old to observe MSAL 0.81 behavior.
go run main.go
Sign in using the browser
Observe the log outputs

Expected behavior

In old version, 0.81 of msal-go, we observe that calling AcquireTokenInteractive (line 1-2), followed by AcquireTokenSilent results in the first Replace() call (line 5) from base.Client.AcquireTokenSilent with key = '':

1 Replace(): called from github.com/AzureAD/microsoft-authentication-library-for-go/apps/internal/base.Client.AuthResultFromToken
2 Export(): called from github.com/AzureAD/microsoft-authentication-library-for-go/apps/internal/base.Client.AuthResultFromToken
3 Replace(): called from github.com/AzureAD/microsoft-authentication-library-for-go/apps/internal/base.Client.AllAccounts
4 Export(): called from github.com/AzureAD/microsoft-authentication-library-for-go/apps/internal/base.Client.AllAccounts
5 Replace(): called from github.com/AzureAD/microsoft-authentication-library-for-go/apps/internal/base.Client.AcquireTokenSilent
6 Export(): called from github.com/AzureAD/microsoft-authentication-library-for-go/apps/internal/base.Client.AcquireTokenSilent

Actual behavior In new version, 1.0 of msal-go, we observe that calling AcquireTokenInteractive (line 1-2), followed by AcquireTokenSilent results in the first Replace() call (line 4) from base.Client.AcquireTokenSilent with key = '<homeAccountId>':

1 Replace(): called from github.com/AzureAD/microsoft-authentication-library-for-go/apps/internal/base.Client.AuthResultFromToken
2 Export(): called from github.com/AzureAD/microsoft-authentication-library-for-go/apps/internal/base.Client.AuthResultFromToken
3 Replace(): called from github.com/AzureAD/microsoft-authentication-library-for-go/apps/internal/base.Client.AllAccounts
4 Replace(43bb7435-f8c4-4342-9788-fd15e454ea12.72f988bf-86f1-41af-91ab-2d7cd011db47): called from github.com/AzureAD/microsoft-authentication-library-for-go/apps/internal/base.Client.AcquireTokenSilent

Possible solution

The production app, azd, currently stores the MSAL cache data on disk based on the key. Our current logic looks like:

publicCient.AcquireTokenInteractive - Acquire token. Export() will be called and the data will be stored on disk
publicClient.AcquireTokenSilent - Replace() should be called and read from the stored data on disk

However, because Replace(<homeId>) is being called instead of Replace(), we are seeing customer reports of our login flow failing immediately because AcquireTokenSilent is reading from a stale cache, with the following HTTP error observed:

2023/05/23 14:19:28 azd_credential.go:34: POST https://login.microsoftonline.com/organizations/oauth2/v2.0/token
--------------------------------------------------------------------------------
RESPONSE 400: 400 Bad Request
--------------------------------------------------------------------------------
{
  "error": "invalid_grant",
  "error_description": "AADSTS50173: The provided grant has expired due to it being revoked, a fresh auth token is needed. The user might have changed or reset their password. The grant was issued on '2023-03-20T18:55:38.2560432Z' and the TokensValidFrom date (before which tokens are not valid) for this user is '2023-03-29T18:39:41.0000000Z'.\r\nTrace ID: ******\r\nCorrelation ID: ****\r\nTimestamp: 2023-05-23 19:19:29Z",        
  "error_codes": [
    50173
  ],
  "timestamp": "2023-05-23 19:19:29Z",
  "trace_id": "****",
  "correlation_id": "***",
  "error_uri": "https://login.microsoftonline.com/error?code=50173"
}
--------------------------------------------------------------------------------

The behavioral change is explained by this change in base.go, where a subtle re-sequencing of authParams.AuthorizationType = authority.ATRefreshToken is now before the Replace() call, and authParams.CacheKey reacts differently due to this line herelmnow that AuthorizationType is set to authority.ATRefreshToken

Questions:

Is the application implementation incorrect? How should a consumer library correctly handle ReplaceHints.PartitionKey (previously called key)?
Is it expected that AcquireTokenSilent with the same publicClient and account results in Replace() being called with a different partition key?

Additional context / logs / screenshots Add any other context about the problem here, such as logs and screenshots.

weikanglim commented 1 year ago

I'm starting to suspect that applications should not react to partitionKey unless used specifically for on-behalf-of scenarios?

The partitioned manager in MSAL seems to be only component that caches by key in OBO, but the default storage manager just assumes 1:1.

Our CLI application only supports a 1:1 relationship between AD user and OS user, storing the token cache under a user-specific directory, so perhaps ignoring partitionKey is the right thing to do?

weikanglim commented 1 year ago

RE: How does multitenancy play in to this, if at all?

chlowell commented 1 year ago

This is an interesting issue, thanks for opening it. I believe the old behavior--suggesting a partition key for OBO data only--was incorrect and that azd can ignore the suggestion in any case.

Applications decide whether to partition an external cache. MSAL suggests a key but has no opinion on how applications use it. Partitioning is important for OBO data because the application doesn't get user account information in that flow, however this isn't the only scenario in which it makes sense to partition the cache. For example, a web app may distribute cache data across some number of databases.

My naive belief is that azd, being a CLI application, has nothing to gain by partitioning its cache and can therefore just ignore the suggested key. It isn't necessarily harmful to partition the cache, however I imagine a change in partitioning between azd versions would probably force users to reauthenticate after upgrading. I don't see a connection between partitioning and the AAD error you shared though, so perhaps I'm missing something important about azd's cache.

RE: How does multitenancy play in to this, if at all?

Can you elaborate some more? I don't see a connection.

Is it expected that AcquireTokenSilent with the same publicClient and account results in Replace() being called with a different partition key?

No; is that happening? I expect the key to be the user's home account ID.

weikanglim commented 1 year ago

Partitioning is important for OBO data because the application doesn't get user account information in that flow, however this isn't the only scenario in which it makes sense to partition the cache. For example, a web app may distribute cache data across some number of databases.

That makes sense.

I don't see a connection between partitioning and the AAD error you shared though, so perhaps I'm missing something important about azd's cache.

The connection here, based on my observations are:

azd is currently using the suggested key not as a partitioning key, but strictly as a lookup key. This is a bug in azd's implementation.
When a user triggers the interactive login flow, the login flow fails due to stale cache reading, due to the sequence of calls observed below:

1 Replace(): called from github.com/AzureAD/microsoft-authentication-library-for-go/apps/internal/base.Client.AuthResultFromToken
2 Export(): called from github.com/AzureAD/microsoft-authentication-library-for-go/apps/internal/base.Client.AuthResultFromToken
3 Replace(): called from github.com/AzureAD/microsoft-authentication-library-for-go/apps/internal/base.Client.AllAccounts
4 Replace(43bb7435-f8c4-4342-9788-fd15e454ea12.72f988bf-86f1-41af-91ab-2d7cd011db47): called from github.com/AzureAD/microsoft-authentication-library-for-go/apps/internal/base.Client.AcquireTokenSilent

Which results in the following behavior:

Step 1 and 2 are indirect calls from AcquireTokenInteractive, which populates a cache with lookup key = "".
Step 4 is coming from AcquireTokenSilent, which is what azd relies on to not prompt the user in-between CLI commands. It reads from the cache with lookup key = "{homeAccountId}". This was never updated in the login flow, and is potentially stale.

This used to work by coincidence, simply because Step 4 was previously a call to Replace("") and not Replace("{homeAccountId}").

Is it expected that AcquireTokenSilent with the same publicClient and account results in Replace() being called with a different partition key? No; is that happening? I expect the key to be the user's home account ID.

Incorrect statement on my part. In the linked code example, it is AcquireTokenInteractive, providing no partition key hints, but AcquireTokenSilent does. I'm definitely not confident to suggest anything wrong here, just curious what's the expected behavior.

chlowell commented 1 year ago

AcquireTokenInteractive, providing no partition key hints, but AcquireTokenSilent does.

Thanks for pointing this out. Both methods should suggest the same key (#424). I suppose azd might work as before if they did, however ignoring partition keys still looks like the best solution and I imagine would be faster to implement than waiting for an MSAL patch. What do you think?

weikanglim commented 1 year ago

@chlowell Absolutely. Sorry, was late on this update, and forgot to highlight it earlier --we had went ahead with the change to remove the logic that treated partitioning as the lookup key.

On a side note: Is cache.ExportReplace the stable interface? I can submit a issue / PR to try to maybe update the interface documentation to contain some of the considerations an implementor might think about based on our learnings.

chlowell commented 1 year ago

Great, I'm glad azd is unblocked.

The v1.0.0 API including cache.ExportReplace is stable, please feel free to contribute doc updates.

phanirithvij commented 1 year ago

@weikanglim could you update the docs with your findings? I am facing the same issue and solved it with hardcoding the key and ignoring the suggested key, what was your approach?

weikanglim commented 1 year ago

@phanirithvij Sorry, haven't come around to this. As @chlowell suggests above:

Applications decide whether to partition an external cache. MSAL suggests a key but has no opinion on how applications use it. Partitioning is important for OBO data because the application doesn't get user account information in that flow, however this isn't the only scenario in which it makes sense to partition the cache. For example, a web app may distribute cache data across some number of databases.

The key is meant as a suggested partitioning key for an external cache. In our CLI, using it as a lookup key for the user was incorrect, and we modified the logic to avoid using the key in this manner (change here).

AzureAD / microsoft-authentication-library-for-go

[Bug] cache.ExportReplace is called with different partition key in 1.0 #423