AzureAD / microsoft-authentication-library-for-python

Microsoft Authentication Library (MSAL) for Python makes it easy to authenticate to Microsoft Entra ID. General docs are available here https://learn.microsoft.com/entra/msal/python/ Stable APIs are documented here https://msal-python.readthedocs.io. Questions can be asked on www.stackoverflow.com with tag "msal" + "python".
https://stackoverflow.com/questions/tagged/azure-ad-msal+python
Other
756 stars 191 forks source link

Using `allow_broker` with device code flow causes token refreshing to fail: `Account with id '(pii)' not found.` #563

Closed jiasli closed 8 months ago

jiasli commented 1 year ago

Describe the bug Using allow_broker with device code flow causes token refreshing to fail: Account with id '(pii)' not found.

To Reproduce

# Opt-in WAM
az config set core.allow_broker=true

# Turn off token encryption so that we can edit it manually later
az config set core.encrypt_token_cache=false

# Log in with device code, but with an account not in WAM, such as domain account testuser@azuresdkteam.onmicrosoft.com
az login --tenant 54826b22-38d6-4fb2-bad9-b7b93a3e9c5a --use-device-code

# Edit ~/.azure/msal_token_cache.json. Change AccessToken.<key>.expires_on to 0 to make the access token expire
# {
#     "AccessToken": {        
#         "...": {
#             ...
#             "expires_on": "0",

# Trigger token refreshing
az group list

Expected behavior Token refreshing should succeed.

What you see instead

Account with id '(pii)' not found. Status: Response_Status.Status_AccountNotFound, Error code: 0, Tag: 525678464

Apparently, this is because device code's refresh token is saved to MSAL token cache, not WAM, but during token refreshing, MSAL tries to get access token from WAM.

The MSAL Python version you are using Latest dev branch: 56256e4c516e0d8469b554223f4bf3954f77437e

Additional context I understand there is no point to use device code flow when broker is available, but it's better to make the whole workflow more user-friendly.

If MSAL exposes enable_broker rather than allow_broker (like what it is doing internally):

https://github.com/AzureAD/microsoft-authentication-library-for-python/blob/2168027c4694b243ddeff08592396557d9848de0/msal/application.py#L564

life will be much easier - simply raise an error if acquire_token_by_device_flow is called on a PublicClientApplication with enable_broker=True.

Such allow_... design makes the logic far more complex - both acquire_token_by_device_flow and acquire_token_silent_with_error need to implement such fallback logic and be consistent with each other. In this issue, acquire_token_by_device_flow falls back to MSAL token cache, but acquire_token_silent_with_error doesn't and still reads from WAM.

rayluo commented 1 year ago

Yes, I can reproduce this issue. And @jiasli guessed the reason right, "Apparently, this is because device code's refresh token is saved to MSAL token cache, not WAM, but during token refreshing, MSAL tries to get access token from WAM."

The reason for that is, back then when we implemented the Broker/MsalRuntime/WAM integration, the guidance was "surface all MsalRuntime errors, instead of fallback to non-MsalRuntime code path". Now that I think about it, an MsalRuntime error of "hey I cannot find this account inside my (MsalRuntime's) cache so it is not my business", is not really an error, but an invite to tell MSAL Python to fallback gracefully.

I incline to change MSAL Python to also fallback when the error happens in read_account_by_id(), unless @msamwils has different opinion. Feel free to contact me if you need more info or discussion.

MSamWils commented 1 year ago

@rayluo , my understanding is that MSALRuntime do not support device code flow. Should MSAL Python not even invoking MSALRuntime in this scenario?

rayluo commented 1 year ago

@rayluo , my understanding is that MSALRuntime do not support device code flow. Should MSAL Python not even invoking MSALRuntime in this scenario?

MSAL Python already does not invoke MsalRuntime in device code flow. The current issue is on the second leg i.e. the acquire_token_silent(). At that point, MSAL Python will attempt MsalRuntime based on allow_broker=True. My proposal above is to gracefully absorb the "account not found" error from MsalRuntime, and then fallback to the RT in native MSAL's token cache. I built a proof-of-concept and it worked well. Shall I proceed?

jiasli commented 1 year ago

My proposal above is to gracefully absorb the "account not found" error from MsalRuntime, and then fallback to the RT in native MSAL's token cache.

I personally don't like "fallback" as "fallback" introduces unpredictability.

rayluo commented 1 year ago

My proposal above is to gracefully absorb the "account not found" error from MsalRuntime, and then fallback to the RT in native MSAL's token cache.

I personally don't like "fallback" as "fallback" introduces unpredictability.

Fair enough, and perhaps "fallback" is not the right word here. In my book (and probably in your "fallback introduces unpredictability" statement, too), fallback hints "some unknown error occurred and we do not know how to deal with it, so let's fallback to a default option". But in this case, "account not found" is actually expected, and going back to the non-broker code path is the right move.

If there are different proposal, please let us know.

rayluo commented 8 months ago

in this case, "account not found" is actually expected, and going back to the non-broker code path is the right move.

If there are different proposal, please let us know.

An update on this topic. We tried #569 but that approach was controversial. How about MSAL simply emits a warning during the initial Device Code Flow saying something like "Broker was enabled. Subsequent acquire_token_silent() will only work after a successful acquire_token_interactive()"? Azure CLI can then use with warnings.catch_warnings() to catch that warning, and then also emit a more user friendly warning from Azure CLI to end users.

What do you think, @jiasli ?

jiasli commented 6 months ago

How about MSAL simply emits a warning during the initial Device Code Flow saying something like "Broker was enabled. Subsequent acquire_token_silent() will only work after a successful acquire_token_interactive()"?

This is a regression in my humble opinion.

As https://github.com/AzureAD/microsoft-authentication-library-for-python/pull/569 as been merged, after bumping MSAL in https://github.com/Azure/azure-cli/pull/27726, I can verify the device code flow now works as before even when broker is enabled (by running az config set core.enable_broker_on_windows=true).

I can see in ~/.azure/msal_token_cache.json, the account looks like

    "Account": {
        "...": {
            ...
            "username": "xxx@xxx.onmicrosoft.com",
            "authority_type": "MSSTS",
            "account_source": "urn:ietf:params:oauth:grant-type:device_code"
        }
    },

In addition, for auth code flow, account_source is authorization_code; for username password flow, account_source is password.