AzureAD / microsoft-authentication-library-for-python

Microsoft Authentication Library (MSAL) for Python makes it easy to authenticate to Microsoft Entra ID. General docs are available here https://learn.microsoft.com/entra/msal/python/ Stable APIs are documented here https://msal-python.readthedocs.io. Questions can be asked on www.stackoverflow.com with tag "msal" + "python".
https://stackoverflow.com/questions/tagged/azure-ad-msal+python
Other
756 stars 191 forks source link

acquire_token_silent() shall not invoke broker if the account was not established by broker #569

Closed rayluo closed 8 months ago

rayluo commented 1 year ago

This is useful when the user somehow chose to use device code flow successfully and now want to do an acquire_token_silent(). The event sequence in this scenario looks like this:

  1. End user enabled broker.
  2. For whatever reason, end user chooses to sign in using Device Code Flow (DCF). Broker (WAM) does not support DCF but, per our existing design, we do not want to break this scenario, so MSAL automatically falls back to use native DCF implementation. This is already happening today, and the sign-in would be successful, and access token (AT) and refresh token (RT) will be stored inside MSAL's native token cache.
  3. Subsequent AcquireTokenSilent() call (especially those after the initial access token expires) would attempt broker code path, and end up with an "account not found" error from MsalRuntime's read_account_by_id() call, because the account was not established via MsalRuntime code path in step 2.
  4. Currently, MSAL's AcquireTokenSilent() surfaces that MsalRuntime error to calling app, and that results in end user's token refresh attempt failure.
  5. ~Per our recent discussion (in May 2023) with Azure CLI team and MsalRuntime team, this PR changes to catch error in step 3, and falls back to native token cache, whose RT will work.~
  6. Per our recent discussion (in Nov 1st, 2023) with MSAL C++ team, this PR changes to record which flow established an account, and then bypass broker in its subsequent acquire_token_silent() calls. Also applies the same treatment to Username Password flow (a.k.a. ROPC). You may review this internal design doc here.

Note: Most of the conversation below between June to October were based on the outdated bullet point No.5, so, you do not need to read them.

@msamwils, you do not have to review the implementation details in this PR (which contains some other logistic changes), but please review the summary above.

bgavrilMS commented 1 year ago

I'm not sure that I agree with step 5, because it will violate the token protection promise. The scenario that can break is:

I would let the app handle this. If the app uses device code flow, they should not use broker when calling AcquireTokenSilent.

rayluo commented 1 year ago

... it will violate the token protection promise. The scenario that can break is:

Not necessarily.

  • app uses interactive auth with browser and puts bearer RT in cache
  • app upgrades to use broker
  • app calls AcquireTokenSIlent and runtime's read_account_by_id fails because (non-)existing cached account (in WAM?)

App may already handles this - Azure CLI did - by suggesting end user to do an az account clear while switching to use broker. Therefore, the AcquireTokenSilent() simply will not be called. The end user will do a fresh "az login" which will go with broker code path and then subsequent AcquireTokenSilent() will work smoothly thereafter.

  • app now uses it's cached bearer RT and breaks token protection CA, even though it tries to use WAM

Even if the app or the end user did not clear the native MSAL's token cache, the bearer AT would not meet CA requirements indeed, at that point the resource is expected to reject such an AT with a meaningful error. This shall be a defined and expected error-handling situation of CA. The end user shall be prompted (by the app such as Azure CLI) to do "az login" again which in turn will run the broker code path.

I would let the app handle this. If the app uses device code flow, they should not use broker when calling AcquireTokenSilent.

Not quite easy. In native MSAL's calling pattern, when AcquireTokenSilent is called, there is no intrinsic state to indicate whether the existing signed-in session was established by device code flow or other flow. App could handle this, but that is an extra burden that shall be handled by MSALs, somehow (we do have other options here, such as expanding token cache schema to indicate what flow an account was established by; I thought that would be more heavyweight fix).

headless Windows isn't a common scenario, I am not sure why we'd want to support this

You brought this up in an offline chat.

Indeed, this use case is rare, and that is probably why this is not (yet?) reported by end users, by only found by Azure CLI team's internal coverage tests.

On the other hand, besides of the "do not violate the token protection promise" design tenet, another implicit tenet is "do not break existing users and scenarios". The latter is the reason why native MSALs already have many fallback logic for scenarios that we know broker/WAM does not support: ADFS, B2C, etc., and device code flow is already one of them. We just did not do it completely. Currently, when broker is enabled, the initial device code flow is successful (which is what we want) but becomes broken after one hour (i.e. the AT expires). The latter is not our intention.

bgavrilMS commented 1 year ago

... it will violate the token protection promise. The scenario that can break is:

Not necessarily.

  • app uses interactive auth with browser and puts bearer RT in cache
  • app upgrades to use broker
  • app calls AcquireTokenSIlent and runtime's read_account_by_id fails because (non-)existing cached account (in WAM?)

App may already handles this - Azure CLI did - by suggesting end user to do an az account clear while switching to use broker. Therefore, the AcquireTokenSilent() simply will not be called. The end user will do a fresh "az login" which will go with broker code path and then subsequent AcquireTokenSilent() will work smoothly thereafter.

  • app now uses it's cached bearer RT and breaks token protection CA, even though it tries to use WAM

Even if the app or the end user did not clear the native MSAL's token cache, the bearer AT would not meet CA requirements indeed, at that point the resource is expected to reject such an AT with a meaningful error. This shall be a defined and expected error-handling situation of CA. The end user shall be prompted (by the app such as Azure CLI) to do "az login" again which in turn will run the broker code path.

I would let the app handle this. If the app uses device code flow, they should not use broker when calling AcquireTokenSilent.

Not quite easy. In native MSAL's calling pattern, when AcquireTokenSilent is called, there is no intrinsic state to indicate whether the existing signed-in session was established by device code flow or other flow. App could handle this, but that is an extra burden that shall be handled by MSALs, somehow (we do have other options here, such as expanding token cache schema to indicate what flow an account was established by; I thought that would be more heavyweight fix).

headless Windows isn't a common scenario, I am not sure why we'd want to support this

You brought this up in an offline chat.

Indeed, this use case is rare, and that is probably why this is not (yet?) reported by end users, by only found by Azure CLI team's internal coverage tests.

On the other hand, besides of the "do not violate the token protection promise" design tenet, another implicit tenet is "do not break existing users and scenarios". The latter is the reason why native MSALs already have many fallback logic for scenarios that we know broker/WAM does not support: ADFS, B2C, etc., and device code flow is just one of them. Currently, when broker is enabled, device code flow becomes broken, which is not our intention.

In any case, there is a burden on the app - either they have to remember that DCF is used and follow up with ATS + no-broker or they have to provide guidance to the user (!) to clear the cache.

To me, DCF is never a good solution because it doesn't satisfy device CA. So I wouldn't optimize for DCF.

rayluo commented 1 year ago

In any case, there is a burden on the app - either they have to remember that DCF is used and follow up with ATS + no-broker or they have to provide guidance to the user (!) to clear the cache.

There shouldn't be a burden on the app. With this PR (or other different approaches with similar results), the app does not have to remember and differentiate between DCF and other flows. In particular, Azure CLI's guidance to the user (!) to clear the cache is debatable in itself, is not strictly necessary, and was NOT proposed for this DCF purpose (it was merely proposed as a hygiene measure: to not leave abandoned token cache files behind).

To me, DCF is never a good solution because it doesn't satisfy device CA. So I wouldn't optimize for DCF.

Although this issue was exposed in a DCF scenario, the purpose of this PR is not to optimize for DCF. It is to fix a logical ambiguity. When MSAL calls MsalRuntime's ReadAccountById() and the outcome is semantically an AccountNotFound, wouldn't it be a meaningful indicator suggesting that "this account was not handled by me (broker), not my business, you (msal) shall deal with it yourself"?

MSamWils commented 1 year ago

This is useful when the user somehow chose to use device code flow successfully and now want to do an acquire_token_silent(). The event sequence in this scenario looks like this:

  1. End user enabled broker.
  2. For whatever reason, end user chooses to sign in using Device Code Flow (DCF). Broker (WAM) does not support DCF but, per our existing design, we do not want to break this scenario, so MSAL automatically falls back to use native DCF implementation. This is already happening today, and the sign-in would be successful, and access token (AT) and refresh token (RT) will be stored inside MSAL's native token cache.
  3. Subsequent AcquireTokenSilent() call (especially those after the initial access token expires) would attempt broker code path, and end up with an "account not found" error from MsalRuntime's read_account_by_id() call, because the account was not established via MsalRuntime code path in step 2.
  4. Currently, MSAL's AcquireTokenSilent() surfaces that MsalRuntime error to calling app, and that results in end user's token refresh attempt failure.
  5. Per our recent discussion with Azure CLI team and MsalRuntime team, this PR changes to catch error in step 3, and falls back to native token cache, whose RT will work.

@MSamWils, you do not have to review the implementation details in this PR (which contains some other logistic changes), but please review the summary above.

ReadAccountById returned account was not found means that there is no account corresponding to that ID in MSAL C++ cache. Whether or not how native MSAL interpret and behave is purely based on the scenario. For example:

I agree with @bgavrilMS that if the MSALRuntime do not support the scenario, all the subsequent acquireToken calls should not go to the MSALRuntime. The application could give us the hint or we (in the native MSAL layer) to maintain the hint in the account cache to "remember" we are in this state.

rayluo commented 8 months ago

FYI, @bgavrilMS , et al, you can ignore the lengthy and outdated conversation above, because the controversial bullet point No. 5 in the PR description has been replaced by a new bullet point No. 6.