AzureAD / microsoft-authentication-library-for-python

Microsoft Authentication Library (MSAL) for Python makes it easy to authenticate to Microsoft Entra ID. General docs are available at https://learn.microsoft.com/entra/msal/python/ and stable APIs are documented at https://msal-python.readthedocs.io. Questions can be asked on Stack Overflow with the tags "msal" and "python":
https://stackoverflow.com/questions/tagged/azure-ad-msal+python

Expose http_cache parameter, with its docs and recipe #407

Closed. rayluo closed this 2 years ago.

rayluo commented 2 years ago

This PR implements the internal proposal of http cache persistence.

Because MSAL 1.14's in-memory http cache behavior (i.e. #379) was specifically designed around a mechanism of "using a generic key-value storage as the http cache", all the heavy lifting has already been done in MSAL 1.14. This PR adds only 1 line to expose the new http_cache parameter, and another line to wire it up. The rest of the change is documentation, available for proofreading (please search for the http_cache keyword in this doc_staging environment).

When an app developer utilizes this PR's new parameter to use a persisted http cache, MSAL Python will cache those internal "instance metadata" entries across different PublicClientApplication and/or ConfidentialClientApplication instances, thereby resolving #80. Consequently, it also resolves #334. UPDATE: Thanks @jiasli for testing and confirming that this PR also resolves #334.
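
Roughly, opting in looks like the following (a minimal sketch based on the shelve recipe discussed in this PR's docs; the client ID, authority, and cache path below are placeholders):

import atexit
import shelve

import msal

persisted_http_cache = shelve.open("msal_http_cache")  # dict-like and persisted on disk
atexit.register(persisted_http_cache.close)            # flush and close on interpreter exit

app = msal.PublicClientApplication(
    "my_client_id",  # placeholder
    authority="https://login.microsoftonline.com/organizations",
    http_cache=persisted_http_cache,  # the parameter exposed by this PR
)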

jiasli commented 2 years ago

I encountered a failure while adopting http_cache:

Traceback (most recent call last):
  File "C:\Users\user1\AppData\Local\Programs\Python\Python39\lib\shelve.py", line 111, in __getitem__
    value = self.cache[key]
KeyError: '_index_'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "d:\cli-msal\knack\knack\cli.py", line 231, in invoke
    cmd_result = self.invocation.execute(args)
  File "d:\cli-msal\azure-cli\src\azure-cli-core\azure\cli\core\commands\__init__.py", line 657, in execute
    raise ex
  File "d:\cli-msal\azure-cli\src\azure-cli-core\azure\cli\core\commands\__init__.py", line 720, in _run_jobs_serially
    results.append(self._run_job(expanded_arg, cmd_copy))
  File "d:\cli-msal\azure-cli\src\azure-cli-core\azure\cli\core\commands\__init__.py", line 691, in _run_job
    result = cmd_copy(params)
  File "d:\cli-msal\azure-cli\src\azure-cli-core\azure\cli\core\commands\__init__.py", line 328, in __call__
    return self.handler(*args, **kwargs)
  File "d:\cli-msal\azure-cli\src\azure-cli-core\azure\cli\core\commands\command_operation.py", line 121, in handler
    return op(**command_args)
  File "d:\cli-msal\azure-cli\src\azure-cli\azure\cli\command_modules\profile\custom.py", line 147, in login
    subscriptions = profile.login(
  File "d:\cli-msal\azure-cli\src\azure-cli-core\azure\cli\core\_profile.py", line 184, in login
    subscriptions = subscription_finder.find_using_common_tenant(username, credential)
  File "d:\cli-msal\azure-cli\src\azure-cli-core\azure\cli\core\_profile.py", line 735, in find_using_common_tenant
    specific_tenant_credential = identity.get_user_credential(username)
  File "d:\cli-msal\azure-cli\src\azure-cli-core\azure\cli\core\auth\identity.py", line 193, in get_user_credential
    return UserCredential(self.client_id, username, **self._msal_app_kwargs)
  File "d:\cli-msal\azure-cli\src\azure-cli-core\azure\cli\core\auth\msal_authentication.py", line 26, in __init__
    super().__init__(client_id, **kwargs)
  File "d:\cli-msal\microsoft-authentication-library-for-python\msal\application.py", line 1447, in __init__
    super(PublicClientApplication, self).__init__(
  File "d:\cli-msal\microsoft-authentication-library-for-python\msal\application.py", line 420, in __init__
    self.authority = Authority(
  File "d:\cli-msal\microsoft-authentication-library-for-python\msal\authority.py", line 83, in __init__
    openid_config = tenant_discovery(
  File "d:\cli-msal\microsoft-authentication-library-for-python\msal\authority.py", line 142, in tenant_discovery
    resp = http_client.get(tenant_discovery_endpoint, **kwargs)
  File "d:\cli-msal\microsoft-authentication-library-for-python\msal\individual_cache.py", line 263, in wrapper
    return self._mapping[key]
  File "d:\cli-msal\microsoft-authentication-library-for-python\msal\individual_cache.py", line 139, in __getitem__
    sequence, timestamps = self._mapping.get(self._INDEX, ([], {}))
  File "C:\Users\user1\AppData\Local\Programs\Python\Python39\lib\shelve.py", line 106, in get
    return self[key]
  File "C:\Users\user1\AppData\Local\Programs\Python\Python39\lib\shelve.py", line 114, in __getitem__
    value = Unpickler(f).load()
_pickle.UnpicklingError: pickle data was truncated

This is because CLI creates multiple MSAL PublicClientApplication instances for different tenants, each with its own shelve instance. According to

https://github.com/python/cpython/blob/v3.9.7/Lib/dbm/dumb.py#L215-L221

            # Note that _index may be out of synch with the directory
            # file now:  _setval() and _addval() don't update the directory
            # file.  This also means that the on-disk directory and data
            # files are in a mutually inconsistent state, and they'll
            # remain that way until _commit() is called.  Note that this
            # is a disaster (for the database) if the program crashes
            # (so that _commit() never gets called).

once MSAL changes the content of _index_, dbm.dumb's .dat file is updated but its .dir file is not. If another dbm.dumb instance reads from the out-of-sync .dat and .dir pair, the error above occurs.
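
The scenario looks roughly like this (an illustrative sketch only, assuming the dbm.dumb backend that Windows falls back to; "_index_" is just the index key seen in the traceback above, and whether the second handle actually misbehaves depends on timing and on when dbm.dumb commits its index):

import shelve

writer = shelve.open("msal_http_cache")   # first PublicClientApplication's own cache handle
writer["_index_"] = ["some", "entries"]   # dbm.dumb appends to .dat immediately but defers the .dir update

reader = shelve.open("msal_http_cache")   # second instance opens its own handle on the same files...
print(reader.get("_index_"))              # ...and may read from an out-of-sync .dir/.dat pair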

jiasli commented 2 years ago

On Windows, there is no dbm.gnu or dbm.ndbm support, so dbm falls back to dbm.dumb, which is an unreliable solution (as said by its very developer).
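
A quick, illustrative way to see which backends a given Python build offers (the dbm package does a similar probe when it is imported):

import importlib

for name in ("dbm.gnu", "dbm.ndbm", "dbm.dumb"):
    try:
        importlib.import_module(name)
        print(name, "is available")
    except ImportError:
        print(name, "is NOT available")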

https://github.com/python/cpython/blob/v3.9.7/Lib/dbm/dumb.py#L1-L22 (from https://github.com/python/cpython/commit/9f824a7984fcd98d00dfc795e6a1d95317fd4b93 created in 1995)

"""A dumb and slow but simple dbm clone.

For database spam, spam.dir contains the index (a text file),
spam.bak *may* contain a backup of the index (also a text file),
while spam.dat contains the data (a binary file).

XXX TO DO:

- seems to contain a bug when updating...

- reclaim free space (currently, space once occupied by deleted or expanded
items is never reused)

- support concurrent access (currently, if two processes take turns making
updates, they can mess up the index)

- support efficient access to large databases (currently, the whole index
is read when the database is opened, and some updates rewrite the whole index)

- support opening for read-only (flag = 'm')

"""

https://github.com/python/cpython/blob/v3.9.7/Lib/dbm/dumb.py#L232-L235

        # XXX It's unclear why we do a _commit() here (the code always
        # XXX has, so I'm not changing it).  __setitem__ doesn't try to
        # XXX keep the directory file in synch.  Why should we?  Or
        # XXX why shouldn't __setitem__?

Problems:

  1. Using such an unreliable solution can break MSAL or Azure CLI under unexpected/unknown circumstances.
  2. As there is no concurrency support in dbm.dumb (comparable to msal_extensions.cache_lock.CrossPlatLock), MSAL can break when used by multiple processes.

jiasli commented 2 years ago

Alternative solutions I can think of are:

  1. Use a pure JSON cache, like token_cache, for the result of /organizations/v2.0/.well-known/openid-configuration, so that we can handle the cache with the existing FilePersistence (see the sketch after this list).
  2. ADAL has no tenant discovery logic and Azure CLI works well, so there could be a way to use the hard-coded /authorize and /token endpoints (#334).
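
A rough sketch of option 1 (illustrative only: requests is used here purely for demonstration, the file name is a placeholder, and the endpoint is the one mentioned above):

import json
import os

import requests
from msal_extensions import FilePersistence

CONFIG_PATH = "openid_configuration.json"
ENDPOINT = ("https://login.microsoftonline.com"
            "/organizations/v2.0/.well-known/openid-configuration")

persistence = FilePersistence(CONFIG_PATH)
if os.path.exists(CONFIG_PATH):
    openid_config = json.loads(persistence.load())   # reuse the cached JSON
else:
    openid_config = requests.get(ENDPOINT).json()    # fetch once, then persist as plain JSON
    persistence.save(json.dumps(openid_config))
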
rayluo commented 2 years ago

Thank you, Jiashuo, for your in-depth report, as usual! Let me answer your points in reverse order.

Alternative solutions I can think of are:

(2). ADAL has no tenant discovery logic and Azure CLI works well, so there could be a way to use the hard-coded /authorize and /token endpoint (openid-configuration HTTP request slows down MSAL #334).

This PR introduces a generic http cache concept. When an http_cache instance is provided by the app developer, MSAL will utilize that newly obtained capability to cache some http responses in certain situations. Tenant discovery is just one of the situations that benefit from http_cache. There are other, less frequent situations in which MSAL would still need an http_cache (which we can discuss in detail offline). In that sense, an alternative for tenant discovery alone is not a decisive factor for us to withdraw the effort on http_cache.

(1). Use a pure JSON cache like token_cache for the result of /organizations/v2.0/.well-known/openid-configuration so that we can handle the cache with existing FilePersistence.

The current PR, at its core, defines/exposes a generic http cache as a dict-like interface. It is not tied to any specific implementation of such a dict-like object. shelve was chosen for the "recipe" only, because it is already in the standard library and it (is supposed to) already behave like a dict, so our recipe looks concise. But we can always switch to any equivalent implementation if we need to. Searching for "python shelve alternative" will turn up many options.

An earlier prototype of the recipe was indeed based on dict + pickle.dump(...), thereby bypassing dbm. You could also choose dict + pickle.dumps(...) (note the s) + MSAL Extensions' FilePersistence, which was designed for a (token cache) file shared among multiple concurrent and potentially long-lived processes (but then you would need to somehow utilize its time_last_modified() and/or even use a lock).
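
That combination could look roughly like this (a sketch under my own assumptions: the base64 step is added so the pickled bytes can be stored as text by FilePersistence, error handling is deliberately coarse, and no lock is shown):

import atexit
import base64
import pickle

from msal_extensions import FilePersistence

persistence = FilePersistence("msal_http_cache.bin")   # placeholder path

try:
    http_cache = pickle.loads(base64.b64decode(persistence.load()))
except Exception:   # first run, or an unreadable/corrupted cache: start afresh
    http_cache = {}

atexit.register(lambda: persistence.save(
    base64.b64encode(pickle.dumps(http_cache)).decode("ascii")))

# http_cache can then be passed as PublicClientApplication(..., http_cache=http_cache)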

On Windows, there is no dbm.gnu or dbm.ndbm support, so dbm falls back to dbm.dumb, which is an unreliable solution (as said by its very developer).

https://github.com/python/cpython/blob/v3.9.7/Lib/dbm/dumb.py#L1-L22 (from python/cpython@9f824a7 created in 1995)

"""A dumb and slow but simple dbm clone.

...

XXX TO DO:

- seems to contain a bug when updating...

...

"""

Thanks for bringing me to the amusing history of Python's dbm. I wasn't aware that your term "its very developer" meant GvR. :-)

https://github.com/python/cpython/blob/v3.9.7/Lib/dbm/dumb.py#L232-L235

        # XXX It's unclear why we do a _commit() here (the code always
        # XXX has, so I'm not changing it).  __setitem__ doesn't try to
        # XXX keep the directory file in synch.  Why should we?  Or
        # XXX why shouldn't __setitem__?

Problems:

  1. Using such unreliable solution can break MSAL or Azure CLI under unexpected/unknown circumstances.

But it seems those beginning lines were only added in the first commit and never changed since; all of the implementation has changed since then. In particular, Jiashuo, the passage you quote above was added in this commit, which came with this commit message: "databases are associated with corruption problems, so I studied this code carefully and ran some brutal stress tests. I didn't find any bugs...".

  2. As there is no concurrency support in dbm.dumb (comparable to msal_extensions.cache_lock.CrossPlatLock), MSAL can break when used by multiple processes.

True. If you/we foresee multiple Azure CLI instances being run concurrently, Azure CLI can consider using msal_extensions.cache_lock.CrossPlatLock. That lock was also designed to be a generic, self-contained module in itself, so you could also use a shelve + msal_extensions.cache_lock.CrossPlatLock combination. If we really decide to go in this direction, it would probably also mean MSAL Python would need to surface an http_cache_lock parameter. :-/ Let me know.
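
For example, a shelve + CrossPlatLock combination could be sketched like this (illustrative; the lock-file name and the scope of the critical section are arbitrary choices):

import shelve

from msal_extensions.cache_lock import CrossPlatLock

with CrossPlatLock("msal_http_cache.lockfile"):        # serialize access across processes
    with shelve.open("msal_http_cache") as http_cache:
        http_cache["some_key"] = "some_value"          # any cross-process reads/writes go here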

because CLI creates multiple MSAL PublicClientApplication instances for different tenants, each with its own shelve instance.

This might be the very specific reason you are running into the issue. Perhaps you can try creating only one shelve instance and sharing it among your multiple PublicClientApplication instances (which do not overlap with each other and are all running inside one process?).
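
In other words, something along these lines (a sketch; the client ID and tenants are placeholders):

import shelve

import msal

shared_http_cache = shelve.open("msal_http_cache")   # one shelve for the whole process

apps = [
    msal.PublicClientApplication(
        "my_client_id",
        authority="https://login.microsoftonline.com/" + tenant,
        http_cache=shared_http_cache,                 # shared, instead of one shelve per app
    )
    for tenant in ("organizations", "common")
]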

jiasli commented 2 years ago

In particular, Jiashuo, the passage you quote above was added in this commit, which came with this commit message: "databases are associated with corruption problems, so I studied this code carefully and ran some brutal stress tests. I didn't find any bugs...".

As I tested and inspected the source code, this task is still not implemented:

- reclaim free space (currently, space once occupied by deleted or expanded
items is never reused)

Consider a very simple snippet:

import dbm.dumb

db = dbm.dumb.open(r'D:\testdbm')

db['mykey'] = 'a' * 512   # occupies one block in testdbm.dat
db['mykey'] = 'b' * 1024  # the larger value is appended; the space used by the old value is never reused
db.close()

Running this script multiple times makes D:\testdbm.dat keep growing. When used as an HTTP cache, the space occupied by expired entries is never released, so the cache grows indefinitely.

If we really decide to go with this direction, it would probably also mean MSAL Python would also need to surface an http_cache_lock parameter. :-/ Let me know.

This would certainly be a nice-to-have feature if MSAL wants to support concurrency.

Perhaps you can try creating only one shelve instance, and share it by your multiple PublicClientApplication instances (which are not overlapping with each other and are all running inside one process?).

This is exactly what I am going to do next.

jiasli commented 2 years ago

On Linux, dbm.gnu has no concurrency support either.

Consider a simple script:

# testdbm.py
import dbm.gnu
import time

db = dbm.gnu.open('testdbm', 'c')
time.sleep(60)

# Put it in background
$ python3 testdbm.py &
[1] 1007

# This will fail
$ python3 testdbm.py
Traceback (most recent call last):
  File "testdbm.py", line 4, in <module>
    db = dbm.gnu.open('testdbm', 'c')
_gdbm.error: [Errno 11] Resource temporarily unavailable: 'testdbm'

I also see this:

https://docs.python.org/3/library/shelve.html#restrictions

The shelve module does not support concurrent read/write access to shelved objects. (Multiple simultaneous read accesses are safe.) When a program has a shelf open for writing, no other program should have it open for reading or writing.

rayluo commented 2 years ago

On Linux, dbm.gnu has no concurrency support either.

OK, good to know. Now, given that you mentioned there could potentially be multiple Azure CLI instances running on the same machine, the goal becomes finding a dict-like component that supports concurrent access. We may be able to sift through existing packages on PyPI to see if any would fit, or we may eventually develop a new package for that. But for now, perhaps this recipe would be a good-enough workaround.
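
The gist of that recipe, as far as this thread is concerned, is a plain dict that is unpickled at start-up and re-pickled at exit (a sketch, not the authoritative recipe text; the file name and client ID are placeholders):

import atexit
import os
import pickle

import msal

HTTP_CACHE_FILE = os.path.join(os.path.expanduser("~"), ".my_app_http_cache.bin")

try:
    with open(HTTP_CACHE_FILE, "rb") as f:
        persisted_http_cache = pickle.load(f)
except (FileNotFoundError, pickle.UnpicklingError, EOFError):
    persisted_http_cache = {}   # first run, or the cache was corrupted: start over

atexit.register(lambda: pickle.dump(
    persisted_http_cache, open(HTTP_CACHE_FILE, "wb")))   # best-effort persistence at exit

app = msal.PublicClientApplication(
    "my_client_id", http_cache=persisted_http_cache)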

jiasli commented 2 years ago

Just want to mention that the file should be opened in binary mode b:

https://docs.python.org/3/library/pickle.html#pickle.Pickler

class pickle.Pickler(file, protocol=None, *, fix_imports=True, buffer_callback=None)

This takes a binary file for writing a pickle data stream.

jiasli commented 2 years ago

Also, writing to the same file from 2 processes can result in corrupted file content (such as https://github.com/Azure/azure-cli/issues/9427). Unpickling from a corrupted binary file will result in an UnpicklingError.
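
One common mitigation for the half-written-file part of this problem (not necessarily what Azure CLI adopted) is to write to a temporary file and atomically swap it into place; concurrent writers still race (last one wins), but readers never see a torn file:

import os
import pickle
import tempfile

def dump_atomically(obj, path):
    """Pickle obj to path without ever leaving a partially written file behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:          # binary mode, as noted above
        pickle.dump(obj, f)
    os.replace(tmp_path, path)              # atomic rename on both POSIX and Windows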

rayluo commented 2 years ago

Just want to mention that the file should be opened in binary mode b:

Unpickling from corrupted binary file will result in UnpicklingError.

Thanks for the catch! Both of your comments were correct, and I have now updated the previous recipe accordingly. Please give me a confirmation, then I'll fold it into this PR and merge it in.

jiasli commented 2 years ago

I further noticed that the document mentions:

https://docs.python.org/3/library/pickle.html#pickle.UnpicklingError

Note that other exceptions may also be raised during unpickling, including (but not necessarily limited to) AttributeError, EOFError, ImportError, and IndexError.

I will try to ignore all errors during unpickling.
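
That is, something like the following (a sketch; the file name is a placeholder):

import pickle

try:
    with open("msal_http_cache.bin", "rb") as f:
        persisted_http_cache = pickle.load(f)
except Exception:   # pickle documents no exhaustive list, so treat any failure as "no usable cache"
    persisted_http_cache = {}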

rayluo commented 2 years ago

I further noticed that the document mentions:

https://docs.python.org/3/library/pickle.html#pickle.UnpicklingError

Note that other exceptions may also be raised during unpickling, including (but not necessarily limited to) AttributeError, EOFError, ImportError, and IndexError.

I will try to ignore all errors during unpickling.

Good finding, and an intriguing topic.

jiasli commented 2 years ago

I admit the Python documentation is pretty vague about the error.

only happen when a bug exists inside the pickling-and-unpickling round trip.

We don't know exactly what the bug is. In other words, we are not sure which kinds of corrupted files result in which kinds of exceptions - it may be that a file corrupted in one way triggers UnpicklingError, while a file corrupted in another way triggers EOFError.

For example, unpickling from an empty file results in:

Traceback (most recent call last):
  File "D:\cli\testproj\main.py", line 4, in <module>
    dic2 = pickle.load(f)
EOFError: Ran out of input