Missing documention on thread safety/concurrent usage of SDK

petergaultney commented 1 year ago

Describe the bug

I've encountered a lot of issues trying to get consistent performance out of the Azure Python SDK when using thread-based parallelism to keep concerns separate within the application. When using 10 threads in parallel, it often takes 10 or more seconds to make simple data_lake.file_exists checks. I suspect this has something to do with the intended usage of the SDK when running in threaded contexts, but I can't find any documentation on the subject.

For instance, it's unclear whether it's safe to use module-level (singleton) DefaultCredential objects. In the past I did not, and I ran into auth issues that looked a lot like what was described in #26177. So I switched to using a singleton credential and that seems to work better, though I ended up putting a Python threading.Lock around get_token to be on the safe side.

Similarly, it's unclear whether DataLakeServiceClient or FileSystemClient are designed to be shared across threads, or what resources they consume when not used as context managers (e.g. with ds_service_client.get_file_system_client(...) as file_system_client). They seem designed to be used as context managers, but the docs don't mention anything about the consequences of not using them that way, or of sharing them across threads. As you know, you can't easily 'share' open contexts across different parts of your application.

I am beginning to suspect that a lot of my performance issues come from having too many short-lived clients being created, but I'm struggling to find any examples or documentation on what is expected in multithreaded usage.

I'd appreciate some guidance, but more than that, I'm sure the userbase at large would appreciate some documentation about intended usage of this library.

annatisch commented 1 year ago

Hi @petergaultney, This is a great question - and one that we have not documented well, so thank you for posting! With regards to our general policy on threadsafety - each client should be threadsafe. The way we have constructed the underlying HTTP request/response pipeline is intended that it can be called from multiple threads at once. That's not to say we may not have some bugs here and there though - so we definitely appreciate any reports of poor/unexplained performance so that we can investigate deeper!

Regarding the context manager nature of the clients, the code in the SDK itself (i.e. the pipeline I mentioned above) does not itself hold open any resources (there are a couple of exceptions here) - however we use underlying transport libraries (for example requests or aiohttp that will hold open a connection pool to reuse connections for optimal performance. The consequences for not using the client in a context (or otherwise close it after use) vary depending on which underlying transport is being used:

requests (the default for synchronous clients): The "open resource" in this case will be the requests.Session. This Session will be opened on first use (regardless of whether the context syntax was used), and requests do not really document the consequences of not closing it, nor do they imply in their documentation that it's strongly recommended. For example, their snippets and samples do not always demonstrate closing.
aiohttp (the default for asynchronous clients): The "open resource" in this case will be the aiohttp.ClientSession. Like requests, this Session will be opened with the first service call made, but unlike requests, aiohttp will actually issue warnings if the Session is left unclosed. So not cleaning up the resources can become very noisy.

So depending on the transport used (we also offer support for other options beyond the above two defaults), the resources being managed within the context may vary - therefore it's our general recommendation that all clients make use of the context manager as "best practice" so that whatever transport is being used will be correctly handled.

Regarding DefaultAzureCredential - the reason this also has a context manager, is because this object also owns an instance of a client for working with various authentication workflows, and so it's also ultimately wrapping one of the above transports.

With all the above being said, our general recommendation is fewer clients being preferred, as the initial setup of the connection pool will be more expensive than reusing it. All of this is general advice, though depending on your architecture and needs we may be able to provide you with more specific guidance or help diagnose any threading/performance problems. I will tag this issue under Storage (as you mentioned a couple of the Storage SDKs) and include that team - feel free to provide more details on your application and we will do our best to assist!

cc @jalauzon-msft, @vincenttran-msft

petergaultney commented 1 year ago

Thank you - this is really helpful and confirms some things I've determined experimentally recently.

I've ended up writing a very brute-force test suite to help me make this decision for myself. I'm using ADLS as a key-value store with no added operational burden (we also need to be able to store large blobs with the same system), and I had gotten to the point where I was saturating my throughput locally, apparently because of creating too many clients. This had also bitten me in the past when I tried to keep separate credentials, I think for the same reason - each credential was separately performing the network calls associated with Managed Identity on AKS.

I wrote separate code to test whether sharing a single client with a shared credential would work, and somewhat to my surprise, it worked quite well up through the limit of my testing so far, which is about 1200 threads.

My workload follows a "setup, wait for result, retrieve result" flow for each thread - a thread will perform a couple of ADLS operations, wait for a Kubernetes job to finish, and then check for the result of that Kubernetes operation. When these jobs are short, the orchestrator (which uses no multiprocessing) can spend most of its 1 CPU communicating with ADLS and eventually starts being unable to keep up with the jobs.

Switching to a single client in my case improved overall throughput by 6x, from 100 parallel threads all the way through 1200.

It may be worth experimenting with a small but >1 # of clients at some point (I use a semaphore to prioritize N logical threads of execution getting completed before any other threads get halfway started), and if I do that I'll try to report back.

Hopefully these comments will pop up in some other people's searches, and eventually it would be great to see this documented officially.

annatisch commented 1 year ago

Awesome - thanks for update @petergaultney! Please let us know if there's a specific scenario we can help improve.

@lmazuel - what might be the best way to start an effort to add Client-level documentation regarding the use of context managers and thread-safety? I feel at a minimum there's probably a standard blurb we could add to each client to provide some basic details here.

(Removing the Storage tag for now)

ks6088ts commented 2 months ago

what might be the best way to start an effort to add Client-level documentation regarding the use of context managers and thread-safety?

Hi @lmazuel

How is the progress here? One of my customers has an API server implementation that calls DefaultAzureCredential() on every API call and was encountering blocking issues. If changes at the library level are not possible, I think it would be desirable to improve the documentation and sample code.

pvaneck commented 2 months ago

Hey, @ks6088ts. I agree that some documentation would be great. I hope to have a draft out regarding general client/credential lifetime management and thread safety sometime next week.

Azure / azure-sdk-for-python

Missing documention on thread safety/concurrent usage of SDK #28665