Azure / azure-functions-python-worker

Python worker for Azure Functions.
http://aka.ms/azurefunctions

Enable concurrent execution of Python functions #236

Closed maiqbal11 closed 5 years ago

maiqbal11 commented 6 years ago

I've managed to isolate an issue where a queue trigger function is unable to call an HTTP endpoint exposed by a function in the same app. The behavior manifests whether the functions are synchronous or asynchronous.

Repro steps

Provide the steps required to reproduce the problem:

  1. The offending code is here: QueueWithHttpCall.zip. There is a function called QueueTriggerPython which pulls items from a queue and issues a GET request to an endpoint exposed by the IntakeHttpTrigger function.
  2. Activate virtual environment and install requirements: pip install -r requirements.txt.
  3. Install extensions: func extensions install
  4. Configure local.settings.json to point to the storage account that you are using.
  5. Run func host start and then add item to the configured queue.

Expected behavior

The queue trigger should be able to successfully call into the HTTP endpoint and return the correct status code as well as print out the log message to the console.

Actual behavior

The queue trigger is activated but hangs when trying to call the HTTP endpoint, getting stuck at the following point:

Executing 'Functions.IntakeHttpTrigger' (Reason='This function was programmatically called via the host APIs.', Id=411d2a47-71de-4ab1-b231-c99a9794a7cb)

This is after the point at which the host calls into the language worker to process the request.

Known workarounds

  1. Both QueueTriggerPython and IntakeHttpTrigger are synchronous. Raise the number of workers in dispatcher.py to 2: https://github.com/Azure/azure-functions-python-worker/blob/6068092982479932f1d2d3c315ee6835cdeac8c1/azure/functions_worker/dispatcher.py#L56

  2. Both QueueTriggerPython and IntakeHttpTrigger are asynchronous. No known workarounds.

Related information

For the sync case, there appears to be some issue with the worker threads which are blocking themselves rather than processing the request from a function in the same app. This would explain why raising the number of thread workers to 2 caused the call to go through (one thread for queue trigger and one for the subsequent http call it makes). For the async case, it might be a manifestation of a similar issue since we are executing in the main event loop rather than using a separate thread pool.

maiqbal11 commented 6 years ago

\cc @asavaritayal @1st1 @elprans

asavaritayal commented 6 years ago

@elprans can you investigate this issue?

asavaritayal commented 6 years ago

Also adding @1st1 since this seems to be related to how we're handling the thread pool.

elprans commented 6 years ago

@maiqbal11 Your async case does not work because you are making a blocking HTTP request, which blocks the entire event loop. Use aiohttp to make self-requests and it will work.

As for the sync case, I don't think we can handle this safely. Increasing max_workers to a value greater than 1 opens a huge can of worms, since it makes user code essentially multi-threaded with all the consequences that has.
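For illustration, here is a minimal sketch of an async queue trigger making the self-request with aiohttp; the URL and binding shape are assumptions for the sake of the example, not taken from the attached repro:

import logging

import aiohttp
import azure.functions as func


async def main(msg: func.QueueMessage) -> None:
    # Placeholder route for the IntakeHttpTrigger endpoint on the local host.
    url = 'http://localhost:7071/api/IntakeHttpTrigger'

    # aiohttp awaits the request instead of blocking the worker's event loop,
    # so the IntakeHttpTrigger invocation can run concurrently on the same loop.
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            logging.info('IntakeHttpTrigger returned status %s', response.status)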

maiqbal11 commented 6 years ago

@elprans Works as expected when using aiohttp. Thanks for the clarification! Since the sync case can't be handled safely, the recommendation would be to use async constructs when sending requests to the same function app.

maiqbal11 commented 6 years ago

\cc @fabiocav @paulbatum @brettcannon @zooba

maiqbal11 commented 6 years ago

Re-opening as there is some pending discussion about potentially raising the number of workers in the thread pool executor.

maiqbal11 commented 6 years ago

Based on a conversation with @paulbatum who had a few questions/ideas about changing the thread pool size. The basic question is whether we should trade off some of the safety guarantees of having single-threaded user code in favor of allowing (potentially non-proficient) users to write sync code that can run more concurrently.

For more context, the issue in this thread was raised by an internal customer with concerns about latency in their calls. It turned out that this particular scenario did not work because there was only one worker in the threadpool, so requests from one function in the app to another could not be processed. It is likely that other customers will face this issue as well - with the expectation that their code will be able to run with some multi-threading that we provide. Each of these customer issues could turn into a support case that we would need to tackle.

Some discussion questions based on this:

  1. What are the pitfalls that we run into when allowing multi-threaded code, and how likely are these for the average user? Allowing it would unblock basic scenarios like the one noted in this issue. @elprans, perhaps you can elaborate more on this.

  2. Can we expect concurrency from multi-threaded code even if we do not utilize async/await constructs? I ran an experiment where an HTTP endpoint responded with a 10-second delay to calls (made using the Python requests library) from a queue trigger. I ran it for a batch of 4 queue messages with max_workers=1 and max_workers=4. Both cases produced similar performance (~40 seconds in total). With max_workers=4, the expectation would be that when requests.get(url) is called, the thread waiting for a response would yield control to another thread that has not yet been able to send its request. However, that does not seem to be the case: the threads execute in an essentially synchronous fashion, each waiting for the duration of its GET request before another thread gets to execute. I've been unable to find a concrete answer on this, but it does look like threads cede control during certain operations (https://stackoverflow.com/questions/14765071/python-and-truly-concurrent-threads). Would really like to hear more thoughts/get clarity on this (see the sketch below).
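As a point of comparison, here is a small standalone sketch (outside the Functions host, with a placeholder endpoint) for checking whether blocking HTTP calls overlap across threads. CPython releases the GIL while a thread waits on socket I/O, so with four workers the four calls should finish in roughly the time of one if they truly overlap:

import concurrent.futures
import time

import requests

# Placeholder for an endpoint that takes ~10 seconds to respond.
URL = 'http://localhost:8000/slow'


def fetch(_):
    # requests.get blocks this thread, but the GIL is released during the
    # underlying socket wait, so other threads can issue their requests too.
    return requests.get(URL).status_code


start = time.monotonic()
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    statuses = list(pool.map(fetch, range(4)))
print(statuses, 'elapsed:', round(time.monotonic() - start, 1))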

brettcannon commented 5 years ago

Since I was cc'ed, I will say my opinion is to agree with Elvis: don't up the number of workers, and push users towards async/await, both to keep the worker simpler and to help users not shoot themselves in the foot with threaded code.

paulbatum commented 5 years ago

Hey @brettcannon, thanks for weighing in. I am trying to figure out how to follow your advice while balancing it with the promises we make around functions regarding dynamic scaling and effective utilization of our hardware. For example, I have concerns that many python users will write their functions using synchronous APIs, and this will result in very poor per-instance throughput (e.g. the entire application instance is idle while waiting for a single outbound HTTP request), which will in turn cause us to scale out the application more aggressively, using additional hardware.

Basically what it boils down to is that we want functions to run the user code as efficiently as possible, and this applies for both well written code, and not-so-well written code. Now there are many mistakes that customers can make that we can't correct for, but that doesn't mean we should give up completely.

Can you help me to understand the downside of us allowing multithreaded execution of synchronous python functions within a single application instance? What are some examples of how customers could "shoot themselves in the foot"? I'd like to understand if these examples are unique to Python. In contrast, we allow C# functions to be written to execute synchronously and we rely on .NET threading to provide adequate performance for these scenarios. The stateless programming model for functions means that we can do this without really having to teach users the ins and outs of multithreaded programming.

brettcannon commented 5 years ago

No, there's nothing special here in regards to Python and threads. The only thing to be aware of is CPython's GIL means that CPU-bound code won't see any benefit through threading, only I/O-bound code.

1st1 commented 5 years ago

CPython's GIL means that CPU-bound code won't see any benefit through threading, only I/O-bound code.

Also because of the GIL Python libraries and types aren't always threadsafe. So I'd be extremely cautious to run all code in a multi-threaded mode by default (and that's why I implemented this restriction in the first place.)

paulbatum commented 5 years ago

Several other languages have APIs and surface area that are not threadsafe - C# and Java are both good examples. We don't force those to run single threaded. We rely on customers to either stay in the stateless programming model (and not worry about threadsafety), or to be careful when they go outside the bounds of the stateless model (such as by using static variables).

I am not sure I understand why enforcing a single thread of execution is a good tradeoff in the case of Python. There are the hardware utilization concerns I mentioned above, and similarly, customers that choose to run functions on dedicated hardware (such as an App Service plan) are likely to open support tickets reporting poor performance. Tickets that require analyzing the customer code to diagnose are typically expensive.

The number of support cases we've received from C# developers running into threadsafety issues is truly tiny. I think we've already had more cases about poor python performance due to the use of synchronous APIs that do IO.

Any more insights or examples you can share to help me understand your perspective? Do my concerns make sense to you?

brettcannon commented 5 years ago

Threading is just not as big of a thing in the Python community as it is in C# and Java. The GIL is enough of a thing that most people simply don't bother. This means you can't rely on libraries not to stuff things into global state that will fall over badly in a multi-threaded situation, because they never cared about race conditions. I don't know how the worker runs multiple functions, but if you're sharing modules across workers then debugging will be tough, because the interactions won't come from the code in your function in a single execution but from some other function running simultaneously and modifying things in a strange way. Python has a lot of global state because people never think about this sort of thing. (It also ties into Python putting a priority on developer productivity since threading is not exactly a good way to make yourself write better code 😉 .)

In my opinion, if increasing the number of workers is just a setting flip then I would try it with the current setting and see how users respond. If you say "use async for increased performance" and they still come back decrying the lack of threads then would it be difficult to increase it later on? Compare that to giving threads initially, users complaining about weird bugs in their functions, and then having to scale it back after people have put in the effort to try and make threads work. To me the former is improving things for users (if it comes to that), while the latter is walking back (if it comes to that).

But I'm not maintaining the service or dealing with users and this is all subjective so I unfortunately don't have a magical answer for you short of asking the community how much they want threads in the face of async being available and potential debugging difficulty (I know I personally will only be doing async workloads for scaling purposes 😁 ).

paulbatum commented 5 years ago

@brettcannon Thanks Brett, this helps. Following on a little from your point about what it might make sense to start with and what we could change later, I'm concerned that starting with single threaded mode will tie our hands somewhat in that we could not really switch the default to multiple threads at a later point in time, without the risk of suddenly breaking lots of code that was written without thread-safety in mind. You're right that we could later add some sort of opt-in setting that allows multithreaded execution but that won't help me get effective utilization of our hardware in the case of consumption (I can't guarantee that users will opt-in).

I guess one possibility we could consider that we haven't discussed yet is that we run multiple python worker processes. This is less efficient from a memory utilization perspective, but it would allow concurrency within a single machine without exposing the user to threadsafety issues.

asavaritayal commented 5 years ago

/cc @anirudhgarg

ericdrobinson commented 5 years ago

I'm fairly new to concurrency in Python (the points above about Python and its GIL definitely noted) but I think that I-as-a-user may have bumped into some limitations associated with this issue.

My Use Case

I've built an HttpTrigger Python Function. It works as follows:

  1. Receive HTTP Request for some stored resource based on request metadata.
  2. Check in Azure Storage to see if up-to-date version of resource exists.
  3. If no version exists in Storage (or is out of date), do the following:
    1. Download a file to process to generate the requested resource.
    2. Process the downloaded file. This processing is heavy and can take several seconds.
    3. Upload the newly generated resource to Storage.
  4. Return the resource to the requester of the HTTP Request.

It's fairly simple, has zero shared state between HTTP Requests, and should be "Embarrassingly Parallel™".

The Problem with the Current Approach

It seems that the current processing model for Python may be getting in the way of my use case. My initial, naive implementation left the concurrency up to the automatically handled thread-pool as described in the Python Function docs. When I noticed that responses to 50 simultaneous requests appeared to come in serially, I started to look a bit deeper.

With my current understanding, the logic I outlined above is CPU-bound and will not gain benefit from the default thread-pooling because Python's event loop is simply blocked until the main() function returns. Presumably we could use some async-safe APIs to perform the download in step 3.i above but that, I fear, is not where the majority of time is spent. (For the record we use the azure.storage.blob library to perform the check in step 2.)

If I recall correctly, Python Functions is still in Preview (along with Azure App Service for Linux on which it runs) and does not yet support horizontal scaling (at least beyond 2 instances?).

Attempting to Work Around the Limitation

I began looking into Python's asyncio stuff and came across the loop.run_in_executor call. The example there provides the following:

import asyncio
import concurrent.futures

def cpu_bound():
    # CPU-bound operations will block the event loop:
    # in general it is preferable to run them in a
    # process pool.
    return sum(i * i for i in range(10 ** 7))

async def main():
    loop = asyncio.get_running_loop()

    # ...

    # 3. Run in a custom process pool:
    with concurrent.futures.ProcessPoolExecutor() as pool:
        result = await loop.run_in_executor(
            pool, cpu_bound)
        print('custom process pool', result)

Cool. Theoretically I should be able to wrap the bulk of the work I do (downloading and processing) into a separate process and then wait patiently until it's done, allowing other threads to handle incoming requests (at least until the maximum number of processes in the process pool is reached).

I adapted the above to my own Python Function and have found that it doesn't work when run locally. I'm not entirely certain why, at this point. It appears to deadlock the instant it reaches the run_in_executor call. When I run it locally in VSCode using the runFunctionsHost command and then curl a request its way, I get a stream of OSError: [Errno 9] Bad file descriptor errors sprinkled with the occasional AssertionError: can only join a child process error.

I will also mention that if I simply replace the ProcessPoolExecutor with a ThreadPoolExecutor, then everything works fine.
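For reference, a rough sketch of the thread-pool variant just described, with the same cpu_bound helper repeated for completeness; only the executor class changes:

import asyncio
import concurrent.futures


def cpu_bound():
    # Same CPU-bound helper as in the snippet above.
    return sum(i * i for i in range(10 ** 7))


async def main():
    loop = asyncio.get_running_loop()

    # Identical call shape to the process-pool version, but backed by threads,
    # which avoids the fork/spawn issues seen when running locally.
    with concurrent.futures.ThreadPoolExecutor() as pool:
        result = await loop.run_in_executor(pool, cpu_bound)
        print('custom thread pool', result)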

Does anyone have any insight as to what may be going on here?

Some Responses to Previous Comments

@brettcannon

No, there's nothing special here in regards to Python and threads. The only thing to be aware of is CPython's GIL means that CPU-bound code won't see any benefit through threading, only I/O-bound code.

As a user, this type of note would have been really nice to see mentioned in the Async section of the documentation. A suggestion on what to do when you do have CPU-bound code would be further helpful! :D

@paulbatum

I guess one possibility we could consider that we haven't discussed yet is that we run multiple python worker processes. This is less efficient from a memory utilization perspective, but it would allow concurrency within a single machine without exposing the user to threadsafety issues.

I'm going to go out on a limb here and say that this is basically what I'm looking to do with my ProcessPoolExecutor attempts outlined above. This would absolutely be preferable to having no decent way for a single [likely multi-core?] machine to handle more than a single request at a time with CPU-bound code...

In Summary

My expectation when working with Azure [Python] Functions was that I could build a system that would scale based on demand, if not by handling multiple requests within a single instance, then by scaling out the fleet of running instances to handle the load (this latter part somewhat controlled by looking at active request queue depth). At present, the system appears to hobble along under 50 concurrent requests, providing little benefit over a single VM running on a single core... What can we do to speed up such workflows?

polarapfel commented 5 years ago

This is one of the most insightful issue discussions I've come across in a while. Thanks to everyone providing their insights and reasoning. I would really appreciate if part of the outcome of this issue is taking agreed upon insights and position these as guidance in the documentation for Python Azure Functions developers. Thanks!

ericdrobinson commented 5 years ago

I have done a bit more exploration/experimentation and have some findings to report that expand/correct some of what I wrote in my previous comment.

The vscode-python Extension Doesn't Like ProcessPoolExecutor

I mentioned in my previous comment that I was running into issues using ProcessPoolExecutor with my Python function. Specifically:

When I run it locally in VSCode using the runFunctionsHost command and then curl a request its way, I get a stream of OSError: [Errno 9] Bad file descriptor errors sprinkled with the occasional AssertionError: can only join a child process error.

I tested this a bit more and found that the issue wasn't on the Functions side, but on the vscode-python extension's. There were hints in the stack trace spew that came with the errors I mentioned above, but I was overwhelmed by them at the time. I have since opened a new issue (microsoft/vscode-python#4684) in the vscode-python repository with a short script that reproduces the problem.

[As a side note, what's even more strange, is that the error spew I mentioned only happens if you trigger the Python Function before the host reports Host lock lease acquired by instance ID '...'. If you wait until that text appears then you will skip any of the error reporting and experience the result reported in the bug: deadlock.]

Quirks of Using ProcessPoolExecutor with Python Functions...

Publishing a version of my Python Function that uses ProcessPoolExecutor proved that there isn't an issue with the API in and of itself. It did, however, reveal two unfortunate consequences:

  1. No log reporting from the offloaded function call. I'm not familiar enough with the environment to say for certain, but I imagine that the process to which the dispatched function gets assigned has its own reference to a distinct logger object. What this means is that the logging I expected to see about all the heavy processing is simply... gone.
  2. Overall worse performance. The code as written (see the next section) in my initial comment created and utilized a local instance of a ProcessPoolExecutor for each invocation of main(). While the calling context was able to control the number of requests handled, each call to main was blissfully unaware of the other calls. This means that fifty requests (the standard number that get triggered at once in my use case) would result in fifty ProcessPoolExecutor objects being created, each suggesting that it had processor-count "process slots" to fill. This would result in all 50 requests being triggered at the same time, producing lots of simultaneous processes that overtaxed the system (and jammed up the CPU with context switches and constant cache misses). Where I expected to see more frequent responses from the function endpoint, they came in far less frequently than they had before the process pooling implementation. More often than not, the HTTP requests would receive 502 responses indicating that the invocations had reached their time limits.

Retaining Performance with Different Code

The problem with the initial code I suggested is that it creates a function-local instance of the ProcessPoolExecutor object. A solution is to define the ProcessPoolExecutor that you use outside of the main function's context (e.g. at the module level) such that each invocation of main can refer to the same instance.

Working with this theory, I adjusted the code to look as follows:

import asyncio
import concurrent.futures

# Process Pool to be shared across invocations of the main coroutine.
singleton_pool = concurrent.futures.ProcessPoolExecutor()

def cpu_bound():
    # CPU-bound operations will block the event loop:
    # in general it is preferable to run them in a
    # process pool.
    return sum(i * i for i in range(10 ** 7))

async def main():
    loop = asyncio.get_running_loop()

    # Run in a custom process pool:
    result = await loop.run_in_executor(singleton_pool, cpu_bound)
    print('custom process pool', result)

[I freely admit that the current approach elides the "safety"(?) provided by the with statement. I would be grateful for any input on how to improve this.]
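One possible way to get some of that safety back, sketched here as a suggestion rather than something tried in this thread, is to register the pool's shutdown with atexit so its worker processes are joined when the host process exits:

import atexit
import concurrent.futures

# Module-level pool shared across invocations, as above.
singleton_pool = concurrent.futures.ProcessPoolExecutor()

# Join the pool's worker processes at interpreter exit, which is roughly the
# cleanup the elided `with` block would otherwise have provided per call.
atexit.register(singleton_pool.shutdown)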

With this setup, not only did the performance improve, but it was better than without any process pooling at all! Rough estimates put the speedup I'm seeing at roughly 2x for the given workflow (average of 50 requests with varying targets). For the record, some quick logging showed that my Python Function was being executed on a system sporting 4 cores.

Where This May Cause An Issue...

The added performance I'm seeing is great. I fear, however, that I've set myself a trap with this approach. If the parent Function's "feeder" process sees that this machine is accepting as many requests as can be handed to it, then it may not "perceive" the need to scale out as it perhaps should based on resource utilization (mem/cpu). I am not certain which takes precedence here.

I just found the Scalability Best Practices documentation, which may provide more insights and workarounds to this concern.

I would very much appreciate any comments/advice from the development team on how to best handle this!

ericdrobinson commented 5 years ago

I've a bit more information to share based on more experience playing around with the async version of my function (as reported in my last few comments).

It looks as though the ProcessPoolExecutor approach may not be the performance salve that I previously reported. In testing, when I run my function in a "worst case" context, it appears that the non-async version actually runs far faster than the async version. The "worst case" context has the following characteristics:

  1. Compressed file downloads of ~10-30MB.
  2. Uncompressed file size of ~120-350MB (which is then actually processed).

I'm not sure at this point if the problem is that the resources on the host system are being overloaded with such taxing workloads (especially simultaneously) or if there's some other issue at play here (e.g. if the core ProcessPoolExecutor communication is somehow slower than anticipated due to internal signaling mechanisms? [doubtful]).

All of the above said, with more "average" workloads (or those that are cached already), the async version does appear to respond much faster. Perhaps it really is resource contention? If it was resource contention, however, my expectations would be that the system would automatically "Scale Out". (I've asked about how to verify scale out with Python functions in another issue. See: #359.)

Does anyone have any insight or suggestions for determining what the root cause of apparent slowdown might be? Is there a way to monitor CPU/memory usage of active functions in Azure?

polarapfel commented 5 years ago

Hey everyone,

I am fairly new to Python, but familiar with Java and how multi-threading works on the JVM. I've done some reading on Python and how the GIL affects multi-threading because I was not aware of that limitation in CPython.

I am quoting from Julien Danjou's "Serious Python" book:

In the second scenario, you might want to start a new thread for each new request instead of handling them one at a time. This may seem like a good use for multithreading. However, if you spread your workload out like this, you will encounter the Python global interpreter lock (GIL), a lock that must be acquired each time CPython needs to execute bytecode. The lock means that only one thread can have control of the Python interpreter at any one time. This rule was introduced originally to prevent race conditions, but it unfortunately means that if you try to scale your application by making it run multiple threads, you’ll always be limited by this global lock.

So, while using threads seems like the ideal solution, most applications running requests in multiple threads struggle to attain 150 percent CPU usage, or usage of the equivalent of 1.5 cores. Most computers have 4 or 8 cores, and servers offer 24 or 48 cores, but the GIL prevents Python from using the full CPU. There are some initiatives underway to remove the GIL, but the effort is extremely complex because it requires performance and backward compatibility trade-offs.

To me, this seems like a pretty steep hurdle to meet the expectation that Azure Functions will scale out instances of a function within a Function App to make the most efficient resource allocation per billable compute unit (in either billing plan). It seems to me like threading should be called out as a no-go area from the get-go!

From what I have gathered by reading up in various recent publications on Python concurrent computing, the advice points to multi-processing as opposed to multi-threading, using the multiprocessing module. This essentially works by forking the execution process. So if the execution time of any one job is large enough to warrant the cost of the fork() operation in relation, this gets much better CPU utilization.

If each job is stateless and there is no inter-process communication going on between the jobs on different processes, that should fit the programming model of Azure Functions well, right? Will the Function host allow forking new processes with the multiprocessing module from within a function in a Function App? Does this depend on how many other Functions are defined within the same Function App in a similar manner? How will Azure manage auto-scaling of the underlying compute hosts based on CPU utilization through multi-processing?

I'm just getting started with Python in general and the Python implementation in Azure Functions v2, so I am most certainly missing a lot of points. It would be great if someone with more Python knowledge could elaborate.

Thanks,

Tobias W.

ericdrobinson commented 5 years ago

@polarapfel Did you read through my previous two follow-up comments [1][2]? Do those help answer your questions?

Are you asking about the current Thread Pool approach used by the azure-function-python-worker to hand requests off to a given function? Or are you asking whether the multiprocessing module can be used within a function?

polarapfel commented 5 years ago

@ericdrobinson Thanks for your responses.

I think I am trying to understand whether multi-threading in Python is, in general, a legitimate way to introduce true parallelism to execution (from what I've read, the answer seems to be no), and if that's the case, why use multi-threading on Azure Functions within a single execution host other than to avoid blocking calls. I've compared how Azure Functions with Javascript works, and the documentation for Node/Javascript on Azure Functions specifically advises choosing single-vCPU App Service plans. My guess is, that with Node/Javascript, there also is no true multi-threading, the illusion of concurrency is achieved through the event loop. There is literally no benefit of using multiple cores then and parallelism is achieved on Azure Functions by letting Azure scale to any number of independent Function App invocations on as many single vcore hosts as needed. The same probably applies to Python in that same context?

As to the multiprocessing module within a function: assuming I do choose an App Service plan with a host SKU that has multiple cores, utilizing them with multi-threading won't work. Let's say I have a CPU intensive task on a queue of items (that I want to batch) that benefits from parallel execution, each atomic task is complex enough that its execution is way more expensive than forking a process and tasks do not need to communicate with each other. In that case, I would gain by being able to use the multiprocessing module, right?

In the end, my ask comes down to this: the outcome of this ticket should be thorough guidance for Python developers where to implement parallelism within a function and where to rely on the (auto)-scaling of Azure Function. And as a second part, when parallelism is implemented within a function, providing some detailed guidance as to which approaches work and which won't.

ericdrobinson commented 5 years ago

@polarapfel Some followup responses:

My guess is, that with Node/Javascript, there also is no true multi-threading, the illusion of concurrency is achieved through the event loop.

The latest version of Node/JavaScript does support multithreading. Please see the Worker Threads module.

That said, JavaScript processing is a typically single-threaded thing. The asynchronous processing happens thanks to the Event Loop.

There is literally no benefit of using multiple cores then and parallelism is achieved on Azure Functions by letting Azure scale to any number of independent Function App invocations on as many single vcore hosts as needed.

Not really, no. You can do advanced things with NodeJS that would allow you to take advantage of multiple cores from within a single function instance. While I've not encountered the documentation you're referring to, my guess is that the general wisdom is that most common workloads that you perform with NodeJS have no need for multithreading and therefore more cores are simply wasted.

The same probably applies to Python in that same context?

To some extent, sure. The reasons for this are entirely different but multithreading and multiprocessing are viable options in Python. You just need to be judicious with which you invoke.

As to the multiprocessing module within a function: assuming I do choose an App Service plan with a host SKU that has multiple cores, utilizing them with multi-threading won't work.

That isn't true. That's true IF your workload is CPU-bound (e.g. lots of maths). However, IF your workload is IO-bound (e.g. lots of networking/file IO/etc.) then multithreading will serve you well. You just need to be very careful to ensure that your functions adhere to the model outlined in the documentation.

Let's say I have a CPU intensive task on a queue of items (that I want to batch) that benefits from parallel execution, each atomic task is complex enough that its execution is way more expensive than forking a process and tasks do not need to communicate with each other. In that case, I would gain by being able to use the multiprocessing module, right?

Yes. And you CAN use the multiprocessing module in an Azure function. Debugging such a setup in VSCode appears to be broken at present but running them live does work (as I reported here).

In the end, my ask comes down to this: the outcome of this ticket should be thorough guidance for Python developers where to implement parallelism within a function and where to rely on the (auto)-scaling of Azure Function.

I wholeheartedly agree! :D

And as a second part, when parallelism is implemented within a function, providing some detailed guidance as to which approaches work and which won't.

I agree a little less, maybe? Microsoft shouldn't have to educate people on how Multiprocessing/Threading in Python works. The Python documentation covers most of that. The guidance I would hope to see would be at the level of: "If you have a computationally-intensive task, structure your function like this. If you have an IO-intensive task, structure your function like that. Please see the Python documentation for more on these topics."

polarapfel commented 5 years ago

Hey @ericdrobinson,

here is an example (taken from "Serious Python") of a CPU-intensive workload with no I/O:

import random
import threading

results = []

def compute():
    # Each thread appends the sum of one million random ints to the shared list.
    results.append(sum(
        [random.randint(1, 100) for i in range(1000000)]))

# Spawn 8 threads and wait for them all to finish.
workers = [threading.Thread(target=compute) for x in range(8)]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()
print("Results: %s" % results)

Running this on an idle CPU with 4 cores looks like this:

$ time python worker.py
Results: [50517927, 50496846, 50494093, 50503078, 50512047, 50482863,50543387, 50511493]
python worker.py  13.04s user 2.11s system 129% cpu 11.662 total

This means that out of the 4 cores, only 32 percent (129/400) were used.

The same workload rewritten for multiprocessing:

import multiprocessing
import random

def compute(n):
    return sum(
        [random.randint(1, 100) for i in range(1000000)])

# Start 8 workers
pool = multiprocessing.Pool(processes=8)
print("Results: %s" % pool.map(compute, range(8)))

Executed on the same idle CPU with 4 cores:

$ time python workermp.py
Results: [50495989, 50566997, 50474532, 50531418, 50522470, 50488087, 0498016, 50537899]
python workermp.py  16.53s user 0.12s system 363% cpu 4.581 total

It results in more than 90% CPU usage (363/400) and a 60% reduction in execution time.

To me, that is pretty compelling evidence that multi-threading does not help with parallel compute tasks in Python - even for CPU only tasks without any IO.

ericdrobinson commented 5 years ago

To me, that is pretty compelling evidence that multi-threading does not help with parallel compute tasks in Python - even for CPU only tasks without any IO.

Sure. But what is your point, exactly? That the Python Function invocations themselves should be spawned into a Process pool rather than a Thread pool?

I'm asking because even with the current model, I believe that you can access those 4 cores from a single Python Function invocation with your own Process Pool implementation. Does that make sense?

polarapfel commented 5 years ago

Sure. But what is your point, exactly? That the Python Function invocations themselves should be spawned into a Process pool rather than a Thread pool?

I guess I have more questions than points here. :)

I'm asking because even with the current model, I believe that you can access those 4 cores from a single Python Function invocation with your own Process Pool implementation. Does that make sense?

I guess it goes back to your point that, when it comes to documentation, the expectation can be that Python developers already know about parallel computing as far as generic Python language and run-time features go; but while that can be the expectation, does that mean it should be?

Take these two excellent articles on the subject for example:

Data and chunk sizes matter when using multiprocessing.Pool.map() in Python
In Python, choose builtin process pools over custom process pools

These are educational in their own right. But applying that knowledge to writing a serverless workflow really depends. It depends on the nature of the processing work, queuing strategies, which sandbox limits are relevant, what App Service plan you're on, which host SKU you choose and the list goes on. I think we need detailed guidance in writing if we want the average developer to get the most out of Azure Functions.

brettcannon commented 5 years ago

@ericdrobinson @polarapfel as a Python core developer I can tell you that under CPython -- which is what Azure Functions runs under -- you will not gain anything from threads for a CPU-bound workload. The only benefit to threads under CPython is when I/O is blocking in a thread, allowing another thread to proceed (which is the same effect as using async/await except more explicitly).

maiqbal11 commented 5 years ago

Work completed. Pending documentation tracked here: https://github.com/Azure/azure-functions-python-worker/issues/471.

ericdrobinson commented 5 years ago

Can't wait to see that documentation!!

Also, a quick update on one of my last "reports":

It looks as though the ProcessPoolExecutor approach may not be the performance salve that I previously reported. In testing, when I run my function in a "worst case" context, it appears that the non-async version actually runs far faster than the async version.

It turns out this was caused by the processing I mentioned in step 3.ii of My Use Case. The "processing" that I do actually involves running ffmpeg to decode a file. The decoding itself is multi-threaded. Attempts to decode multiple files at the same time invoke multiple ffmpeg processes, each spawning multiple threads. This leads to an overall degradation of performance.

In my specific case, we "worked around" this limitation by simply restricting the "heavy processing"... process... to a single instance. I should note that I have yet to try the -threads option in FFMPEG to see if we get any gains by opening things up a bit.

Regardless, I'm very keen on seeing what work came of this!

tomhosking commented 5 years ago

@ericdrobinson Thanks for all your investigation work on this!

Following on from your approach here, I had 2 questions:

1) How were you able to use asyncio.get_running_loop() given that it was added in 3.7, but the current Azure worker is 3.6?

2) Did you find a way of "re-attaching" logging from the other processes to the main Azure logger?

ericdrobinson commented 5 years ago

@tomhosking Answers below:

How were you able to use asyncio.get_running_loop() given that it was added in 3.7, but the current Azure worker is 3.6?

I used the following:

loop = asyncio.get_event_loop()

Did you find a way of "re-attaching" logging from the other processes to the main Azure logger?

Unfortunately, no. If memory serves I might have been able to watch the standard output stream when testing the function on my local machine. I also recall creating a specific entry in the dictionary returned by the process that included any debug output (all functionality was wrapped in a try-except block just in case). That info would then get sent to the main process log for inspection. It was a kludgey workaround but it got the job done.
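Roughly, that kludge might look like the following sketch; the helper names here are hypothetical, not the actual code:

import logging
import traceback


def do_processing(payload):
    # Placeholder for the real CPU-heavy work (e.g. decoding a file).
    return len(str(payload))


def heavy_work(payload):
    # Runs inside the process pool. The pooled process has no handle on the
    # function app's logger, so log lines are collected and returned instead.
    log_lines = ['starting heavy work']
    try:
        result = do_processing(payload)
        log_lines.append('finished heavy work')
        return {'result': result, 'log': log_lines}
    except Exception:
        log_lines.append(traceback.format_exc())
        return {'result': None, 'log': log_lines}


# Back in the invocation (main process), after awaiting run_in_executor:
#   outcome = await loop.run_in_executor(singleton_pool, heavy_work, payload)
#   for line in outcome['log']:
#       logging.info(line)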

Would definitely love a better solution to this...

tomhosking commented 5 years ago

Ah, makes sense! Thanks.

It seems like, as per this PR, setting FUNCTIONS_WORKER_PROCESS_COUNT > 1 does indeed enable multiple workers, and they seem to behave as I would expect (i.e. multiple requests handled, logging works) from very brief testing.

ericdrobinson commented 5 years ago

Ooooh, awesome! This entire approach may not even be necessary anymore!!

Will have to check it out at some point. Our limiting factor turned out to be ffmpeg eating as many threads as it could grab. As such there's little to be gained from trying to run multiple ffmpeg processes on the same CPU[set]...

abkeble commented 5 years ago

We have been looking at this thread to find similar answers, and we were expecting to be able to run multiple calls to the same function in parallel. Simply setting the FUNCTIONS_WORKER_PROCESS_COUNT to the maximum of 10 still seems very limited as we would want to be able to scale to 100s of functions running in parallel.

paulbatum commented 5 years ago

@abkeble I think you've misunderstood what this setting does exactly so let me clarify. The short answer is that the Azure Functions platform can scale your app to run tens of thousands of concurrent executions.

The design of how the functions infra calls into your python process to run your code absolutely allows for concurrent executions within a single python process. However, if the code is written using APIs that block, then concurrent execution within that process is prevented. This thread has been about how we can make it so that the system naturally uses available resources even in cases where users write this type of code.

The approach we've come up with is to provide support for running multiple python processes on a single VM. It's a setting that you can use directly today, but in the future we'll make the system smart enough to dynamically adjust. If you have an I/O-bound workload and write good async python functions, it's wasteful to run multiple python processes, so in that case the system would stick with just one process per VM. We've aggressively set a low limit on this setting (currently 10) because it's very easy to hit the natural memory limits of Azure Functions as you create additional separate python processes.

Our scale out architecture is based on running your function across many VMs. Your functions can scale to hundreds of VMs to get the needed level of concurrency.

ericdrobinson commented 5 years ago

Your functions can scale to hundreds of VMs to get the needed level of concurrency.

@paulbatum No, they actually can't. At least today. Azure Functions for Python is still in Preview (see the Note at the top of this page).

Last time I checked/read, we are currently limited to 2 concurrent VMs. The maximum number of concurrent processes, then, is 20.

Any word on how much longer AFfP will be in Preview? Is it getting close, at least?

paulbatum commented 5 years ago

@ericdrobinson For the linux consumption plan, we have a deployment in progress that will increase the limit on concurrent VMs to 20. Rough ETA for global deployment of this change is 7/22. We expect to continue to increase this limit in subsequent deployments. I can't provide any specifics around when the python offering will exit preview, but yes, we are getting much closer.

ericdrobinson commented 5 years ago

@paulbatum That's excellent news! Excited to see it happen!

Will this perhaps coincide with a fix for #359?

abkeble commented 5 years ago

@paulbatum I don't believe this has been released yet, is there any update on when we can expect this? Thanks!

paulbatum commented 5 years ago

@abkeble Are you referring to the change of how many VMs you can concurrently execute on? Because my understanding is that is now live everywhere. If you're not seeing the behavior you're expecting, can you file a new issue, include app name and timestamp, and then mention me or leave a link here?

balag0 commented 5 years ago

@abkeble How are you verifying the number of scale out instances? If it is through app insights, we have a known issue where app insights always shows 1 server instance live - https://github.com/Azure/azure-functions-python-worker/issues/359 The fix for that issue will begin deploying later this week.

If you can share the sitename and timestamps, I can also make sure there are no other issues with the scale out.

abkeble commented 5 years ago

@balag0 @paulbatum We were looking at the app insights value which is what misled us. We are seeing an increased number of VMs now, however this is only looking like 3 or 4 in total from the 1 or 2 we were seeing previously. @balag0 Would you be able to confirm this? Timestamp: 2019-07-31 12:59:51.344727 BST Sitename: https://tempnwg.azurewebsites.net/

We are testing the scale out by running a function that sleeps for 5 seconds. We run this function (Named SlowFunction) 20 times concurrently and are getting batches of 3 or 4 responses every 5/6 seconds. This is what leads us to believe there are only 3 or 4 instances running. Whereas we would expect the function calls to be returning in batches of 20 if the consumption plan had scaled out to 20 instances. We have set FUNCTIONS_WORKER_PROCESS_COUNT to 1 to simplify the testing.

ericdrobinson commented 5 years ago

For anyone interested, the FUNCTIONS_WORKER_PROCESS_COUNT app setting is documented here.