Parsl / parsl

Parsl - a Python parallel scripting library
http://parsl-project.org
Apache License 2.0

Use parsl/HTEX within uvicorn web server #2452

Closed tcompa closed 2 years ago

tcompa commented 2 years ago

Context and brief summary

While working on fractal-server, we recently noticed an unexpected behavior: when running parsl (with the HTEX executor) within a uvicorn web server, some part of the DataFlowKernel.cleanup method would shut down our active server. We discovered that the error is not related to parsl, but rather to the use of multiprocessing from within a uvicorn server. Below you can find a minimal reproducible example and our fix, based on a small change to parsl/executors/high_throughput/interchange.py. Since this "bug" is not related to parsl, we are not asking for any change here. This issue is just to document our findings/tests, and it can be closed right away.

This is mostly work done with @mfranzon, and the corresponding issue in our repo is here.

Minimal reproducible example

We set up a fresh Python 3.8.13 environment with fastapi 0.85.0, uvicorn 0.18.3 and parsl 2022.10.10. We use the following main.py script

from fastapi import FastAPI
from parsl.dataflow.dflow import DataFlowKernel
from parsl.config import Config
from parsl.app.python import PythonApp
from parsl.executors import HighThroughputExecutor
import math

app = FastAPI()

@app.get("/")
async def root():

    htex = HighThroughputExecutor()
    conf = Config(executors=[htex])
    dfk = DataFlowKernel(config=conf)
    app = PythonApp(
        math.cos,
        data_flow_kernel=dfk
    )
    res = app(1).result()
    dfk.cleanup()
    return {"message": f'{res}'}

which we run from the command line via

uvicorn main:app

Now our web server is active, and any GET request will trigger one execution of root(). When we call this endpoint via

curl 127.0.0.1:8000/

the uvicorn server receives some signal that shuts it down, and the logs (in the terminal) show

INFO:     127.0.0.1:52626 - "GET / HTTP/1.1" 200 OK
INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [71313]

The server is not reachable any more, which is a critical problem for our use case (fractal-server should open and close several DFKs with HTEX executors, and obviously it should remain active).

Source of the problem

We realized that the problem comes from using multiprocessing from within a uvicorn server. This is explained by @Mixser in comments on the related fastapi/uvicorn issues.

Quoting from those comments:

We have a master process (uvicorn) that listens for and accepts requests on the port. During server initialization, uvicorn sets up signal listeners (by using asyncio.add_signal_handler calls) for graceful shutdown of the application. But it uses file descriptor event notification instead of the default approach (see https://docs.python.org/3/library/signal.html#signal.set_wakeup_fd). After initialization, we have an open socket (the one passed to the signal.set_wakeup_fd call) and our main process waits for data from this socket. Any received data will be interpreted as a signal.

Next, we create a new process in our HTTP handler; this new process inherits that open socket (open file descriptor) and keeps using it for signal notification.

When we send a signal to the child, it goes to this open socket. But our parent process listens on this socket too, so it receives a signal to terminate and shuts down the application.

How to avoid this: we need to restore the default behavior of signal handlers for the child process and not use the fd inherited from the parent. We can achieve this by calling signal.set_wakeup_fd(-1).
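To make the quoted mechanism concrete, here is a minimal standalone sketch (our own illustration, independent of uvicorn and parsl) of how a forked child that inherits the parent's wakeup fd ends up writing its own signal into the parent's self-pipe; a socketpair stands in for the fd that asyncio.add_signal_handler registers:

import os
import signal
import socket
import time
from multiprocessing import get_context


def child():
    # Mimic what the child inherits from uvicorn after fork: a Python-level
    # handler for SIGTERM, plus the parent's wakeup fd still being active.
    signal.signal(signal.SIGTERM, lambda signum, frame: None)
    time.sleep(5)


if __name__ == "__main__":
    # Parent: emulate the self-pipe set up via signal.set_wakeup_fd.
    r, w = socket.socketpair()
    r.setblocking(False)
    w.setblocking(False)
    signal.set_wakeup_fd(w.fileno())

    p = get_context("fork").Process(target=child)
    p.start()
    time.sleep(1)

    # Send SIGTERM to the *child* only.
    os.kill(p.pid, signal.SIGTERM)
    time.sleep(1)

    # The child's C-level signal handler wrote the signal number to the
    # inherited wakeup fd, so the parent now reads it as if it had received
    # the signal itself.
    try:
        print("parent wakeup fd received:", list(r.recv(16)))  # e.g. [15]
    except BlockingIOError:
        print("parent wakeup fd received nothing")

    p.kill()
    p.join()

In the uvicorn case the role of the parent's read loop is played by the event loop, which interprets that byte as a shutdown request.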

The reason this appears in our example is that the parsl HTEX executor does use a ForkProcess, in https://github.com/Parsl/parsl/blob/dfcbff8f633f33aa2aeca68ba5b13bfd66bc223a/parsl/executors/high_throughput/executor.py#L477, where ForkProcess is simply the one defined in https://github.com/Parsl/parsl/blob/dfcbff8f633f33aa2aeca68ba5b13bfd66bc223a/parsl/multiprocessing.py#L15.
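For reference, ForkProcess there is roughly (a paraphrase, not the exact parsl source) a multiprocessing Process bound to the "fork" start method, which is why the child inherits the parent's file descriptors and signal state:

import multiprocessing

# Approximate paraphrase of parsl/multiprocessing.py: a Process class tied
# to the "fork" start method.
ForkProcess = multiprocessing.get_context("fork").Process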

Workaround (by patching parsl)

Based on the fastapi/uvicorn issues above, we modified the parsl HTEX interchange by adding the lines

    import signal
    # Detach this child from the wakeup fd inherited from the uvicorn parent
    signal.set_wakeup_fd(-1)
    # Restore default SIGTERM/SIGINT handling in this child only
    signal.signal(signal.SIGTERM, signal.SIG_DFL)
    signal.signal(signal.SIGINT, signal.SIG_DFL)

at the beginning of the starter function. This is visible in our fork: https://github.com/fractal-analytics-platform/parsl/blob/1.3.1-dev/parsl/executors/high_throughput/interchange.py. So far we have not noticed any side effects of this change in our use of parsl, but we cannot say whether it will affect parsl in some other way.
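For completeness, the same reset can in principle be applied at the top of any forked child's entry point, without patching parsl; the following is only a generic sketch with illustrative names, not parsl code:

import signal
from multiprocessing import get_context


def child_main():
    # Same idea as the patch above, applied in this child only: detach from
    # the wakeup fd inherited from the parent and restore default
    # SIGTERM/SIGINT dispositions, then do the child's real work.
    signal.set_wakeup_fd(-1)
    signal.signal(signal.SIGTERM, signal.SIG_DFL)
    signal.signal(signal.SIGINT, signal.SIG_DFL)
    ...  # child work goes here


if __name__ == "__main__":
    p = get_context("fork").Process(target=child_main)
    p.start()
    p.join()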

Tests with recent parsl functionalities

As of https://github.com/Parsl/parsl/pull/2433, we can specify the start_method parameter for an HTEX. This concerns the workers, rather than the interchange, and so it does not affect our problem here. Just for completeness, we also tested the same example with the start_method argument added to HighThroughputExecutor: independently of our choice (fork, spawn, thread), the problem remained. This is not surprising, since that PR only modifies what takes place in process_worker_pool.py.
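Concretely, the variant we tested only changes the executor construction in the example above (a sketch; we assume the parameter accepts the string values named in that PR, and the rest of main.py is unchanged):

# Only the executor line changes; the interchange is still launched via
# ForkProcess, so the shutdown problem persists for every choice below.
htex = HighThroughputExecutor(start_method="spawn")  # also tried "fork" and "thread"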

One could try replacing ForkProcess with SpawnProcess or Thread in https://github.com/Parsl/parsl/blob/dfcbff8f633f33aa2aeca68ba5b13bfd66bc223a/parsl/executors/high_throughput/executor.py#L477. Preliminary tests led to a frozen server (when using SpawnProcess) and to other compatibility issues when using Thread (AttributeError: 'Thread' object has no attribute 'terminate'), and we did not investigate further.

Summary

We found a way to use parsl/HTEX from within a uvicorn server, and for now we are using it through our parsl fork. If this issue is solved upstream (in uvicorn), we'll gladly switch back to parsl mainline. This issue is here just in case others run into the same problem.

benclifford commented 1 year ago

Cross-referencing issue #2628 and PR #2629, which deal with a different but related signal handling problem (with PR #2629 applying a similar fix to the one mentioned above: ensuring that the SIGTERM handler is not inherited from the parent process). This came from a situation where parsl was used inside Globus Compute, and the test suite there was (in other situations) installing a signal handler that was unexpectedly inherited by the interchange.

In general, the interchange should expect that the parent process may have set whatever signal handlers it wants, because it is the user's main process.