DataFlowKernel.cleanup() triggers server shutdown

tcompa commented 1 year ago

After https://github.com/fractal-analytics-platform/fractal-server/pull/88, we would like to cleanup all DFKs after a workflow is complete, but this leads to server shutdown:

2022-09-30 10:23:16,229 INFO sqlalchemy.engine.Engine [cached since 9.306s ago] (2,)
INFO:     127.0.0.1:60020 - "GET /api/v1/dataset/1/2 HTTP/1.1" 200 OK
2022-09-30 10:23:16,230 INFO sqlalchemy.engine.Engine ROLLBACK
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [49824]

The dfk.cleanup() line is currently commented out, but we need to solve this issue to avoid proliferation of processes.

tcompa commented 1 year ago

Some additional info:

Possible cause 1: dfk.cleanup() triggers FlowControl._thread.join()
Possible cause 2: dfk.cleanup() triggers HighThroughputExecutor.shutdown, which triggers self.interchange_proc.terminate(). Maybe change fork to spawn in https://github.com/Parsl/parsl/blob/master/parsl/multiprocessing.py#L15? To check

tcompa commented 1 year ago

Also possibly interesting:

benclifford commented 1 year ago

from a parsl developers perspective, i'm interested in understanding what's going on here: what process is your application server runing in that is getting incorrectly shut down? the main submitting/workflow process or one of the htex worker processes?

tcompa commented 1 year ago

Our application lets users define some workflows (NOTE: "workflow" refers to some custom objects, rather than parsl workflows), which are then executed on a SLURM cluster. We currently use parsl for SLURM integration, and for some very basic dependency handling. The application runs as a server (based on fastapi and served via uvicorn) on the SLURM main node, and then another (client) application lets users interact with it from their local machines.

For each workflow, we'd like our server application to do the following:

Create a DFK instance;
Define a series of python app (associated to that DFK);
Wait for execution (via the result() method of the last app future);
Close that DFK (with its cleanup() method).

Since multiple users can define and submit multiple workflows at the same time, we'd like to have one DFK per worfklow, rather than a single global one. This also helps in getting a clean monitoring, since (in our understanding) parsl-visualize shows an entry per DFK - meaning that all our users' workflows would get mixed up if we use a single DFK.

We implemented our approach and in principle we are satisfied with its modularity (start workflow -> start DFK, end workflow -> close DFK and related processes), up to the problem mentioned in this issue. The main server application calls a cleanup() method for some active DFK, and then the application itself receives a shutdown. The uvicorn shutdown is this one https://github.com/encode/uvicorn/blob/b22c9506b282e9c81591ed7f48781dc5759df0e0/uvicorn/lifespan/on.py#L63, but we still don't know which part of the DFK cleanup() triggers it. We also need to debug a bit further, to see whether this always happens (i.e. at each cleanup), or is related to some more specific conditions.

(probably @jacopo-exact can chime in in the following days with some more details)

mfranzon commented 1 year ago

We (me and @tcompa) realized a minimal example that reproduce the error:

from fastapi import FastAPI
from parsl.dataflow.dflow import DataFlowKernel
from parsl.config import Config
from parsl.app.python import PythonApp
from parsl.executors import HighThroughputExecutor
import math

app = FastAPI()

@app.get("/")
async def root():

    htex = HighThroughputExecutor()
    conf = Config(executors=[htex])
    dfk = DataFlowKernel(config=conf)
    app = PythonApp(
        math.cos,
        data_flow_kernel=dfk
    )
    res = app(1).result()
    dfk.cleanup()
    return {"message": f'{res}'}

tcompa commented 1 year ago

With @mfranzon we narrowed down the error a bit further, and now have a minimal reproducible example. We also explored recent parsl functionalities, with no success.

How to reproduce

Save code from previous comment in main.py.
Install fastapi (we pip-installed version 0.78.0), which also installs the web server uvicorn (we have 0.18.3).

From the command line, run uvicorn main:app. The server is now up and running, as seen in the logs on the terminal

INFO:     Started server process [71313]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)

Call the root endpoint from the command line, via curl 127.0.0.1:8000/. This leads to the correct execution of the parsl app, but then shuts down the server (see logs below), and from now on we cannot call the endpoint any more.
```
INFO:     127.0.0.1:52626 - "GET / HTTP/1.1" 200 OK
INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [71313]
```

Where we stand

The statement triggering the shutdown is dfk.cleanup() (this is clear because if we remove that line then the server is not shut down upon calling the endpoint). The cleanup method internally calls:

FlowControl._thread.join() and FlowControl._kill_event.set().
HighThroughputExecutor.interchange_proc.terminate(), where HighThroughputExecutor.interchange_proc is a ForkProcess.

Our trivial guess is that one of these signals propagates to the process running the server, triggering its shutdown. It is unclear to us why/how this happens. We are digging further around the shutdown on the server side, to see what is going on. A very relevant issue is

https://github.com/tiangolo/fastapi/issues/1487

ThreadPoolExecutor does work as expected

If we replace HighThroughputExecutor with ThreadPoolExecutor, we obtain the expected behavior (the endpoint function reaches its end without shutting down the server, and we can call it multiple times).

Further checks with recent parsl

We also installed parsl=2022.10.10, to try out PR https://github.com/Parsl/parsl/pull/2433, where HighThroughputExecutor has a start_method attribute taking values in ['fork', 'spawn', 'thread']. None of the three possible choices for start_method solves our issue (i.e. the server is always shut down by dfk.cleanup()).

tcompa commented 1 year ago

@mfranzon finally found the explanation, see:

The issue is due to a combination of uvicorn (our web server) and multiprocessing (which is part of parsl HTEX). And it is related to the HTEX interchange, rather than to the FlowControl.

Quoting from https://github.com/encode/uvicorn/issues/548#issuecomment-1157082729

asyncio setups signal handler in a specific way -- it calls signal.set_wakeup_fd and passes an fd of the opened socket. After it, if any signals are sent to the process, they will be written to this socket/fd.

Any child process will inherit not only signal handlers' behavior but an opened socket. And as a result, when we are sending a signal to the child process, it will be written to the socket and the parent process will receive it too, even though this signal was sent not to him; Or if you will send it to the parent process, the child process will receive this signal too;

To solve the issue in our minimal reproducible example, it is sufficient to add the lines

import signal
signal.set_wakeup_fd(-1)
signal.signal(signal.SIGTERM, signal.SIG_DFL)
signal.signal(signal.SIGINT, signal.SIG_DFL)

at the beginning of the starter function in parsl/executors/high_throughput/interchange.py: https://github.com/Parsl/parsl/blob/master/parsl/executors/high_throughput/interchange.py#L588-L599

We are now looking at the least risky way of adding this change (overriding the starterfunction, or forking parsl).

tcompa commented 1 year ago

fractal-server now uses https://github.com/fractal-analytics-platform/parsl/tree/1.3.1-dev, where we problem is solved as described in the previous comment. It's not obvious whether this is is something that may end up in upstream parsl. On the one hand, such change is not related to a parsl bug but rather to the mix of uvicorn and multiprocessing. On the other hand, having this change in parsl would enable using it within a uvicorn web server (not sure of unintended consequences though). I'll link to this issue on parsl slack channel, for information.

Closing.

tcompa commented 1 year ago

A comprehensive summary of this issue is now available in

https://github.com/Parsl/parsl/issues/2452

bitsofinfo commented 1 year ago

@tcompa your post above made my day, unrelated project/code... but it finally put an end to like 16hours of debugging a fastapi crash issue

fractal-analytics-platform / fractal-server