Closed: tcompa closed this issue 1 year ago.
Some additional info:

- `dfk.cleanup()` triggers `FlowControl._thread.join()`;
- `dfk.cleanup()` triggers `HighThroughputExecutor.shutdown`, which triggers `self.interchange_proc.terminate()`.

Maybe change `fork` to `spawn` in https://github.com/Parsl/parsl/blob/master/parsl/multiprocessing.py#L15? To check.

From a parsl developer's perspective, I'm interested in understanding what's going on here: what process is your application server running in that is getting incorrectly shut down? The main submitting/workflow process, or one of the htex worker processes?
Our application lets users define some workflows (NOTE: "workflow" refers to some custom objects, rather than parsl workflows), which are then executed on a SLURM cluster. We currently use parsl for SLURM integration, and for some very basic dependency handling. The application runs as a server (based on fastapi and served via uvicorn) on the SLURM main node, and then another (client) application lets users interact with it from their local machines.
For each workflow, we'd like our server application to do the following:
Since multiple users can define and submit multiple workflows at the same time, we'd like to have one DFK per workflow, rather than a single global one. This also helps keep monitoring clean, since (in our understanding) parsl-visualize shows one entry per DFK, meaning that all our users' workflows would get mixed up if we used a single DFK.
We implemented our approach and in principle we are satisfied with its modularity (start workflow -> start DFK, end workflow -> close DFK and related processes), up to the problem described in this issue. The main server application calls the `cleanup()` method of some active DFK, and then the application itself receives a shutdown. The uvicorn shutdown is this one: https://github.com/encode/uvicorn/blob/b22c9506b282e9c81591ed7f48781dc5759df0e0/uvicorn/lifespan/on.py#L63, but we still don't know which part of the DFK `cleanup()` triggers it. We also need to debug a bit further, to see whether this always happens (i.e. at each `cleanup()`) or is related to some more specific conditions.
(probably @jacopo-exact can chime in in the following days with some more details)
We (me and @tcompa) put together a minimal example that reproduces the error:
```python
import math

from fastapi import FastAPI
from parsl.app.python import PythonApp
from parsl.config import Config
from parsl.dataflow.dflow import DataFlowKernel
from parsl.executors import HighThroughputExecutor

app = FastAPI()


@app.get("/")
async def root():
    htex = HighThroughputExecutor()
    conf = Config(executors=[htex])
    dfk = DataFlowKernel(config=conf)
    # (local variable renamed to cos_app, to avoid shadowing the FastAPI app)
    cos_app = PythonApp(math.cos, data_flow_kernel=dfk)
    res = cos_app(1).result()
    dfk.cleanup()
    return {"message": f"{res}"}
```
With @mfranzon we narrowed down the error a bit further, and now have a minimal reproducible example. We also explored recent parsl functionalities, with no success.

We save the code above as `main.py`. We use fastapi (we pip-installed version 0.78.0), which also installs the web server uvicorn (we have 0.18.3). We start the server with `uvicorn main:app`. The server is now up and running, as seen in the logs on the terminal:
```
INFO: Started server process [71313]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
```
We then call the `root` endpoint from the command line, via `curl 127.0.0.1:8000/`. This leads to the correct execution of the parsl app, but then shuts down the server (see logs below), and from that point on we cannot call the endpoint any more.
```
INFO: 127.0.0.1:52626 - "GET / HTTP/1.1" 200 OK
INFO: Shutting down
INFO: Waiting for application shutdown.
INFO: Application shutdown complete.
INFO: Finished server process [71313]
```
The statement triggering the shutdown is `dfk.cleanup()` (this is clear because, if we remove that line, the server is not shut down upon calling the endpoint). The `cleanup` method internally calls:

- `FlowControl._thread.join()` and `FlowControl._kill_event.set()`;
- `HighThroughputExecutor.interchange_proc.terminate()`, where `HighThroughputExecutor.interchange_proc` is a `ForkProcess`.

Our trivial guess is that one of these signals propagates to the process running the server, triggering its shutdown. It is unclear to us why/how this happens. We are digging further around the shutdown on the server side, to see what is going on. A very relevant issue is
If we replace `HighThroughputExecutor` with `ThreadPoolExecutor`, we obtain the expected behavior (the endpoint function reaches its end without shutting down the server, and we can call it multiple times).
We also installed `parsl==2022.10.10`, to try out PR https://github.com/Parsl/parsl/pull/2433, where `HighThroughputExecutor` has a `start_method` attribute taking values in `['fork', 'spawn', 'thread']`. None of the three possible choices for `start_method` solves our issue (i.e. the server is always shut down by `dfk.cleanup()`).
@mfranzon finally found the explanation, see:
The issue is due to a combination of uvicorn (our web server) and multiprocessing (which parsl's HTEX relies on). And it is related to the HTEX interchange, rather than to the FlowControl.
Quoting from https://github.com/encode/uvicorn/issues/548#issuecomment-1157082729
> asyncio sets up its signal handling in a specific way: it calls `signal.set_wakeup_fd` and passes an fd of an opened socket. After that, if any signals are sent to the process, they will be written to this socket/fd.
> Any child process will inherit not only the signal handlers' behavior but also the opened socket. As a result, when we send a signal to the child process, it will be written to the socket and the parent process will receive it too, even though the signal was not sent to it; or, if you send it to the parent process, the child process will receive the signal too.
To solve the issue in our minimal reproducible example, it is sufficient to add the lines

```python
import signal

signal.set_wakeup_fd(-1)
signal.signal(signal.SIGTERM, signal.SIG_DFL)
signal.signal(signal.SIGINT, signal.SIG_DFL)
```

at the beginning of the `starter` function in `parsl/executors/high_throughput/interchange.py`:
https://github.com/Parsl/parsl/blob/master/parsl/executors/high_throughput/interchange.py#L588-L599
We are now looking for the least risky way of adding this change (overriding the `starter` function, or forking parsl).

fractal-server now uses https://github.com/fractal-analytics-platform/parsl/tree/1.3.1-dev, where the problem is solved as described in the previous comment. It's not obvious whether this is something that may end up in upstream parsl. On the one hand, the change is not related to a parsl bug but rather to the mix of uvicorn and multiprocessing. On the other hand, having this change in parsl would enable using it within a uvicorn web server (not sure about unintended consequences, though). I'll link to this issue on the parsl slack channel, for information.
Closing.
A comprehensive summary of this issue is now available in
@tcompa your post above made my day; unrelated project/code, but it finally put an end to about 16 hours of debugging a fastapi crash issue.
After https://github.com/fractal-analytics-platform/fractal-server/pull/88, we would like to clean up all DFKs after a workflow is complete, but this leads to server shutdown. The `dfk.cleanup()` line is currently commented out, but we need to solve this issue to avoid a proliferation of processes.