fractal-analytics-platform / fractal-server

Fractal backend
https://fractal-analytics-platform.github.io/fractal-server/
BSD 3-Clause "New" or "Revised" License
10 stars 3 forks source link

Review graceful shutdown #1557

Open tcompa opened 3 months ago

tcompa commented 3 months ago

I'm not sure that https://github.com/fractal-analytics-platform/fractal-server/pull/1547 is working as expected, for a server running as

gunicorn \
    fractal_server.main:app --workers 2 \
    --worker-class fractal_server.gunicorn_fractal.FractalWorker \
    --logger-class fractal_server.gunicorn_fractal.FractalGunicornLogger \
    --bind 0.0.0.0:8000 --access-logfile - --error-logfile -

Let's review it in a real-life example.

tcompa commented 3 months ago

Here is an example if I start the server and send a kill -6 signal to a worker pid

2024-06-06 13:58:23,856 - uvicorn.error - INFO - Application startup complete.
2024-06-06 13:58:23,879 - fractal_server.lifespan - INFO - End application startup
2024-06-06 13:58:23,879 - uvicorn.error - INFO - Application startup complete.

...

2024-06-06 13:58:39,462 - uvicorn.error - INFO - [FractalWorker.custom_handle_abort - pid=364635] Now send SIGTERM to process.
2024-06-06 13:58:39,506 - uvicorn.error - INFO - Shutting down
2024-06-06 13:58:39,607 - uvicorn.error - INFO - Waiting for application shutdown.
2024-06-06 13:58:39,607 - uvicorn.error - INFO - Application shutdown complete.
2024-06-06 13:58:39,608 - uvicorn.error - INFO - Finished server process [364635]
2024-06-06 13:58:39,4   - gunicorn.error - ERROR - [pid 364633] - Worker (pid:364635) was sent SIGTERM!

2024-06-06 13:58:39,4   - gunicorn.error - INFO - [pid 364638] - Booting worker with pid: 364638
2024-06-06 13:58:39,4   - gunicorn.error - INFO - [pid 364638] - [FractalWorker.init_signals - pid=364638] Set `custom_handle_abort` for SIGABRT
2024-06-06 13:58:39,4   - gunicorn.error - INFO - [pid 364633] - Handling signal: winch
2024-06-06 13:58:39,4   - gunicorn.error - INFO - [pid 364633] - Handling signal: winch
2024-06-06 13:58:40,4   - gunicorn.error - INFO - [pid 364638] - Started server process [364638]
2024-06-06 13:58:40,4   - gunicorn.error - INFO - [pid 364638] - Waiting for application startup.
2024-06-06 13:58:40,820 - fractal_server.lifespan - INFO - Start application startup
2024-06-06 13:58:40,821 - fractal_server.app.db - WARNING - SQLite is supported (for version >=3.37) but discouraged in production. Given its partial support for ForeignKey constraints, database consistency cannot be guaranteed.
2024-06-06 13:58:40,867 - fractal_server.lifespan - INFO - End application startup
2024-06-06 13:58:40,867 - uvicorn.error - INFO - Application startup complete.
tcompa commented 3 months ago

Got it: we are just not logging what is going on.

This has to do with

    config_uvicorn_loggers()
    logger.info("End application startup")
    reset_logger_handlers(logger)
    yield
    logger = get_logger("fractal_server.lifespan")
    logger.info("Start application shutdown")

Since we have a reset-handlers call, the next time we call get-logger there are no associated handlers.


A quick switch to set_logger leads to

2024-06-06 14:01:32,705 - fractal_server.lifespan - INFO - End application startup
2024-06-06 14:01:32,706 - uvicorn.error - INFO - Application startup complete.
2024-06-06 14:01:32,738 - fractal_server.lifespan - INFO - End application startup
2024-06-06 14:01:32,738 - uvicorn.error - INFO - Application startup complete.

2024-06-06 14:01:38,186 - uvicorn.error - INFO - [FractalWorker.custom_handle_abort - pid=364875] Now send SIGTERM to process.
2024-06-06 14:01:38,247 - uvicorn.error - INFO - Shutting down
2024-06-06 14:01:38,348 - uvicorn.error - INFO - Waiting for application shutdown.
2024-06-06 14:01:38,348 - fractal_server.lifespan - INFO - Start application shutdown
2024-06-06 14:01:38,348 - fractal_server.lifespan - INFO - Current worker with pid 364875 is shutting down. Current jobs: app.state.jobsV1=[], app.state.jobsV2=[]
2024-06-06 14:01:38,348 - fractal_server.lifespan - INFO - Not running `cleanup_after_shutdown`
2024-06-06 14:01:38,348 - fractal_server.lifespan - INFO - End application shutdown
2024-06-06 14:01:38,349 - uvicorn.error - INFO - Application shutdown complete.
2024-06-06 14:01:38,349 - uvicorn.error - INFO - Finished server process [364875]
2024-06-06 14:01:38,4   - gunicorn.error - ERROR - [pid 364873] - Worker (pid:364875) was sent SIGTERM!

2024-06-06 14:01:38,4   - gunicorn.error - INFO - [pid 364878] - Booting worker with pid: 364878
2024-06-06 14:01:38,4   - gunicorn.error - INFO - [pid 364878] - [FractalWorker.init_signals - pid=364878] Set `custom_handle_abort` for SIGABRT
2024-06-06 14:01:38,4   - gunicorn.error - INFO - [pid 364873] - Handling signal: winch
2024-06-06 14:01:38,4   - gunicorn.error - INFO - [pid 364873] - Handling signal: winch
2024-06-06 14:01:39,4   - gunicorn.error - INFO - [pid 364878] - Started server process [364878]
2024-06-06 14:01:39,4   - gunicorn.error - INFO - [pid 364878] - Waiting for application startup.
2024-06-06 14:01:39,546 - fractal_server.lifespan - INFO - Start application startup
tcompa commented 2 months ago

The missing logs are now included - as of https://github.com/fractal-analytics-platform/fractal-server/pull/1569

tcompa commented 2 months ago

Today I observed a reentrant call inside <_io.BufferedWriter name='<stderr>'> upon ctrl-c. For the moment I fixed it by removing --worker-class fractal_server.gunicorn_fractal.FractalWorker, but we should review this.

If I re-add the worker class, I cannot always reproduce the error.

This is part of the error stack


    self.error_log.error(msg, *args, **kwargs)
  File "/usr/lib/python3.10/logging/__init__.py", line 1506, in error
    self._log(ERROR, msg, args, **kwargs)
  File "/usr/lib/python3.10/logging/__init__.py", line 1624, in _log
    self.handle(record)
  File "/usr/lib/python3.10/logging/__init__.py", line 1634, in handle
    self.callHandlers(record)
  File "/usr/lib/python3.10/logging/__init__.py", line 1696, in callHandlers
    hdlr.handle(record)
  File "/usr/lib/python3.10/logging/__init__.py", line 968, in handle
    self.emit(record)
  File "/usr/lib/python3.10/logging/__init__.py", line 1218, in emit
    StreamHandler.emit(self, record)
  File "/usr/lib/python3.10/logging/__init__.py", line 1108, in emit
    self.handleError(record)
  File "/usr/lib/python3.10/logging/__init__.py", line 1021, in handleError
    sys.stderr.write('--- Logging error ---\n')
RuntimeError: reentrant call inside <_io.BufferedWriter name='<stderr>'>
Call stack:
  File "/somewhere/venv/bin/gunicorn", line 8, in <module>
    sys.exit(run())
  File "/somewhere/venv/lib/python3.10/site-packages/gunicorn/app/wsgiapp.py", line 67, in run
    WSGIApplication("%(prog)s [OPTIONS] [APP_MODULE]", prog=prog).run()
  File "/somewhere/venv/lib/python3.10/site-packages/gunicorn/app/base.py", line 236, in run
    super().run()
  File "/somewhere/venv/lib/python3.10/site-packages/gunicorn/app/base.py", line 72, in run
    Arbiter(self).run()
  File "/somewhere/venv/lib/python3.10/site-packages/gunicorn/arbiter.py", line 224, in run
    handler()
  File "/somewhere/venv/lib/python3.10/site-packages/gunicorn/arbiter.py", line 261, in handle_int
    self.stop(False)
  File "/somewhere/venv/lib/python3.10/site-packages/gunicorn/arbiter.py", line 396, in stop
    time.sleep(0.1)
  File "/somewhere/venv/lib/python3.10/site-packages/gunicorn/arbiter.py", line 242, in handle_chld
    self.reap_workers()
  File "/somewhere/venv/lib/python3.10/site-packages/gunicorn/arbiter.py", line 554, in reap_workers
    self.log.error(msg)
  File "/somewhere/venv/lib/python3.10/site-packages/gunicorn/glogging.py", line 272, in error
    self.error_log.error(msg, *args, **kwargs)
Message: 'Worker (pid:192744) was sent SIGINT!'