Open tcompa opened 3 months ago
Here is an example if I start the server and send a kill -6
signal to a worker pid
2024-06-06 13:58:23,856 - uvicorn.error - INFO - Application startup complete.
2024-06-06 13:58:23,879 - fractal_server.lifespan - INFO - End application startup
2024-06-06 13:58:23,879 - uvicorn.error - INFO - Application startup complete.
...
2024-06-06 13:58:39,462 - uvicorn.error - INFO - [FractalWorker.custom_handle_abort - pid=364635] Now send SIGTERM to process.
2024-06-06 13:58:39,506 - uvicorn.error - INFO - Shutting down
2024-06-06 13:58:39,607 - uvicorn.error - INFO - Waiting for application shutdown.
2024-06-06 13:58:39,607 - uvicorn.error - INFO - Application shutdown complete.
2024-06-06 13:58:39,608 - uvicorn.error - INFO - Finished server process [364635]
2024-06-06 13:58:39,4 - gunicorn.error - ERROR - [pid 364633] - Worker (pid:364635) was sent SIGTERM!
2024-06-06 13:58:39,4 - gunicorn.error - INFO - [pid 364638] - Booting worker with pid: 364638
2024-06-06 13:58:39,4 - gunicorn.error - INFO - [pid 364638] - [FractalWorker.init_signals - pid=364638] Set `custom_handle_abort` for SIGABRT
2024-06-06 13:58:39,4 - gunicorn.error - INFO - [pid 364633] - Handling signal: winch
2024-06-06 13:58:39,4 - gunicorn.error - INFO - [pid 364633] - Handling signal: winch
2024-06-06 13:58:40,4 - gunicorn.error - INFO - [pid 364638] - Started server process [364638]
2024-06-06 13:58:40,4 - gunicorn.error - INFO - [pid 364638] - Waiting for application startup.
2024-06-06 13:58:40,820 - fractal_server.lifespan - INFO - Start application startup
2024-06-06 13:58:40,821 - fractal_server.app.db - WARNING - SQLite is supported (for version >=3.37) but discouraged in production. Given its partial support for ForeignKey constraints, database consistency cannot be guaranteed.
2024-06-06 13:58:40,867 - fractal_server.lifespan - INFO - End application startup
2024-06-06 13:58:40,867 - uvicorn.error - INFO - Application startup complete.
Got it: we are just not logging what is going on.
This has to do with
config_uvicorn_loggers()
logger.info("End application startup")
reset_logger_handlers(logger)
yield
logger = get_logger("fractal_server.lifespan")
logger.info("Start application shutdown")
Since we have a reset-handlers call, the next time we call get-logger there are no associated handlers.
A quick switch to set_logger
leads to
2024-06-06 14:01:32,705 - fractal_server.lifespan - INFO - End application startup
2024-06-06 14:01:32,706 - uvicorn.error - INFO - Application startup complete.
2024-06-06 14:01:32,738 - fractal_server.lifespan - INFO - End application startup
2024-06-06 14:01:32,738 - uvicorn.error - INFO - Application startup complete.
2024-06-06 14:01:38,186 - uvicorn.error - INFO - [FractalWorker.custom_handle_abort - pid=364875] Now send SIGTERM to process.
2024-06-06 14:01:38,247 - uvicorn.error - INFO - Shutting down
2024-06-06 14:01:38,348 - uvicorn.error - INFO - Waiting for application shutdown.
2024-06-06 14:01:38,348 - fractal_server.lifespan - INFO - Start application shutdown
2024-06-06 14:01:38,348 - fractal_server.lifespan - INFO - Current worker with pid 364875 is shutting down. Current jobs: app.state.jobsV1=[], app.state.jobsV2=[]
2024-06-06 14:01:38,348 - fractal_server.lifespan - INFO - Not running `cleanup_after_shutdown`
2024-06-06 14:01:38,348 - fractal_server.lifespan - INFO - End application shutdown
2024-06-06 14:01:38,349 - uvicorn.error - INFO - Application shutdown complete.
2024-06-06 14:01:38,349 - uvicorn.error - INFO - Finished server process [364875]
2024-06-06 14:01:38,4 - gunicorn.error - ERROR - [pid 364873] - Worker (pid:364875) was sent SIGTERM!
2024-06-06 14:01:38,4 - gunicorn.error - INFO - [pid 364878] - Booting worker with pid: 364878
2024-06-06 14:01:38,4 - gunicorn.error - INFO - [pid 364878] - [FractalWorker.init_signals - pid=364878] Set `custom_handle_abort` for SIGABRT
2024-06-06 14:01:38,4 - gunicorn.error - INFO - [pid 364873] - Handling signal: winch
2024-06-06 14:01:38,4 - gunicorn.error - INFO - [pid 364873] - Handling signal: winch
2024-06-06 14:01:39,4 - gunicorn.error - INFO - [pid 364878] - Started server process [364878]
2024-06-06 14:01:39,4 - gunicorn.error - INFO - [pid 364878] - Waiting for application startup.
2024-06-06 14:01:39,546 - fractal_server.lifespan - INFO - Start application startup
The missing logs are now included - as of https://github.com/fractal-analytics-platform/fractal-server/pull/1569
Today I observed a reentrant call inside <_io.BufferedWriter name='<stderr>'>
upon ctrl-c. For the moment I fixed it by removing --worker-class fractal_server.gunicorn_fractal.FractalWorker
, but we should review this.
If I re-add the worker class, I cannot always reproduce the error.
This is part of the error stack
self.error_log.error(msg, *args, **kwargs)
File "/usr/lib/python3.10/logging/__init__.py", line 1506, in error
self._log(ERROR, msg, args, **kwargs)
File "/usr/lib/python3.10/logging/__init__.py", line 1624, in _log
self.handle(record)
File "/usr/lib/python3.10/logging/__init__.py", line 1634, in handle
self.callHandlers(record)
File "/usr/lib/python3.10/logging/__init__.py", line 1696, in callHandlers
hdlr.handle(record)
File "/usr/lib/python3.10/logging/__init__.py", line 968, in handle
self.emit(record)
File "/usr/lib/python3.10/logging/__init__.py", line 1218, in emit
StreamHandler.emit(self, record)
File "/usr/lib/python3.10/logging/__init__.py", line 1108, in emit
self.handleError(record)
File "/usr/lib/python3.10/logging/__init__.py", line 1021, in handleError
sys.stderr.write('--- Logging error ---\n')
RuntimeError: reentrant call inside <_io.BufferedWriter name='<stderr>'>
Call stack:
File "/somewhere/venv/bin/gunicorn", line 8, in <module>
sys.exit(run())
File "/somewhere/venv/lib/python3.10/site-packages/gunicorn/app/wsgiapp.py", line 67, in run
WSGIApplication("%(prog)s [OPTIONS] [APP_MODULE]", prog=prog).run()
File "/somewhere/venv/lib/python3.10/site-packages/gunicorn/app/base.py", line 236, in run
super().run()
File "/somewhere/venv/lib/python3.10/site-packages/gunicorn/app/base.py", line 72, in run
Arbiter(self).run()
File "/somewhere/venv/lib/python3.10/site-packages/gunicorn/arbiter.py", line 224, in run
handler()
File "/somewhere/venv/lib/python3.10/site-packages/gunicorn/arbiter.py", line 261, in handle_int
self.stop(False)
File "/somewhere/venv/lib/python3.10/site-packages/gunicorn/arbiter.py", line 396, in stop
time.sleep(0.1)
File "/somewhere/venv/lib/python3.10/site-packages/gunicorn/arbiter.py", line 242, in handle_chld
self.reap_workers()
File "/somewhere/venv/lib/python3.10/site-packages/gunicorn/arbiter.py", line 554, in reap_workers
self.log.error(msg)
File "/somewhere/venv/lib/python3.10/site-packages/gunicorn/glogging.py", line 272, in error
self.error_log.error(msg, *args, **kwargs)
Message: 'Worker (pid:192744) was sent SIGINT!'
I'm not sure that https://github.com/fractal-analytics-platform/fractal-server/pull/1547 is working as expected, for a server running as
Let's review it in a real-life example.