Closed bigunyak closed 4 years ago
The error is not expected, but there is nothing from your example that shows why it happens. Tell us more about your environment.
Tell us more about your environment.
I've just reproduced the problem on a completely fresh setup, here are the steps:
mkdir gunicorn
cd gunicorn/
pipenv --python 3.6
pipenv install flask
pipenv install gunicorn
vim hello.py
pipenv shell
gunicorn -b 0.0.0.0:5000 --log-level=debug hello
The hello.py
is exactly the same Flask application as I posted in the initial report.
Below is the complete log.
~$ gunicorn -b 0.0.0.0:5000 --log-level=debug hello
[2018-06-06 09:16:21 +0200] [19829] [DEBUG] Current configuration:
config: None
bind: ['0.0.0.0:5000']
backlog: 2048
workers: 1
worker_class: sync
threads: 1
worker_connections: 1000
max_requests: 0
max_requests_jitter: 0
timeout: 30
graceful_timeout: 30
keepalive: 2
limit_request_line: 4094
limit_request_fields: 100
limit_request_field_size: 8190
reload: False
reload_engine: auto
reload_extra_files: []
spew: False
check_config: False
preload_app: False
sendfile: None
reuse_port: False
chdir: /home/dima/work/gunicorn
daemon: False
raw_env: []
pidfile: None
worker_tmp_dir: None
user: 1000
group: 985
umask: 0
initgroups: False
tmp_upload_dir: None
secure_scheme_headers: {'X-FORWARDED-PROTOCOL': 'ssl', 'X-FORWARDED-PROTO': 'https', 'X-FORWARDED-SSL': 'on'}
forwarded_allow_ips: ['127.0.0.1']
accesslog: None
disable_redirect_access_to_syslog: False
access_log_format: %(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(f)s" "%(a)s"
errorlog: -
loglevel: debug
capture_output: False
logger_class: gunicorn.glogging.Logger
logconfig: None
logconfig_dict: {}
syslog_addr: udp://localhost:514
syslog: False
syslog_prefix: None
syslog_facility: user
enable_stdio_inheritance: False
statsd_host: None
statsd_prefix:
proc_name: None
default_proc_name: hello
pythonpath: None
paste: None
on_starting: <function OnStarting.on_starting at 0x7f9757112d08>
on_reload: <function OnReload.on_reload at 0x7f9757112e18>
when_ready: <function WhenReady.when_ready at 0x7f9757112f28>
pre_fork: <function Prefork.pre_fork at 0x7f9756c230d0>
post_fork: <function Postfork.post_fork at 0x7f9756c231e0>
post_worker_init: <function PostWorkerInit.post_worker_init at 0x7f9756c232f0>
worker_int: <function WorkerInt.worker_int at 0x7f9756c23400>
worker_abort: <function WorkerAbort.worker_abort at 0x7f9756c23510>
pre_exec: <function PreExec.pre_exec at 0x7f9756c23620>
pre_request: <function PreRequest.pre_request at 0x7f9756c23730>
post_request: <function PostRequest.post_request at 0x7f9756c237b8>
child_exit: <function ChildExit.child_exit at 0x7f9756c238c8>
worker_exit: <function WorkerExit.worker_exit at 0x7f9756c239d8>
nworkers_changed: <function NumWorkersChanged.nworkers_changed at 0x7f9756c23ae8>
on_exit: <function OnExit.on_exit at 0x7f9756c23bf8>
proxy_protocol: False
proxy_allow_ips: ['127.0.0.1']
keyfile: None
certfile: None
ssl_version: 2
cert_reqs: 0
ca_certs: None
suppress_ragged_eofs: True
do_handshake_on_connect: False
ciphers: TLSv1
raw_paste_global_conf: []
[2018-06-06 09:16:21 +0200] [19829] [INFO] Starting gunicorn 19.8.1
[2018-06-06 09:16:21 +0200] [19829] [DEBUG] Arbiter booted
[2018-06-06 09:16:21 +0200] [19829] [INFO] Listening at: http://0.0.0.0:5000 (19829)
[2018-06-06 09:16:21 +0200] [19829] [INFO] Using worker: sync
[2018-06-06 09:16:21 +0200] [19832] [INFO] Booting worker with pid: 19832
[2018-06-06 09:16:22 +0200] [19829] [DEBUG] 1 workers
[2018-06-06 09:16:48 +0200] [19832] [DEBUG] GET /
[2018-06-06 09:17:19 +0200] [19829] [CRITICAL] WORKER TIMEOUT (pid:19832)
[2018-06-06 09:17:19 +0200] [19832] [INFO] Worker exiting (pid: 19832)
[2018-06-06 09:17:19 +0200] [19872] [INFO] Booting worker with pid: 19872
^C[2018-06-06 09:17:26 +0200] [19829] [INFO] Handling signal: int
[2018-06-06 09:17:26 +0200] [19872] [INFO] Worker exiting (pid: 19872)
[2018-06-06 09:17:26 +0200] [19829] [INFO] Shutting down: Master
~$ pip list
Package Version
------------ -------
click 6.7
Flask 1.0.2
gunicorn 19.8.1
itsdangerous 0.24
Jinja2 2.10
MarkupSafe 1.0
pip 10.0.1
setuptools 39.2.0
Werkzeug 0.14.1
wheel 0.31.1
@bigunyak I think it's because of the default timeout, your worker has been silent for 30s. http://docs.gunicorn.org/en/stable/settings.html#timeout
From your log,
[2018-06-05 14:57:16 +0200] [11232] [DEBUG] GET /
[2018-06-05 14:57:47 +0200] [11229] [CRITICAL] WORKER TIMEOUT (pid:11232)
[2018-06-06 09:16:48 +0200] [19832] [DEBUG] GET /
[2018-06-06 09:17:19 +0200] [19829] [CRITICAL] WORKER TIMEOUT (pid:19832)
I'm seeing the same thing: workers are timing out even when serving no requests, with sync worker. To that sense, the critical level log is quite confusing.
Try use Gevent worker could solve this.
To that sense, the critical level log is quite confusing.
Exactly, that was my original question: if it's an expected behavior why critical error then? Would be also nice to get some background on why workers need to be restarted, maybe this could be added to the Design document.
I'm seeing this as well (reproduced by using examples/test.py
with gunicorn test:app -b localhost:9595 --log-level=debug --timeout=5
) and I agree that the critical level is a bit confusing. I'd be OK change it to debug level. @benoitc @tilgovi what do you think?
I'm thinking info level might be a little bit better.
I had the same with MSYS2 on Win10 but finally could solved.
in notify() of ...\gunicorn\workers\workertmp.py, os.fchmod is used originally. But it does not work in MSYS. Instead of os.fchmod, I used os.utime. The code is followed. I think it could work for all platform.
def notify(self):
try:
self.spinner = (self.spinner + 1) % 2
os.fchmod(self._tmp.fileno(), self.spinner)
if PLATFORM.startswith('MSYS') :
os.utime(self._tmp.fileno(), None)
except AttributeError:
# python < 2.6
self._tmp.truncate(0)
os.write(self._tmp.fileno(), b"X")
@berkerpeksag I wouldn't expect that the worker exit because no requests happen. This error should only happen if the worker has been kept busy for a time > to the timeout. So the error is critical. Imo we should improve the documentation to provide more use cases and responses to such errors.
If the error still happen when the worker is not kept busy then there is something else happening and we have probably a bug.
[EDIT] Same bug for me. With Django 1.10 / gunicorn 19.6.0 / Python 2.7.15 in python2.7-alpine on Debian 8.8 and stock kernel 3.16, all was working fine. After updating to Django 1.11 and gunicorn 19.8.1, the workers keep failing at boot with [CRITICAL WORKER TIMEOUT]. Downgrading gunicorn to 19.6.0 doesn't fix the problem. We updated the host kernel to 4.9.0 (it was scheduled), and the workers successfully booted without errors. But:
We're going to try gunicorn gevent to see if we'are able to get our app back online.
Using gunicorn with gevent didn't fixed the bug.
any update on this issue ?
It looks like @neocolor identified a real bug under MSYS. It might deserve a separate issue.
@bigunyak what platform are you running under? I have tried to reproduce with the simple example and I cannot do it following exactly the steps outlined above. This agrees with my experience running multiple applications in production on multiple frameworks. The worker notification system has not changed recently, to my knowledge. My platform is Python 3.7 on MacOS 10.13.6, but I run Gunicorn in production with sync workers for several applications on Python 2.7 and 3.6.5 and only see worker timeouts when there are legitimately long requests that block the workers.
For @Tberdy: what happens if you try to set --worker-tmp-dir
to somewhere outside the tmpfs filesystem? I am just wondering if maybe alpine or docker or the combination interferes somehow with how workers notify the arbiter.
See also #1388 for Docker related tmpfs issues.
i have this issue to.
I have this issue too, gunicorn sync was working perfectly well until yesternight, ut started reporting, workers timeout [CRITICAL] using gevent seems to solve my issue, but I'd really want to know why this happened .
@timoj58 @cjmash can you provides more detail about the issue ? How are you running gunicorn (in a vm?, options ...), which fileystem, OS? ANthing that could help to reproduce would be very helpful :)
@benoitc I am running gunicorn to start my Django project on kubernetes my gunicorn arguments are --bind=$port --workers=7 --timeout=1200 --log-level=debug --access-logfile - error-logfile -"
the errors i get from the logs
E [2018-08-09 21:47:56 +0000] [14] [INFO] Booting worker with pid: 14
E [2018-08-09 21:47:56 +0000] [12] [INFO] Booting worker with pid: 12
E [2018-08-09 21:47:56 +0000] [1] [DEBUG] 7 workers
E [2018-08-09 21:47:56 +0000] [11] [INFO] Booting worker with pid: 11
E [2018-08-09 21:47:55 +0000] [10] [INFO] Booting worker with pid: 10
E [2018-08-09 21:47:55 +0000] [9] [INFO] Booting worker with pid: 9
E [2018-08-09 21:47:55 +0000] [8] [INFO] Booting worker with pid: 8
E [2018-08-09 21:47:55 +0000] [1] [INFO] Using worker: sync
E [2018-08-09 21:47:55 +0000] [1] [INFO] Listening at: http://0.0.0.0:4040 (1)
E [2018-08-09 21:47:55 +0000] [1] [DEBUG] Arbiter booted
E [2018-08-09 21:47:55 +0000] [1] [INFO] Starting gunicorn 19.7.1
E raw_paste_global_conf: []
E ciphers: TLSv1
E do_handshake_on_connect: False
E suppress_ragged_eofs: True
E ca_certs: None
E cert_reqs: 0
E ssl_version: 2
E certfile: None
E keyfile: None
E proxy_allow_ips: ['127.0.0.1']
E proxy_protocol: False
E on_exit: <function OnExit.on_exit at 0x7f9612aad7b8>
E nworkers_changed: <function NumWorkersChanged.nworkers_changed at 0x7f9612aad620>
E worker_exit: <function WorkerExit.worker_exit at 0x7f9612aad488>
E child_exit: <function ChildExit.child_exit at 0x7f9612aad2f0>
E post_request: <function PostRequest.post_request at 0x7f9612aad158>
E pre_request: <function PreRequest.pre_request at 0x7f9612aad048>
E pre_exec: <function PreExec.pre_exec at 0x7f9612aa8e18>
E worker_abort: <function WorkerAbort.worker_abort at 0x7f9612aa8c80>
E worker_int: <function WorkerInt.worker_int at 0x7f9612aa8ae8>
E post_worker_init: <function PostWorkerInit.post_worker_init at 0x7f9612aa8950>
E post_fork: <function Postfork.post_fork at 0x7f9612aa87b8>
E pre_fork: <function Prefork.pre_fork at 0x7f9612aa8620>
E when_ready: <function WhenReady.when_ready at 0x7f9612aa8488>
E on_reload: <function OnReload.on_reload at 0x7f9612aa82f0>
E on_starting: <function OnStarting.on_starting at 0x7f9612aa8158>
E paste: None
E pythonpath: None
E default_proc_name: art.wsgi
E proc_name: None
E statsd_prefix:
E statsd_host: None
E enable_stdio_inheritance: False
E syslog_facility: user
E syslog_prefix: None
E syslog: False
E syslog_addr: udp://localhost:514
E logconfig: None
E logger_class: gunicorn.glogging.Logger
E capture_output: False
E loglevel: debug
E errorlog: -
E access_log_format: %(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(f)s" "%(a)s"
E accesslog: -
E forwarded_allow_ips: ['127.0.0.1']
E secure_scheme_headers: {'X-FORWARDED-PROTOCOL': 'ssl', 'X-FORWARDED-PROTO': 'https', 'X-FORWARDED-SSL': 'on'}
E tmp_upload_dir: None
E initgroups: False
E umask: 0
E group: 0
E user: 0
E worker_tmp_dir: None
E pidfile: None
E raw_env: []
E daemon: False
E chdir: /usr/src/app
E sendfile: None
E preload_app: False
E check_config: False
E spew: False
E reload_engine: auto
E reload: False
E limit_request_field_size: 8190
E limit_request_fields: 100
E limit_request_line: 4094
E keepalive: 2
E graceful_timeout: 30
E timeout: 1200
E max_requests_jitter: 0
E max_requests: 0
E worker_connections: 1000
E threads: 1
E worker_class: sync
E workers: 7
E backlog: 2048
E bind: ['0.0.0.0:4040']
E config: None
E [2018-08-09 21:47:55 +0000] [1] [DEBUG] Current configuration:
I No migrations to apply.
I Running migrations:
I Apply all migrations: admin, auth, contenttypes, core, dashboard, jet, oauth2_provider, sessions
I Operations to perform:
E [2018-08-09 21:47:20 +0000] [13] [INFO] Worker exiting (pid: 13)
E os.path.dirname(os.path.dirname(__file__)), '.env'))
E /usr/src/app/art/wsgi.py:19: UserWarning: Not reading /usr/src/app/.env - it doesn't exist.
E [2018-08-09 21:47:00 +0000] [21] [INFO] Booting worker with pid: 21
E [2018-08-09 21:47:00 +0000] [1] [INFO] Handling signal: term
E [2018-08-09 21:46:35 +0000] [12] [INFO] Booting worker with pid: 12
E [2018-08-09 21:46:34 +0000] [13] [INFO] Booting worker with pid: 13
E [2018-08-09 21:46:34 +0000] [11] [INFO] Booting worker with pid: 11
E [2018-08-09 21:46:34 +0000] [1] [DEBUG] 7 workers
E [2018-08-09 21:46:34 +0000] [10] [INFO] Booting worker with pid: 10
E [2018-08-09 21:46:34 +0000] [9] [INFO] Booting worker with pid: 9
E [2018-08-09 21:46:34 +0000] [8] [INFO] Booting worker with pid: 8
E [2018-08-09 21:46:34 +0000] [7] [INFO] Booting worker with pid: 7
E [2018-08-09 21:46:34 +0000] [1] [INFO] Using worker: sync
E [2018-08-09 21:46:34 +0000] [1] [INFO] Listening at: http://0.0.0.0:4040 (1)
E [2018-08-09 21:46:34 +0000] [1] [DEBUG] Arbiter booted
E [2018-08-09 21:46:34 +0000] [1] [INFO] Starting gunicorn 19.7.1
E raw_paste_global_conf: []
E ciphers: TLSv1
E do_handshake_on_connect: False
E suppress_ragged_eofs: True
E ca_certs: None
E cert_reqs: 0
E ssl_version: 2
E certfile: None
E keyfile: None
E proxy_allow_ips: ['127.0.0.1']
E proxy_protocol: False
E on_exit: <function OnExit.on_exit at 0x7f07f2cfe7b8>
E nworkers_changed: <function NumWorkersChanged.nworkers_changed at 0x7f07f2cfe620>
E worker_exit: <function WorkerExit.worker_exit at 0x7f07f2cfe488>
E child_exit: <function ChildExit.child_exit at 0x7f07f2cfe2f0>
E post_request: <function PostRequest.post_request at 0x7f07f2cfe158>
E pre_request: <function PreRequest.pre_request at 0x7f07f2cfe048>
E pre_exec: <function PreExec.pre_exec at 0x7f07f2cf9e18>
E worker_abort: <function WorkerAbort.worker_abort at 0x7f07f2cf9c80>
E worker_int: <function WorkerInt.worker_int at 0x7f07f2cf9ae8>
E post_worker_init: <function PostWorkerInit.post_worker_init at 0x7f07f2cf9950>
E post_fork: <function Postfork.post_fork at 0x7f07f2cf97b8>
E pre_fork: <function Prefork.pre_fork at 0x7f07f2cf9620>
E when_ready: <function WhenReady.when_ready at 0x7f07f2cf9488>
E on_reload: <function OnReload.on_reload at 0x7f07f2cf92f0>
E on_starting: <function OnStarting.on_starting at 0x7f07f2cf9158>
E paste: None
E pythonpath: None
E default_proc_name: art.wsgi
E proc_name: None
E statsd_prefix:
E statsd_host: None
E enable_stdio_inheritance: False
E syslog_facility: user
E syslog_prefix: None
E syslog: False
E syslog_addr: udp://localhost:514
E logconfig: None
E logger_class: gunicorn.glogging.Logger
E capture_output: False
E loglevel: debug
E errorlog: -
E access_log_format: %(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(f)s" "%(a)s"
E accesslog: -
E forwarded_allow_ips: ['127.0.0.1']
E secure_scheme_headers: {'X-FORWARDED-PROTOCOL': 'ssl', 'X-FORWARDED-PROTO': 'https', 'X-FORWARDED-SSL': 'on'}
E tmp_upload_dir: None
E initgroups: False
E umask: 0
E group: 0
E user: 0
E worker_tmp_dir: None
E pidfile: None
E raw_env: []
E daemon: False
E chdir: /usr/src/app
E sendfile: None
E preload_app: False
E check_config: False
E spew: False
E reload_engine: auto
E reload: False
E limit_request_field_size: 8190
E limit_request_fields: 100
E limit_request_line: 4094
E keepalive: 2
E graceful_timeout: 30
E timeout: 1200
E max_requests_jitter: 0
E max_requests: 0
E worker_connections: 1000
E threads: 1
E worker_class: sync
E workers: 7
E backlog: 2048
E bind: ['0.0.0.0:4040']
E config: None
E [2018-08-09 21:46:34 +0000] [1] [DEBUG] Current configuration:
I struggled a bit to reproduce the problem this time but it's still there in the latest gunicorn version 19.9.0. The steps to reproduce it are exactly the same as I described in this post. My system is Arch Linux x86_64 GNU/Linux (kernel 4.17.2-1-ARCH), Python 3.6.5 Here is a fresh log trace.
~$ gunicorn -b 0.0.0.0:5000 --log-level=debug hello
[2018-08-10 09:48:40 +0200] [23114] [DEBUG] Current configuration:
config: None
bind: ['0.0.0.0:5000']
backlog: 2048
workers: 1
worker_class: sync
threads: 1
worker_connections: 1000
max_requests: 0
max_requests_jitter: 0
timeout: 30
graceful_timeout: 30
keepalive: 2
limit_request_line: 4094
limit_request_fields: 100
limit_request_field_size: 8190
reload: False
reload_engine: auto
reload_extra_files: []
spew: False
check_config: False
preload_app: False
sendfile: None
reuse_port: False
chdir: /home/dima/lerning/python/modules/gunicorn
daemon: False
raw_env: []
pidfile: None
worker_tmp_dir: None
user: 1000
group: 985
umask: 0
initgroups: False
tmp_upload_dir: None
secure_scheme_headers: {'X-FORWARDED-PROTOCOL': 'ssl', 'X-FORWARDED-PROTO': 'https', 'X-FORWARDED-SSL': 'on'}
forwarded_allow_ips: ['127.0.0.1']
accesslog: None
disable_redirect_access_to_syslog: False
access_log_format: %(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(f)s" "%(a)s"
errorlog: -
loglevel: debug
capture_output: False
logger_class: gunicorn.glogging.Logger
logconfig: None
logconfig_dict: {}
syslog_addr: udp://localhost:514
syslog: False
syslog_prefix: None
syslog_facility: user
enable_stdio_inheritance: False
statsd_host: None
statsd_prefix:
proc_name: None
default_proc_name: hello
pythonpath: None
paste: None
on_starting: <function OnStarting.on_starting at 0x7f5f8b9dad08>
on_reload: <function OnReload.on_reload at 0x7f5f8b9dae18>
when_ready: <function WhenReady.when_ready at 0x7f5f8b9daf28>
pre_fork: <function Prefork.pre_fork at 0x7f5f8b4ec0d0>
post_fork: <function Postfork.post_fork at 0x7f5f8b4ec1e0>
post_worker_init: <function PostWorkerInit.post_worker_init at 0x7f5f8b4ec2f0>
worker_int: <function WorkerInt.worker_int at 0x7f5f8b4ec400>
worker_abort: <function WorkerAbort.worker_abort at 0x7f5f8b4ec510>
pre_exec: <function PreExec.pre_exec at 0x7f5f8b4ec620>
pre_request: <function PreRequest.pre_request at 0x7f5f8b4ec730>
post_request: <function PostRequest.post_request at 0x7f5f8b4ec7b8>
child_exit: <function ChildExit.child_exit at 0x7f5f8b4ec8c8>
worker_exit: <function WorkerExit.worker_exit at 0x7f5f8b4ec9d8>
nworkers_changed: <function NumWorkersChanged.nworkers_changed at 0x7f5f8b4ecae8>
on_exit: <function OnExit.on_exit at 0x7f5f8b4ecbf8>
proxy_protocol: False
proxy_allow_ips: ['127.0.0.1']
keyfile: None
certfile: None
ssl_version: 2
cert_reqs: 0
ca_certs: None
suppress_ragged_eofs: True
do_handshake_on_connect: False
ciphers: TLSv1
raw_paste_global_conf: []
[2018-08-10 09:48:40 +0200] [23114] [INFO] Starting gunicorn 19.9.0
[2018-08-10 09:48:40 +0200] [23114] [DEBUG] Arbiter booted
[2018-08-10 09:48:40 +0200] [23114] [INFO] Listening at: http://0.0.0.0:5000 (23114)
[2018-08-10 09:48:40 +0200] [23114] [INFO] Using worker: sync
[2018-08-10 09:48:40 +0200] [23117] [INFO] Booting worker with pid: 23117
[2018-08-10 09:48:40 +0200] [23114] [DEBUG] 1 workers
[2018-08-10 09:48:45 +0200] [23117] [DEBUG] GET /
[2018-08-10 09:48:54 +0200] [23117] [DEBUG] GET /
[2018-08-10 09:49:00 +0200] [23117] [DEBUG] GET /
[2018-08-10 09:49:18 +0200] [23117] [DEBUG] Closing connection.
[2018-08-10 09:49:18 +0200] [23117] [DEBUG] GET /
[2018-08-10 09:49:23 +0200] [23117] [DEBUG] GET /
[2018-08-10 09:49:37 +0200] [23117] [DEBUG] Closing connection.
[2018-08-10 09:49:37 +0200] [23117] [DEBUG] GET /
[2018-08-10 09:49:50 +0200] [23117] [DEBUG] Closing connection.
[2018-08-10 09:51:11 +0200] [23117] [DEBUG] GET /
[2018-08-10 09:51:13 +0200] [23117] [DEBUG] GET /
[2018-08-10 09:51:43 +0200] [23114] [CRITICAL] WORKER TIMEOUT (pid:23117)
[2018-08-10 09:51:43 +0200] [23117] [INFO] Worker exiting (pid: 23117)
[2018-08-10 09:51:44 +0200] [23229] [INFO] Booting worker with pid: 23229
As before to test it I was just hitting http://0.0.0.0:5000/ in Chromium. Below is the content of Pipfile and Pipfile.lock for you to see the exact environment.
[[source]]
url = "https://pypi.org/simple"
verify_ssl = true
name = "pypi"
[packages] flask = "" gunicorn = ""
[dev-packages]
[requires] python_version = "3.6"
* Pipfile.lock
```json
{
"_meta": {
"hash": {
"sha256": "81cb5d5f0b11719d8d9c5ec9cc683fdcf959c652fda256d5552a82d0f459a99c"
},
"pipfile-spec": 6,
"requires": {
"python_version": "3.6"
},
"sources": [
{
"name": "pypi",
"url": "https://pypi.org/simple",
"verify_ssl": true
}
]
},
"default": {
"click": {
"hashes": [
"sha256:29f99fc6125fbc931b758dc053b3114e55c77a6e4c6c3a2674a2dc986016381d",
"sha256:f15516df478d5a56180fbf80e68f206010e6d160fc39fa508b65e035fd75130b"
],
"version": "==6.7"
},
"flask": {
"hashes": [
"sha256:2271c0070dbcb5275fad4a82e29f23ab92682dc45f9dfbc22c02ba9b9322ce48",
"sha256:a080b744b7e345ccfcbc77954861cb05b3c63786e93f2b3875e0913d44b43f05"
],
"index": "pypi",
"version": "==1.0.2"
},
"gunicorn": {
"hashes": [
"sha256:aa8e0b40b4157b36a5df5e599f45c9c76d6af43845ba3b3b0efe2c70473c2471",
"sha256:fa2662097c66f920f53f70621c6c58ca4a3c4d3434205e608e121b5b3b71f4f3"
],
"index": "pypi",
"version": "==19.9.0"
},
"itsdangerous": {
"hashes": [
"sha256:cbb3fcf8d3e33df861709ecaf89d9e6629cff0a217bc2848f1b41cd30d360519"
],
"version": "==0.24"
},
"jinja2": {
"hashes": [
"sha256:74c935a1b8bb9a3947c50a54766a969d4846290e1e788ea44c1392163723c3bd",
"sha256:f84be1bb0040caca4cea721fcbbbbd61f9be9464ca236387158b0feea01914a4"
],
"version": "==2.10"
},
"markupsafe": {
"hashes": [
"sha256:a6be69091dac236ea9c6bc7d012beab42010fa914c459791d627dad4910eb665"
],
"version": "==1.0"
},
"werkzeug": {
"hashes": [
"sha256:c3fd7a7d41976d9f44db327260e263132466836cef6f91512889ed60ad26557c",
"sha256:d5da73735293558eb1651ee2fddc4d0dedcfa06538b8813a2e20011583c9e49b"
],
"version": "==0.14.1"
}
},
"develop": {}
}
Just FYI, I'm also seeing this failure very regularly with:
nohup gunicorn -w 8 --access-logfile=- --bind 127.0.0.1:5000 wsgi &
I just have a wsgi.py that has
from chart_api import application
if __name__ == "__main__":
application.run()
Let me know if there is anything you want me to try/experiment or if there are specifics in the logs you want me to check. I'm running flask on a GCP VM.
sorry for late reply. I am running it as
gunicorn --log-file=/home/ubuntu/log/gunicorn.log predictor_api:app -b localhost:5000 &
i did use the gevent setting etc, but i have changed my design of what i required this for in order to work around issue, hence the basic setting above (which also failed but this is to be expected given no gevent)
Python version 3.6 env: aws tensorflow_p36 optmized (ubuntu) nginx is sat in front of gunicorn, which is executing a flask app.
flask version 1.0.2 nginx version 1.10.3 gunicorn version 19.9.0
I did change the nginx timeouts as well incase this may have caused it.
Facing the same problem with gunicorn server
Flask==0.12.1 gunicorn==19.7.1
when i start server with above command, my system got freezing for some time and worker pids are keeping on booting , though i kept timeout 120 sec, and the server is not accepting single request.
Any update on this issue ? I have same issue
[CRITICAL] WORKER TIMEOUT
Wondering whether anyone has successfully reproduced this in a Docker image?
Also seeing this when trying to implement datadog's ddtrace-run worker on an existing application starting with gunicorn -k gevent --threads 4.
Funny trace of a SystemExit I've never seen before too...
[2018-11-07 11:11:50 +0000] [15987] [INFO] Booting worker with pid: 15987 [2018-11-07 11:11:50 +0000] [15977] [DEBUG] 1 workers [2018-11-07 11:12:20 +0000] [15977] [CRITICAL] WORKER TIMEOUT (pid:15987) Exception SystemExit: 1 in <bound method LibevLoop._loop_will_run of <cassandra.io.libevreactor.LibevLoop object at 0x7f3cb66d4a50>> ignored
I am able to resolve this issue by matching the number of workers and number of threads.
I had set workers = (2 * cpu_count) + 1
and did not set threads.
Once I changed threads = workers
, everything started working fine. Just in case, if this helps someone.
This is how it looks now
def run(host='0.0.0.0', port=8080, workers=1 + (multiprocessing.cpu_count() * 2)):
"""Run the app with Gunicorn."""
if app.debug:
app.run(host, int(port), use_reloader=False)
else:
gunicorn = WSGIApplication()
gunicorn.load_wsgiapp = lambda: app
gunicorn.cfg.set('bind', '%s:%s' % (host, port))
gunicorn.cfg.set('workers', workers)
gunicorn.cfg.set('threads', workers)
gunicorn.cfg.set('pidfile', None)
gunicorn.cfg.set('worker_class', 'sync')
gunicorn.cfg.set('keepalive', 10)
gunicorn.cfg.set('accesslog', '-')
gunicorn.cfg.set('errorlog', '-')
gunicorn.cfg.set('reload', True)
gunicorn.chdir()
gunicorn.run()
I encountered this issue running Django with a single Docker container on AWS Elastic Beanstalk. I resolved the issue by fixing my security groups to ensure my EC2 instance could talk to my RDS instance. I recognize this may not be the solution for 99% of folks on this issue, but I'm leaving this note to help others avoid wasting hours falling down this rabbit hole.
In my case the issue was resolved by performing systemctl daemon-reload
.
Long story short I knew of the default timeout at 30sec, but was unable to change it until i figured i had to of course reload the systemd daemon to apply the changes to the service.
@bigunyak @benoitc @tilgovi After a long google goose chase and some experimentation I believe the root cause of this issue is chrome "preconnect/prediction service" (which is enabled by default in both Chrome and Chromium).
@jeiting did a nice write up about this issue https://hackernoon.com/chrome-preconnect-breaks-singly-threaded-servers-95944be16400 Some additional reading: https://github.com/corydolphin/flask-cors/issues/147 https://github.com/pallets/flask/issues/2169
TLDR In some cases Chrome/Chromium opens and holds an "empty" TCP connection(s) (i.e its predicting that its going to fetch another resource soon). If the "empty" TCP connection(s) hits your gunicorn server first then any subsequent "real" request from chrome can get stuck behind the "empty" request until the worker(s) handling the "empty" request times out. This is more likely to happen if you are only running a single sync worker in gunicorn. But as my experiments found it can happen even if you are running multiple sync workers.
My Environment
FROM ubuntu:18.04
...
RUN pip3 install Flask==1.0.2
RUN pip3 install gunicorn==19.9.0
RUN pip3 install flask-cors==3.0.6
......
Experiment 1 Gunicorn config: 1 sync worker (default 30s timeout)
Firefox: On almost every react page load, the CORS OPTIONS request gets blocked for 5s, and then succeeds. Chromium: On EVERY react page load, CORS OPTIONS request gets blocked for 1.5minutes!!!! and then succeeds. Chromium(prediction service disabled): Everything loads fine Chrome: Everything loads fine
Experiment 2 Gunicorn config: 4 sync workers (default 30s timeout)
Firefox: Everything loads fine Chromium: On every 3rd react page load, CORS OPTIONS request gets blocked for 30s and then succeeds. Chromium(prediction service disabled): Everything loads fine Chrome: Everything loads fine
Experiment 3 Gunicorn config: 8 sync workers (default 30s timeout)
Firefox: Everything loads fine Chromium: Everything loads fine Chrome: Everything loads fine
Experiment 4
Run flask dev server with threaded=True
Firefox: Everything loads fine Chromium: Everything loads fine Chrome: Everything loads fine
Experiment Summary
Solution In Production: Easiest fix is to put nginx in front of gunicorn. If you want to understand why nginx solves this issue, here is nice article that explains it: https://www.brianstorti.com/the-role-of-a-reverse-proxy-to-protect-your-application-against-slow-clients/
In dev: Easiest fix is to run flask dev server with threaded=True
. Or you can disable the prediction service in Chrome/Chromium.
gunicorn debug improvements
To help debug an issue like this in the future, I think it's a good idea to add debug log statements next to the select()
and accept()
calls in the sync worker.
https://github.com/benoitc/gunicorn/blob/e974f30517261b2bc95cfb2017a8688f367c8bf3/gunicorn/workers/sync.py#L26
https://github.com/benoitc/gunicorn/blob/e974f30517261b2bc95cfb2017a8688f367c8bf3/gunicorn/workers/sync.py#L34
This would show that a worker has accepted a new TCP connection, but is not receiving any data.
@asnisarenko super write-up, thanks. #1929 is another unusual case of a slow-client which yields similar symptom in the single-threaded server - in that case, TLS handshake to a non-TLS port looks like a slow-writing client, since it fails to send sensible headers quickly.
It may be the Gunicorn, for single-threaded sync workers is going to need some new tunable, an aggressive client-drop for clients which fail to send at least a first request-header line, if not all headers, within a reasonable time.
If this happens in your app, you can add some temporary diagnostic code:
def trace_on_abort():
import signal
import traceback
def print_trace(sig, frame):
print(''.join(traceback.format_stack(frame)))
signal.signal(signal.SIGABRT, print_trace)
Hopefully this should reveal where your app hangs.
For example, I somehow managed to import flask
before gevent.monkey.patch_all()
and so my Flask's app._before_request_lock
ended up being a non-gevent lock (i.e. plain unpatched threading.Lock
). In such a case, two requests in close succession on app start-up would cause it to hang (2nd request would lock the entire thread). But this was my bug, and your bug may differ.
@asnisarenko I'm not sure why you're experiencing a hang. You absolutely do not need a worker per connection, since every worker can cooperatively handle thousands of connections (as long as the handling code doesn't monopolize the CPU / lock the entire thread). Specifically, if the browser stalls the request, it simply means that when gunicorn read()
s from the socket, it will block a greenlet, but since read
should be monkey-patched it'll yield to the other greenlets.
@ikonst
This issue is talking about running gunicorn with default configuration. The default worker class is sync
http://docs.gunicorn.org/en/stable/settings.html#worker-class
sync
workers use a pre-fork model and each worker processes one TCP connection/request at a time.
I don't know what caused this for me, but switching from the default sync
worker type to eventlet
resolved the problem:
CMD pipenv run gunicorn webapp -b 0.0.0.0:8080 -k eventlet
Good luck.
Maybe this is gonna help only me, but since it took me 7h to debug this issue on my side (a Flask/Gunicorn app running in Docker on EC2), if I can spare this headache to a few people, it's a small victory.
The container memory limit that was too low. I initially set it up at 100 MB, but apparently the app is requesting about 110 MB when running... Getting rid of the limit prevented my workers to timeout whith (apparently) no reason.
If I have memory leaks in the future, I'll try to put a higher memory limit, but for now, no limit will do.
Thanks
def trace_on_abort(): import signal import traceback def print_trace(sig, frame): print(''.join(traceback.format_stack(frame))) signal.signal(signal.SIGABRT, print_trace)
You ыo much help me. EHU )))
Where does this trace_on_abort()
method go? In the app code file? Does it need an annotation?
@mattg-vbt add this to the beginning of your app when you need to debug, and remove it once you're done debugging
@ikonst I added this to my app, but I don't see it getting hit. I get the worker timeout, but this method isn't hit.
@mattg-vbt can you try doing kill -ABRT $pid
to see the print_trace
function being called? (the idea is that you'll get SIGABRT from werkzeug when your worker times out, but let's first check that it's called at all)
@asnisarenko for the less savvy amongst us how then would I update the procfile?
web gunicorn app:app -k gevent --worker-connections 1000
?
@SumNeuron If you are using gevent workers you (instead of sync workers) then you should not be hitting this issue. So the command you posted seems fine.
I'm getting this issue with gevent workers only upon initialization, which is a bit annoying as I am processing some tasks before launching my app. But I am setting a high timeout for now. I've posted a reproducible test repo here: https://github.com/zamponotiropita/test-gunicorn-worker-timeout -> test_0 fails, test_1 and test_2 pass
@zamponotiropita Are you trying to do this per-worker or per-app before the forking?
@ikonst please check the run file run.sh
, it's per worker. The problem didn't come up with preloading, but I had issues connecting to the database when preloading, since the app object (and with it together the database connection) gets copied and passed the processes when forking -> my database couldn't handle identical connections of master and workers, and I couldn't find a workaround
I am having a similar issue. I am trying to generate a lot of data on-the-fly with Flask using
this method, and it fails because the worker times out after whatever was set in --timeout
expires. A minimal example to reproduce:
test_gunicorn_timeout.py
import flask
from time import sleep
app = flask.Flask(__name__)
@app.route('/')
def gunicorn_timeout():
def generator():
for _ in range(10):
yield b'Yet another line...'
sleep(2)
return flask.Response(generator(), mimetype='text/plain')
Then run gunicorn --timeout 10 test_gunicorn_timeout:app
and when requesting localhost:8000
after 10 seconds you get a
[CRITICAL] WORKER TIMEOUT
.
I also tried running with -k gevent
and -k eventlet
and nothing changed.
I am on windows 10 Pro. With docker compose
gunicorn app -b 0.0.0.0:8000 -k gevent
Worked for me by installing gevent in my python container
How to install gevent
Then run
gunicorn --timeout 10 test_gunicorn_timeout:app
and when requestinglocalhost:8000
after 10 seconds you get a[CRITICAL] WORKER TIMEOUT
.I also tried running with
-k gevent
and-k eventlet
and nothing changed.
@ltskv
have you found a solution to the [CRITICAL] WORKER TIMEOUT issue?
I have a similar problem with a flask streaming endpoint I am trying to deploy on heroku (which uses gunicorn). There, the proxy frontend is supposed to keep the connection alive as long as some data is sent within 30 sec, but it seems gunicorn simply kills the process if it has not finished within 30 sec even though it is still running and producing data.
The gunicorn documentation is not entierly clear to me on this point. For --timeout it says that Workers silent for more than this many seconds are killed and restarted.
But it seems that workers are killed after 30sec even though they still produce data?
The gunicorn documentation is not entierly clear to me on this point. For --timeout it says that Workers silent for more than this many seconds are killed and restarted. But it seems that workers are killed after 30sec even though they still produce data?
@kurt-hectic that documentation should be improved. By silent, we mean silent from the perspective of the arbiter process, which communicates with the workers through a temporary file. If the worker is busy sending data, it does not update that file. From the perspective of the arbiter, the worker is missing heartbeats.
See also #1974.
Using any worker other than the default, synchronous worker should mitigate the issue with streaming large bodies, because the other workers will heartbeat the temporary file even when they are handling requests.
@kurt-hectic I re-checked once again the -k gevent
option and inserted sleep(0)
between generator iterations and it did actually work for me (I am not sure why it didn't work at the time I posted the question though).
It seems there have been already several reports related to
[CRITICAL] WORKER TIMEOUT
error but it just keeps popping up. Here is my issue.I'm running this Flask hello world application.
The gunicorn command is this one:
And this is the console output:
Can you please clearly explain why do I get the error and if it's expected in this example? How do I fix it or if it's an expected behavior why critical error then?