bigunyak commented 6 years ago

It seems there have been already several reports related to [CRITICAL] WORKER TIMEOUT error but it just keeps popping up. Here is my issue.

I'm running this Flask hello world application.

from flask import Flask
application = Flask(__name__)

@application.route('/')
def hello_world():
    return 'Hello, World!'

The gunicorn command is this one:

gunicorn -b 0.0.0.0:5000 --log-level=debug hello

And this is the console output:

[2018-06-05 14:56:21 +0200] [11229] [INFO] Starting gunicorn 19.8.1
[2018-06-05 14:56:21 +0200] [11229] [DEBUG] Arbiter booted
[2018-06-05 14:56:21 +0200] [11229] [INFO] Listening at: http://0.0.0.0:5000 (11229)
[2018-06-05 14:56:21 +0200] [11229] [INFO] Using worker: sync
[2018-06-05 14:56:21 +0200] [11232] [INFO] Booting worker with pid: 11232
[2018-06-05 14:56:21 +0200] [11229] [DEBUG] 1 workers
[2018-06-05 14:56:32 +0200] [11232] [DEBUG] GET /
[2018-06-05 14:56:57 +0200] [11232] [DEBUG] Closing connection. 
[2018-06-05 14:57:16 +0200] [11232] [DEBUG] GET /
[2018-06-05 14:57:47 +0200] [11229] [CRITICAL] WORKER TIMEOUT (pid:11232)
[2018-06-05 14:57:47 +0200] [11232] [INFO] Worker exiting (pid: 11232)
[2018-06-05 14:57:47 +0200] [11324] [INFO] Booting worker with pid: 11324

Can you please clearly explain why do I get the error and if it's expected in this example? How do I fix it or if it's an expected behavior why critical error then?

tilgovi commented 6 years ago

The error is not expected, but there is nothing from your example that shows why it happens. Tell us more about your environment.

What client is used to connect to Gunicorn?
What reverse proxy, if any, is used to connect to Gunicorn?

bigunyak commented 6 years ago

Tell us more about your environment.

What client is used to connect to Gunicorn? => I just send request from Chromium: http://localhost:5000/
What reverse proxy, if any, is used to connect to Gunicorn? => No proxy, just Gunicorn + Flask

I've just reproduced the problem on a completely fresh setup, here are the steps:

mkdir gunicorn
cd gunicorn/
pipenv --python 3.6
pipenv install flask
pipenv install gunicorn
vim hello.py
pipenv shell
gunicorn -b 0.0.0.0:5000 --log-level=debug hello

The hello.py is exactly the same Flask application as I posted in the initial report. Below is the complete log.

~$ gunicorn -b 0.0.0.0:5000 --log-level=debug hello
[2018-06-06 09:16:21 +0200] [19829] [DEBUG] Current configuration:
  config: None
  bind: ['0.0.0.0:5000']
  backlog: 2048
  workers: 1
  worker_class: sync
  threads: 1
  worker_connections: 1000
  max_requests: 0
  max_requests_jitter: 0
  timeout: 30
  graceful_timeout: 30
  keepalive: 2
  limit_request_line: 4094
  limit_request_fields: 100
  limit_request_field_size: 8190
  reload: False
  reload_engine: auto
  reload_extra_files: []
  spew: False
  check_config: False
  preload_app: False
  sendfile: None
  reuse_port: False
  chdir: /home/dima/work/gunicorn
  daemon: False
  raw_env: []
  pidfile: None
  worker_tmp_dir: None
  user: 1000
  group: 985
  umask: 0
  initgroups: False
  tmp_upload_dir: None
  secure_scheme_headers: {'X-FORWARDED-PROTOCOL': 'ssl', 'X-FORWARDED-PROTO': 'https', 'X-FORWARDED-SSL': 'on'}
  forwarded_allow_ips: ['127.0.0.1']
  accesslog: None
  disable_redirect_access_to_syslog: False
  access_log_format: %(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(f)s" "%(a)s"
  errorlog: -
  loglevel: debug
  capture_output: False
  logger_class: gunicorn.glogging.Logger
  logconfig: None
  logconfig_dict: {}
  syslog_addr: udp://localhost:514
  syslog: False
  syslog_prefix: None
  syslog_facility: user
  enable_stdio_inheritance: False
  statsd_host: None
  statsd_prefix: 
  proc_name: None
  default_proc_name: hello
  pythonpath: None
  paste: None
  on_starting: <function OnStarting.on_starting at 0x7f9757112d08>
  on_reload: <function OnReload.on_reload at 0x7f9757112e18>
  when_ready: <function WhenReady.when_ready at 0x7f9757112f28>
  pre_fork: <function Prefork.pre_fork at 0x7f9756c230d0>
  post_fork: <function Postfork.post_fork at 0x7f9756c231e0>
  post_worker_init: <function PostWorkerInit.post_worker_init at 0x7f9756c232f0>
  worker_int: <function WorkerInt.worker_int at 0x7f9756c23400>
  worker_abort: <function WorkerAbort.worker_abort at 0x7f9756c23510>
  pre_exec: <function PreExec.pre_exec at 0x7f9756c23620>
  pre_request: <function PreRequest.pre_request at 0x7f9756c23730>
  post_request: <function PostRequest.post_request at 0x7f9756c237b8>
  child_exit: <function ChildExit.child_exit at 0x7f9756c238c8>
  worker_exit: <function WorkerExit.worker_exit at 0x7f9756c239d8>
  nworkers_changed: <function NumWorkersChanged.nworkers_changed at 0x7f9756c23ae8>
  on_exit: <function OnExit.on_exit at 0x7f9756c23bf8>
  proxy_protocol: False
  proxy_allow_ips: ['127.0.0.1']
  keyfile: None
  certfile: None
  ssl_version: 2
  cert_reqs: 0
  ca_certs: None
  suppress_ragged_eofs: True
  do_handshake_on_connect: False
  ciphers: TLSv1
  raw_paste_global_conf: []
[2018-06-06 09:16:21 +0200] [19829] [INFO] Starting gunicorn 19.8.1
[2018-06-06 09:16:21 +0200] [19829] [DEBUG] Arbiter booted
[2018-06-06 09:16:21 +0200] [19829] [INFO] Listening at: http://0.0.0.0:5000 (19829)
[2018-06-06 09:16:21 +0200] [19829] [INFO] Using worker: sync
[2018-06-06 09:16:21 +0200] [19832] [INFO] Booting worker with pid: 19832
[2018-06-06 09:16:22 +0200] [19829] [DEBUG] 1 workers
[2018-06-06 09:16:48 +0200] [19832] [DEBUG] GET /
[2018-06-06 09:17:19 +0200] [19829] [CRITICAL] WORKER TIMEOUT (pid:19832)
[2018-06-06 09:17:19 +0200] [19832] [INFO] Worker exiting (pid: 19832)
[2018-06-06 09:17:19 +0200] [19872] [INFO] Booting worker with pid: 19872
^C[2018-06-06 09:17:26 +0200] [19829] [INFO] Handling signal: int
[2018-06-06 09:17:26 +0200] [19872] [INFO] Worker exiting (pid: 19872)
[2018-06-06 09:17:26 +0200] [19829] [INFO] Shutting down: Master

~$ pip list
Package      Version
------------ -------
click        6.7    
Flask        1.0.2  
gunicorn     19.8.1 
itsdangerous 0.24   
Jinja2       2.10   
MarkupSafe   1.0    
pip          10.0.1 
setuptools   39.2.0 
Werkzeug     0.14.1 
wheel        0.31.1

yunstanford commented 6 years ago

@bigunyak I think it's because of the default timeout, your worker has been silent for 30s. http://docs.gunicorn.org/en/stable/settings.html#timeout

From your log,

[2018-06-05 14:57:16 +0200] [11232] [DEBUG] GET /
[2018-06-05 14:57:47 +0200] [11229] [CRITICAL] WORKER TIMEOUT (pid:11232)

[2018-06-06 09:16:48 +0200] [19832] [DEBUG] GET /
[2018-06-06 09:17:19 +0200] [19829] [CRITICAL] WORKER TIMEOUT (pid:19832)

yunstanford commented 6 years ago

I'm seeing the same thing: workers are timing out even when serving no requests, with sync worker. To that sense, the critical level log is quite confusing.

Try use Gevent worker could solve this.

bigunyak commented 6 years ago

To that sense, the critical level log is quite confusing.

Exactly, that was my original question: if it's an expected behavior why critical error then? Would be also nice to get some background on why workers need to be restarted, maybe this could be added to the Design document.

berkerpeksag commented 6 years ago

I'm seeing this as well (reproduced by using examples/test.py with gunicorn test:app -b localhost:9595 --log-level=debug --timeout=5) and I agree that the critical level is a bit confusing. I'd be OK change it to debug level. @benoitc @tilgovi what do you think?

yunstanford commented 6 years ago

I'm thinking info level might be a little bit better.

neocolor commented 6 years ago

I had the same with MSYS2 on Win10 but finally could solved.

in notify() of ...\gunicorn\workers\workertmp.py, os.fchmod is used originally. But it does not work in MSYS. Instead of os.fchmod, I used os.utime. The code is followed. I think it could work for all platform.

    def notify(self):
        try:
            self.spinner = (self.spinner + 1) % 2
            os.fchmod(self._tmp.fileno(), self.spinner)
            if PLATFORM.startswith('MSYS') :
                os.utime(self._tmp.fileno(), None)
        except AttributeError:
            # python < 2.6
            self._tmp.truncate(0)
            os.write(self._tmp.fileno(), b"X")

benoitc commented 6 years ago

@berkerpeksag I wouldn't expect that the worker exit because no requests happen. This error should only happen if the worker has been kept busy for a time > to the timeout. So the error is critical. Imo we should improve the documentation to provide more use cases and responses to such errors.

If the error still happen when the worker is not kept busy then there is something else happening and we have probably a bug.

kozlek commented 6 years ago

[EDIT] Same bug for me. With Django 1.10 / gunicorn 19.6.0 / Python 2.7.15 in python2.7-alpine on Debian 8.8 and stock kernel 3.16, all was working fine. After updating to Django 1.11 and gunicorn 19.8.1, the workers keep failing at boot with [CRITICAL WORKER TIMEOUT]. Downgrading gunicorn to 19.6.0 doesn't fix the problem. We updated the host kernel to 4.9.0 (it was scheduled), and the workers successfully booted without errors. But:

we use 4 workers
after exactly 4 calls to the Django app, next calls will timeout, showing [CRITICAL WORKER TIMEOUT] in the logs
the linux top command shows 4 gunicorn processes stuck with pretty high CPU consumption.

We're going to try gunicorn gevent to see if we'are able to get our app back online.

kozlek commented 6 years ago

Using gunicorn with gevent didn't fixed the bug.

baeriswyld commented 6 years ago

any update on this issue ?

tilgovi commented 6 years ago

It looks like @neocolor identified a real bug under MSYS. It might deserve a separate issue.

@bigunyak what platform are you running under? I have tried to reproduce with the simple example and I cannot do it following exactly the steps outlined above. This agrees with my experience running multiple applications in production on multiple frameworks. The worker notification system has not changed recently, to my knowledge. My platform is Python 3.7 on MacOS 10.13.6, but I run Gunicorn in production with sync workers for several applications on Python 2.7 and 3.6.5 and only see worker timeouts when there are legitimately long requests that block the workers.

For @Tberdy: what happens if you try to set --worker-tmp-dir to somewhere outside the tmpfs filesystem? I am just wondering if maybe alpine or docker or the combination interferes somehow with how workers notify the arbiter.

tilgovi commented 6 years ago

See also #1388 for Docker related tmpfs issues.

timoj58 commented 6 years ago

i have this issue to.

cjmacharia commented 6 years ago

I have this issue too, gunicorn sync was working perfectly well until yesternight, ut started reporting, workers timeout [CRITICAL] using gevent seems to solve my issue, but I'd really want to know why this happened .

benoitc commented 6 years ago

@timoj58 @cjmash can you provides more detail about the issue ? How are you running gunicorn (in a vm?, options ...), which fileystem, OS? ANthing that could help to reproduce would be very helpful :)

cjmacharia commented 6 years ago

@benoitc I am running gunicorn to start my Django project on kubernetes my gunicorn arguments are --bind=$port --workers=7 --timeout=1200 --log-level=debug --access-logfile - error-logfile -"

the errors i get from the logs


E  [2018-08-09 21:47:56 +0000] [14] [INFO] Booting worker with pid: 14

E  [2018-08-09 21:47:56 +0000] [12] [INFO] Booting worker with pid: 12

E  [2018-08-09 21:47:56 +0000] [1] [DEBUG] 7 workers

E  [2018-08-09 21:47:56 +0000] [11] [INFO] Booting worker with pid: 11

E  [2018-08-09 21:47:55 +0000] [10] [INFO] Booting worker with pid: 10

E  [2018-08-09 21:47:55 +0000] [9] [INFO] Booting worker with pid: 9

E  [2018-08-09 21:47:55 +0000] [8] [INFO] Booting worker with pid: 8

E  [2018-08-09 21:47:55 +0000] [1] [INFO] Using worker: sync

E  [2018-08-09 21:47:55 +0000] [1] [INFO] Listening at: http://0.0.0.0:4040 (1)

E  [2018-08-09 21:47:55 +0000] [1] [DEBUG] Arbiter booted

E  [2018-08-09 21:47:55 +0000] [1] [INFO] Starting gunicorn 19.7.1

E    raw_paste_global_conf: []

E    ciphers: TLSv1

E    do_handshake_on_connect: False

E    suppress_ragged_eofs: True

E    ca_certs: None

E    cert_reqs: 0

E    ssl_version: 2

E    certfile: None

E    keyfile: None

E    proxy_allow_ips: ['127.0.0.1']

E    proxy_protocol: False

E    on_exit: <function OnExit.on_exit at 0x7f9612aad7b8>

E    nworkers_changed: <function NumWorkersChanged.nworkers_changed at 0x7f9612aad620>

E    worker_exit: <function WorkerExit.worker_exit at 0x7f9612aad488>

E    child_exit: <function ChildExit.child_exit at 0x7f9612aad2f0>

E    post_request: <function PostRequest.post_request at 0x7f9612aad158>

E    pre_request: <function PreRequest.pre_request at 0x7f9612aad048>

E    pre_exec: <function PreExec.pre_exec at 0x7f9612aa8e18>

E    worker_abort: <function WorkerAbort.worker_abort at 0x7f9612aa8c80>

E    worker_int: <function WorkerInt.worker_int at 0x7f9612aa8ae8>

E    post_worker_init: <function PostWorkerInit.post_worker_init at 0x7f9612aa8950>

E    post_fork: <function Postfork.post_fork at 0x7f9612aa87b8>

E    pre_fork: <function Prefork.pre_fork at 0x7f9612aa8620>

E    when_ready: <function WhenReady.when_ready at 0x7f9612aa8488>

E    on_reload: <function OnReload.on_reload at 0x7f9612aa82f0>

E    on_starting: <function OnStarting.on_starting at 0x7f9612aa8158>

E    paste: None

E    pythonpath: None

E    default_proc_name: art.wsgi

E    proc_name: None

E    statsd_prefix: 

E    statsd_host: None

E    enable_stdio_inheritance: False

E    syslog_facility: user

E    syslog_prefix: None

E    syslog: False

E    syslog_addr: udp://localhost:514

E    logconfig: None

E    logger_class: gunicorn.glogging.Logger

E    capture_output: False

E    loglevel: debug

E    errorlog: -

E    access_log_format: %(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(f)s" "%(a)s"

E    accesslog: -

E    forwarded_allow_ips: ['127.0.0.1']

E    secure_scheme_headers: {'X-FORWARDED-PROTOCOL': 'ssl', 'X-FORWARDED-PROTO': 'https', 'X-FORWARDED-SSL': 'on'}

E    tmp_upload_dir: None

E    initgroups: False

E    umask: 0

E    group: 0

E    user: 0

E    worker_tmp_dir: None

E    pidfile: None

E    raw_env: []

E    daemon: False

E    chdir: /usr/src/app

E    sendfile: None

E    preload_app: False

E    check_config: False

E    spew: False

E    reload_engine: auto

E    reload: False

E    limit_request_field_size: 8190

E    limit_request_fields: 100

E    limit_request_line: 4094

E    keepalive: 2

E    graceful_timeout: 30

E    timeout: 1200

E    max_requests_jitter: 0

E    max_requests: 0

E    worker_connections: 1000

E    threads: 1

E    worker_class: sync

E    workers: 7

E    backlog: 2048

E    bind: ['0.0.0.0:4040']

E    config: None

E  [2018-08-09 21:47:55 +0000] [1] [DEBUG] Current configuration:

I    No migrations to apply.

I  Running migrations:

I    Apply all migrations: admin, auth, contenttypes, core, dashboard, jet, oauth2_provider, sessions

I  Operations to perform:

E  [2018-08-09 21:47:20 +0000] [13] [INFO] Worker exiting (pid: 13)

E    os.path.dirname(os.path.dirname(__file__)), '.env'))

E  /usr/src/app/art/wsgi.py:19: UserWarning: Not reading /usr/src/app/.env - it doesn't exist.

E  [2018-08-09 21:47:00 +0000] [21] [INFO] Booting worker with pid: 21

E  [2018-08-09 21:47:00 +0000] [1] [INFO] Handling signal: term

E  [2018-08-09 21:46:35 +0000] [12] [INFO] Booting worker with pid: 12

E  [2018-08-09 21:46:34 +0000] [13] [INFO] Booting worker with pid: 13

E  [2018-08-09 21:46:34 +0000] [11] [INFO] Booting worker with pid: 11

E  [2018-08-09 21:46:34 +0000] [1] [DEBUG] 7 workers

E  [2018-08-09 21:46:34 +0000] [10] [INFO] Booting worker with pid: 10

E  [2018-08-09 21:46:34 +0000] [9] [INFO] Booting worker with pid: 9

E  [2018-08-09 21:46:34 +0000] [8] [INFO] Booting worker with pid: 8

E  [2018-08-09 21:46:34 +0000] [7] [INFO] Booting worker with pid: 7

E  [2018-08-09 21:46:34 +0000] [1] [INFO] Using worker: sync

E  [2018-08-09 21:46:34 +0000] [1] [INFO] Listening at: http://0.0.0.0:4040 (1)

E  [2018-08-09 21:46:34 +0000] [1] [DEBUG] Arbiter booted

E  [2018-08-09 21:46:34 +0000] [1] [INFO] Starting gunicorn 19.7.1

E    raw_paste_global_conf: []

E    ciphers: TLSv1

E    do_handshake_on_connect: False

E    suppress_ragged_eofs: True

E    ca_certs: None

E    cert_reqs: 0

E    ssl_version: 2

E    certfile: None

E    keyfile: None

E    proxy_allow_ips: ['127.0.0.1']

E    proxy_protocol: False

E    on_exit: <function OnExit.on_exit at 0x7f07f2cfe7b8>

E    nworkers_changed: <function NumWorkersChanged.nworkers_changed at 0x7f07f2cfe620>

E    worker_exit: <function WorkerExit.worker_exit at 0x7f07f2cfe488>

E    child_exit: <function ChildExit.child_exit at 0x7f07f2cfe2f0>

E    post_request: <function PostRequest.post_request at 0x7f07f2cfe158>

E    pre_request: <function PreRequest.pre_request at 0x7f07f2cfe048>

E    pre_exec: <function PreExec.pre_exec at 0x7f07f2cf9e18>

E    worker_abort: <function WorkerAbort.worker_abort at 0x7f07f2cf9c80>

E    worker_int: <function WorkerInt.worker_int at 0x7f07f2cf9ae8>

E    post_worker_init: <function PostWorkerInit.post_worker_init at 0x7f07f2cf9950>

E    post_fork: <function Postfork.post_fork at 0x7f07f2cf97b8>

E    pre_fork: <function Prefork.pre_fork at 0x7f07f2cf9620>

E    when_ready: <function WhenReady.when_ready at 0x7f07f2cf9488>

E    on_reload: <function OnReload.on_reload at 0x7f07f2cf92f0>

E    on_starting: <function OnStarting.on_starting at 0x7f07f2cf9158>

E    paste: None

E    pythonpath: None

E    default_proc_name: art.wsgi

E    proc_name: None

E    statsd_prefix: 

E    statsd_host: None

E    enable_stdio_inheritance: False

E    syslog_facility: user

E    syslog_prefix: None

E    syslog: False

E    syslog_addr: udp://localhost:514

E    logconfig: None

E    logger_class: gunicorn.glogging.Logger

E    capture_output: False

E    loglevel: debug

E    errorlog: -

E    access_log_format: %(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(f)s" "%(a)s"

E    accesslog: -

E    forwarded_allow_ips: ['127.0.0.1']

E    secure_scheme_headers: {'X-FORWARDED-PROTOCOL': 'ssl', 'X-FORWARDED-PROTO': 'https', 'X-FORWARDED-SSL': 'on'}

E    tmp_upload_dir: None

E    initgroups: False

E    umask: 0

E    group: 0

E    user: 0

E    worker_tmp_dir: None

E    pidfile: None

E    raw_env: []

E    daemon: False

E    chdir: /usr/src/app

E    sendfile: None

E    preload_app: False

E    check_config: False

E    spew: False

E    reload_engine: auto

E    reload: False

E    limit_request_field_size: 8190

E    limit_request_fields: 100

E    limit_request_line: 4094

E    keepalive: 2

E    graceful_timeout: 30

E    timeout: 1200

E    max_requests_jitter: 0

E    max_requests: 0

E    worker_connections: 1000

E    threads: 1

E    worker_class: sync

E    workers: 7

E    backlog: 2048

E    bind: ['0.0.0.0:4040']

E    config: None

E  [2018-08-09 21:46:34 +0000] [1] [DEBUG] Current configuration:

bigunyak commented 6 years ago

I struggled a bit to reproduce the problem this time but it's still there in the latest gunicorn version 19.9.0. The steps to reproduce it are exactly the same as I described in this post. My system is Arch Linux x86_64 GNU/Linux (kernel 4.17.2-1-ARCH), Python 3.6.5 Here is a fresh log trace.

~$ gunicorn -b 0.0.0.0:5000 --log-level=debug hello
[2018-08-10 09:48:40 +0200] [23114] [DEBUG] Current configuration:
  config: None
  bind: ['0.0.0.0:5000']
  backlog: 2048
  workers: 1
  worker_class: sync
  threads: 1
  worker_connections: 1000
  max_requests: 0
  max_requests_jitter: 0
  timeout: 30
  graceful_timeout: 30
  keepalive: 2
  limit_request_line: 4094
  limit_request_fields: 100
  limit_request_field_size: 8190
  reload: False
  reload_engine: auto
  reload_extra_files: []
  spew: False
  check_config: False
  preload_app: False
  sendfile: None
  reuse_port: False
  chdir: /home/dima/lerning/python/modules/gunicorn
  daemon: False
  raw_env: []
  pidfile: None
  worker_tmp_dir: None
  user: 1000
  group: 985
  umask: 0
  initgroups: False
  tmp_upload_dir: None
  secure_scheme_headers: {'X-FORWARDED-PROTOCOL': 'ssl', 'X-FORWARDED-PROTO': 'https', 'X-FORWARDED-SSL': 'on'}
  forwarded_allow_ips: ['127.0.0.1']
  accesslog: None
  disable_redirect_access_to_syslog: False
  access_log_format: %(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(f)s" "%(a)s"
  errorlog: -
  loglevel: debug
  capture_output: False
  logger_class: gunicorn.glogging.Logger
  logconfig: None
  logconfig_dict: {}
  syslog_addr: udp://localhost:514
  syslog: False
  syslog_prefix: None
  syslog_facility: user
  enable_stdio_inheritance: False
  statsd_host: None
  statsd_prefix: 
  proc_name: None
  default_proc_name: hello
  pythonpath: None
  paste: None
  on_starting: <function OnStarting.on_starting at 0x7f5f8b9dad08>
  on_reload: <function OnReload.on_reload at 0x7f5f8b9dae18>
  when_ready: <function WhenReady.when_ready at 0x7f5f8b9daf28>
  pre_fork: <function Prefork.pre_fork at 0x7f5f8b4ec0d0>
  post_fork: <function Postfork.post_fork at 0x7f5f8b4ec1e0>
  post_worker_init: <function PostWorkerInit.post_worker_init at 0x7f5f8b4ec2f0>
  worker_int: <function WorkerInt.worker_int at 0x7f5f8b4ec400>
  worker_abort: <function WorkerAbort.worker_abort at 0x7f5f8b4ec510>
  pre_exec: <function PreExec.pre_exec at 0x7f5f8b4ec620>
  pre_request: <function PreRequest.pre_request at 0x7f5f8b4ec730>
  post_request: <function PostRequest.post_request at 0x7f5f8b4ec7b8>
  child_exit: <function ChildExit.child_exit at 0x7f5f8b4ec8c8>
  worker_exit: <function WorkerExit.worker_exit at 0x7f5f8b4ec9d8>
  nworkers_changed: <function NumWorkersChanged.nworkers_changed at 0x7f5f8b4ecae8>
  on_exit: <function OnExit.on_exit at 0x7f5f8b4ecbf8>
  proxy_protocol: False
  proxy_allow_ips: ['127.0.0.1']
  keyfile: None
  certfile: None
  ssl_version: 2
  cert_reqs: 0
  ca_certs: None
  suppress_ragged_eofs: True
  do_handshake_on_connect: False
  ciphers: TLSv1
  raw_paste_global_conf: []
[2018-08-10 09:48:40 +0200] [23114] [INFO] Starting gunicorn 19.9.0
[2018-08-10 09:48:40 +0200] [23114] [DEBUG] Arbiter booted
[2018-08-10 09:48:40 +0200] [23114] [INFO] Listening at: http://0.0.0.0:5000 (23114)
[2018-08-10 09:48:40 +0200] [23114] [INFO] Using worker: sync
[2018-08-10 09:48:40 +0200] [23117] [INFO] Booting worker with pid: 23117
[2018-08-10 09:48:40 +0200] [23114] [DEBUG] 1 workers
[2018-08-10 09:48:45 +0200] [23117] [DEBUG] GET /
[2018-08-10 09:48:54 +0200] [23117] [DEBUG] GET /
[2018-08-10 09:49:00 +0200] [23117] [DEBUG] GET /
[2018-08-10 09:49:18 +0200] [23117] [DEBUG] Closing connection. 
[2018-08-10 09:49:18 +0200] [23117] [DEBUG] GET /
[2018-08-10 09:49:23 +0200] [23117] [DEBUG] GET /
[2018-08-10 09:49:37 +0200] [23117] [DEBUG] Closing connection. 
[2018-08-10 09:49:37 +0200] [23117] [DEBUG] GET /
[2018-08-10 09:49:50 +0200] [23117] [DEBUG] Closing connection. 
[2018-08-10 09:51:11 +0200] [23117] [DEBUG] GET /
[2018-08-10 09:51:13 +0200] [23117] [DEBUG] GET /
[2018-08-10 09:51:43 +0200] [23114] [CRITICAL] WORKER TIMEOUT (pid:23117)
[2018-08-10 09:51:43 +0200] [23117] [INFO] Worker exiting (pid: 23117)
[2018-08-10 09:51:44 +0200] [23229] [INFO] Booting worker with pid: 23229

As before to test it I was just hitting http://0.0.0.0:5000/ in Chromium. Below is the content of Pipfile and Pipfile.lock for you to see the exact environment.

Pipfile


[[source]]
url = "https://pypi.org/simple"
verify_ssl = true
name = "pypi"

[packages] flask = "" gunicorn = ""

[dev-packages]

[requires] python_version = "3.6"

* Pipfile.lock
```json
{
    "_meta": {
        "hash": {
            "sha256": "81cb5d5f0b11719d8d9c5ec9cc683fdcf959c652fda256d5552a82d0f459a99c"
        },
        "pipfile-spec": 6,
        "requires": {
            "python_version": "3.6"
        },
        "sources": [
            {
                "name": "pypi",
                "url": "https://pypi.org/simple",
                "verify_ssl": true
            }
        ]
    },
    "default": {
        "click": {
            "hashes": [
                "sha256:29f99fc6125fbc931b758dc053b3114e55c77a6e4c6c3a2674a2dc986016381d",
                "sha256:f15516df478d5a56180fbf80e68f206010e6d160fc39fa508b65e035fd75130b"
            ],
            "version": "==6.7"
        },
        "flask": {
            "hashes": [
                "sha256:2271c0070dbcb5275fad4a82e29f23ab92682dc45f9dfbc22c02ba9b9322ce48",
                "sha256:a080b744b7e345ccfcbc77954861cb05b3c63786e93f2b3875e0913d44b43f05"
            ],
            "index": "pypi",
            "version": "==1.0.2"
        },
        "gunicorn": {
            "hashes": [
                "sha256:aa8e0b40b4157b36a5df5e599f45c9c76d6af43845ba3b3b0efe2c70473c2471",
                "sha256:fa2662097c66f920f53f70621c6c58ca4a3c4d3434205e608e121b5b3b71f4f3"
            ],
            "index": "pypi",
            "version": "==19.9.0"
        },
        "itsdangerous": {
            "hashes": [
                "sha256:cbb3fcf8d3e33df861709ecaf89d9e6629cff0a217bc2848f1b41cd30d360519"
            ],
            "version": "==0.24"
        },
        "jinja2": {
            "hashes": [
                "sha256:74c935a1b8bb9a3947c50a54766a969d4846290e1e788ea44c1392163723c3bd",
                "sha256:f84be1bb0040caca4cea721fcbbbbd61f9be9464ca236387158b0feea01914a4"
            ],
            "version": "==2.10"
        },
        "markupsafe": {
            "hashes": [
                "sha256:a6be69091dac236ea9c6bc7d012beab42010fa914c459791d627dad4910eb665"
            ],
            "version": "==1.0"
        },
        "werkzeug": {
            "hashes": [
                "sha256:c3fd7a7d41976d9f44db327260e263132466836cef6f91512889ed60ad26557c",
                "sha256:d5da73735293558eb1651ee2fddc4d0dedcfa06538b8813a2e20011583c9e49b"
            ],
            "version": "==0.14.1"
        }
    },
    "develop": {}
}

jdeltoft commented 6 years ago

Just FYI, I'm also seeing this failure very regularly with:

nohup gunicorn -w 8 --access-logfile=- --bind 127.0.0.1:5000 wsgi &

I just have a wsgi.py that has

from chart_api import application
if __name__ == "__main__":
    application.run()

Let me know if there is anything you want me to try/experiment or if there are specifics in the logs you want me to check. I'm running flask on a GCP VM.

timoj58 commented 6 years ago

sorry for late reply. I am running it as

gunicorn --log-file=/home/ubuntu/log/gunicorn.log predictor_api:app -b localhost:5000 &

i did use the gevent setting etc, but i have changed my design of what i required this for in order to work around issue, hence the basic setting above (which also failed but this is to be expected given no gevent)

Python version 3.6 env: aws tensorflow_p36 optmized (ubuntu) nginx is sat in front of gunicorn, which is executing a flask app.

flask version 1.0.2 nginx version 1.10.3 gunicorn version 19.9.0

I did change the nginx timeouts as well incase this may have caused it.

tiru1930 commented 6 years ago

Facing the same problem with gunicorn server

gunicorn ApplicationServer:app -b 0.0.0.0:6001 -w 8 --threads 4 --backlog 2048 \

--timeout 120 --graceful-timeout 60 --access-logfile logs/access.log \

--error-logfile logs/error.log --log-level=info

Flask==0.12.1 gunicorn==19.7.1

when i start server with above command, my system got freezing for some time and worker pids are keeping on booting , though i kept timeout 120 sec, and the server is not accepting single request.

anilpai commented 6 years ago

Any update on this issue ? I have same issue

[CRITICAL] WORKER TIMEOUT

javabrett commented 6 years ago

Wondering whether anyone has successfully reproduced this in a Docker image?

SuiTissue commented 6 years ago

Also seeing this when trying to implement datadog's ddtrace-run worker on an existing application starting with gunicorn -k gevent --threads 4.

Funny trace of a SystemExit I've never seen before too... [2018-11-07 11:11:50 +0000] [15987] [INFO] Booting worker with pid: 15987 [2018-11-07 11:11:50 +0000] [15977] [DEBUG] 1 workers [2018-11-07 11:12:20 +0000] [15977] [CRITICAL] WORKER TIMEOUT (pid:15987) Exception SystemExit: 1 in <bound method LibevLoop._loop_will_run of <cassandra.io.libevreactor.LibevLoop object at 0x7f3cb66d4a50>> ignored

anilpai commented 6 years ago

I am able to resolve this issue by matching the number of workers and number of threads.

I had set workers = (2 * cpu_count) + 1 and did not set threads.

Once I changed threads = workers , everything started working fine. Just in case, if this helps someone.

This is how it looks now

def run(host='0.0.0.0', port=8080, workers=1 + (multiprocessing.cpu_count() * 2)):
    """Run the app with Gunicorn."""

    if app.debug:
        app.run(host, int(port), use_reloader=False)
    else:
        gunicorn = WSGIApplication()
        gunicorn.load_wsgiapp = lambda: app
        gunicorn.cfg.set('bind', '%s:%s' % (host, port))
        gunicorn.cfg.set('workers', workers)
        gunicorn.cfg.set('threads', workers)
        gunicorn.cfg.set('pidfile', None)
        gunicorn.cfg.set('worker_class', 'sync')
        gunicorn.cfg.set('keepalive', 10)
        gunicorn.cfg.set('accesslog', '-')
        gunicorn.cfg.set('errorlog', '-')
        gunicorn.cfg.set('reload', True)
        gunicorn.chdir()
        gunicorn.run()

clintonb commented 5 years ago

I encountered this issue running Django with a single Docker container on AWS Elastic Beanstalk. I resolved the issue by fixing my security groups to ensure my EC2 instance could talk to my RDS instance. I recognize this may not be the solution for 99% of folks on this issue, but I'm leaving this note to help others avoid wasting hours falling down this rabbit hole.

fred-revel commented 5 years ago

In my case the issue was resolved by performing systemctl daemon-reload. Long story short I knew of the default timeout at 30sec, but was unable to change it until i figured i had to of course reload the systemd daemon to apply the changes to the service.

asnisarenko commented 5 years ago

@bigunyak @benoitc @tilgovi After a long google goose chase and some experimentation I believe the root cause of this issue is chrome "preconnect/prediction service" (which is enabled by default in both Chrome and Chromium).

@jeiting did a nice write up about this issue https://hackernoon.com/chrome-preconnect-breaks-singly-threaded-servers-95944be16400 Some additional reading: https://github.com/corydolphin/flask-cors/issues/147 https://github.com/pallets/flask/issues/2169

TLDR In some cases Chrome/Chromium opens and holds an "empty" TCP connection(s) (i.e its predicting that its going to fetch another resource soon). If the "empty" TCP connection(s) hits your gunicorn server first then any subsequent "real" request from chrome can get stuck behind the "empty" request until the worker(s) handling the "empty" request times out. This is more likely to happen if you are only running a single sync worker in gunicorn. But as my experiments found it can happen even if you are running multiple sync workers.

My Environment

My native OS is Ubuntu 17

I have a rest API running in a local docker container. Relevant lines from the Dockefile

FROM ubuntu:18.04
...
RUN pip3 install Flask==1.0.2
RUN pip3 install gunicorn==19.9.0
RUN pip3 install flask-cors==3.0.6
......

I have 3 browsers installed: Firefox 61.0.1, Chromium 67.0.3396.99, Chrome 70.0.3538.102
I have a react app that is served from a different docker container on a different port
The react app makes CORS AJAX requests to my API (they are CORS requests because the javascript is served on one localhost port while its sending API calls to another localhost port)
The request that always gets "blocked" in my experiments is a CORS OPTIONS request (triggered by the browser to ask for permission to then perform a POST request ). Seems logical that a browser would try to make predictive connections with the OPTIONS call because it's likely going to follow up with a POST call.

Experiment 1 Gunicorn config: 1 sync worker (default 30s timeout)

Firefox: On almost every react page load, the CORS OPTIONS request gets blocked for 5s, and then succeeds. Chromium: On EVERY react page load, CORS OPTIONS request gets blocked for 1.5minutes!!!! and then succeeds. Chromium(prediction service disabled): Everything loads fine Chrome: Everything loads fine

Experiment 2 Gunicorn config: 4 sync workers (default 30s timeout)

Firefox: Everything loads fine Chromium: On every 3rd react page load, CORS OPTIONS request gets blocked for 30s and then succeeds. Chromium(prediction service disabled): Everything loads fine Chrome: Everything loads fine

Experiment 3 Gunicorn config: 8 sync workers (default 30s timeout)

Firefox: Everything loads fine Chromium: Everything loads fine Chrome: Everything loads fine

Experiment 4 Run flask dev server with threaded=True

Firefox: Everything loads fine Chromium: Everything loads fine Chrome: Everything loads fine

Experiment Summary

Firefox seems to also have prediction service, but it seems more graceful. (Experiment 1)
Chromium seems the most aggressive in creating "prediction" connections. It appears to launch up to 3-4 "prediction" connections during an OPTIONS request (Experiment 1 and 2)
Surprisingly even though Chrome also has prediction service enabled it didn't cause any issues in any of the experiments (maybe the order or the number of "prediction" connections is different from Chromium)

Solution In Production: Easiest fix is to put nginx in front of gunicorn. If you want to understand why nginx solves this issue, here is nice article that explains it: https://www.brianstorti.com/the-role-of-a-reverse-proxy-to-protect-your-application-against-slow-clients/

In dev: Easiest fix is to run flask dev server with threaded=True. Or you can disable the prediction service in Chrome/Chromium.

gunicorn debug improvements To help debug an issue like this in the future, I think it's a good idea to add debug log statements next to the select() and accept() calls in the sync worker. https://github.com/benoitc/gunicorn/blob/e974f30517261b2bc95cfb2017a8688f367c8bf3/gunicorn/workers/sync.py#L26 https://github.com/benoitc/gunicorn/blob/e974f30517261b2bc95cfb2017a8688f367c8bf3/gunicorn/workers/sync.py#L34 This would show that a worker has accepted a new TCP connection, but is not receiving any data.

javabrett commented 5 years ago

@asnisarenko super write-up, thanks. #1929 is another unusual case of a slow-client which yields similar symptom in the single-threaded server - in that case, TLS handshake to a non-TLS port looks like a slow-writing client, since it fails to send sensible headers quickly.

It may be the Gunicorn, for single-threaded sync workers is going to need some new tunable, an aggressive client-drop for clients which fail to send at least a first request-header line, if not all headers, within a reasonable time.

ikonst commented 5 years ago

If this happens in your app, you can add some temporary diagnostic code:

def trace_on_abort():
    import signal
    import traceback

    def print_trace(sig, frame):
        print(''.join(traceback.format_stack(frame)))

    signal.signal(signal.SIGABRT, print_trace)

Hopefully this should reveal where your app hangs.

For example, I somehow managed to import flask before gevent.monkey.patch_all() and so my Flask's app._before_request_lock ended up being a non-gevent lock (i.e. plain unpatched threading.Lock). In such a case, two requests in close succession on app start-up would cause it to hang (2nd request would lock the entire thread). But this was my bug, and your bug may differ.

@asnisarenko I'm not sure why you're experiencing a hang. You absolutely do not need a worker per connection, since every worker can cooperatively handle thousands of connections (as long as the handling code doesn't monopolize the CPU / lock the entire thread). Specifically, if the browser stalls the request, it simply means that when gunicorn read()s from the socket, it will block a greenlet, but since read should be monkey-patched it'll yield to the other greenlets.

asnisarenko commented 5 years ago

@ikonst This issue is talking about running gunicorn with default configuration. The default worker class is sync http://docs.gunicorn.org/en/stable/settings.html#worker-class

sync workers use a pre-fork model and each worker processes one TCP connection/request at a time.

bobf commented 5 years ago

I don't know what caused this for me, but switching from the default sync worker type to eventlet resolved the problem:

CMD pipenv run gunicorn webapp -b 0.0.0.0:8080 -k eventlet

Good luck.

edthrn commented 5 years ago

Maybe this is gonna help only me, but since it took me 7h to debug this issue on my side (a Flask/Gunicorn app running in Docker on EC2), if I can spare this headache to a few people, it's a small victory.

The container memory limit that was too low. I initially set it up at 100 MB, but apparently the app is requesting about 110 MB when running... Getting rid of the limit prevented my workers to timeout whith (apparently) no reason.

If I have memory leaks in the future, I'll try to put a higher memory limit, but for now, no limit will do.

Thanks

malinich commented 5 years ago

def trace_on_abort():
    import signal
    import traceback

    def print_trace(sig, frame):
        print(''.join(traceback.format_stack(frame)))

    signal.signal(signal.SIGABRT, print_trace)

You ыo much help me. EHU )))

mattg-vbt commented 5 years ago

Where does this trace_on_abort() method go? In the app code file? Does it need an annotation?

ikonst commented 5 years ago

@mattg-vbt add this to the beginning of your app when you need to debug, and remove it once you're done debugging

mattg-vbt commented 5 years ago

@ikonst I added this to my app, but I don't see it getting hit. I get the worker timeout, but this method isn't hit.

ikonst commented 5 years ago

@mattg-vbt can you try doing kill -ABRT $pid to see the print_trace function being called? (the idea is that you'll get SIGABRT from werkzeug when your worker times out, but let's first check that it's called at all)

SumNeuron commented 5 years ago

@asnisarenko for the less savvy amongst us how then would I update the procfile?

web gunicorn app:app -k gevent --worker-connections 1000 ?

asnisarenko commented 5 years ago

@SumNeuron If you are using gevent workers you (instead of sync workers) then you should not be hitting this issue. So the command you posted seems fine.

orestis-z commented 5 years ago

I'm getting this issue with gevent workers only upon initialization, which is a bit annoying as I am processing some tasks before launching my app. But I am setting a high timeout for now. I've posted a reproducible test repo here: https://github.com/zamponotiropita/test-gunicorn-worker-timeout -> test_0 fails, test_1 and test_2 pass

ikonst commented 5 years ago

@zamponotiropita Are you trying to do this per-worker or per-app before the forking?

orestis-z commented 5 years ago

@ikonst please check the run file run.sh, it's per worker. The problem didn't come up with preloading, but I had issues connecting to the database when preloading, since the app object (and with it together the database connection) gets copied and passed the processes when forking -> my database couldn't handle identical connections of master and workers, and I couldn't find a workaround

ltskv commented 5 years ago

I am having a similar issue. I am trying to generate a lot of data on-the-fly with Flask using this method, and it fails because the worker times out after whatever was set in --timeout expires. A minimal example to reproduce:

test_gunicorn_timeout.py

import flask
from time import sleep

app = flask.Flask(__name__)

@app.route('/')
def gunicorn_timeout():
    def generator():
        for _ in range(10):
            yield b'Yet another line...'
            sleep(2)
    return flask.Response(generator(), mimetype='text/plain')

Then run gunicorn --timeout 10 test_gunicorn_timeout:app and when requesting localhost:8000 after 10 seconds you get a [CRITICAL] WORKER TIMEOUT.

I also tried running with -k gevent and -k eventlet and nothing changed.

ShubhamKhandare commented 5 years ago

I am on windows 10 Pro. With docker compose gunicorn app -b 0.0.0.0:8000 -k gevent Worked for me by installing gevent in my python container How to install gevent

kurt-hectic commented 5 years ago

Then run gunicorn --timeout 10 test_gunicorn_timeout:app and when requesting localhost:8000 after 10 seconds you get a [CRITICAL] WORKER TIMEOUT.

I also tried running with -k gevent and -k eventlet and nothing changed.

@ltskv

have you found a solution to the [CRITICAL] WORKER TIMEOUT issue?

I have a similar problem with a flask streaming endpoint I am trying to deploy on heroku (which uses gunicorn). There, the proxy frontend is supposed to keep the connection alive as long as some data is sent within 30 sec, but it seems gunicorn simply kills the process if it has not finished within 30 sec even though it is still running and producing data.

The gunicorn documentation is not entierly clear to me on this point. For --timeout it says that Workers silent for more than this many seconds are killed and restarted. But it seems that workers are killed after 30sec even though they still produce data?

tilgovi commented 5 years ago

The gunicorn documentation is not entierly clear to me on this point. For --timeout it says that Workers silent for more than this many seconds are killed and restarted. But it seems that workers are killed after 30sec even though they still produce data?

@kurt-hectic that documentation should be improved. By silent, we mean silent from the perspective of the arbiter process, which communicates with the workers through a temporary file. If the worker is busy sending data, it does not update that file. From the perspective of the arbiter, the worker is missing heartbeats.

benoitc / gunicorn

CRITICAL WORKER TIMEOUT when running Flask app #1801

gunicorn ApplicationServer:app -b 0.0.0.0:6001 -w 8 --threads 4 --backlog 2048 \

--timeout 120 --graceful-timeout 60 --access-logfile logs/access.log \

--error-logfile logs/error.log --log-level=info