benoitc / gunicorn

gunicorn 'Green Unicorn' is a WSGI HTTP Server for UNIX, fast clients and sleepy applications.
http://www.gunicorn.org

CRITICAL WORKER TIMEOUT when running Flask app #1801

Closed bigunyak closed 4 years ago

bigunyak commented 6 years ago

It seems there have already been several reports related to the [CRITICAL] WORKER TIMEOUT error, but it just keeps popping up. Here is my issue.

I'm running this Flask hello world application.

from flask import Flask
application = Flask(__name__)

@application.route('/')
def hello_world():
    return 'Hello, World!'

The gunicorn command is this one:

gunicorn -b 0.0.0.0:5000 --log-level=debug hello

And this is the console output:

[2018-06-05 14:56:21 +0200] [11229] [INFO] Starting gunicorn 19.8.1
[2018-06-05 14:56:21 +0200] [11229] [DEBUG] Arbiter booted
[2018-06-05 14:56:21 +0200] [11229] [INFO] Listening at: http://0.0.0.0:5000 (11229)
[2018-06-05 14:56:21 +0200] [11229] [INFO] Using worker: sync
[2018-06-05 14:56:21 +0200] [11232] [INFO] Booting worker with pid: 11232
[2018-06-05 14:56:21 +0200] [11229] [DEBUG] 1 workers
[2018-06-05 14:56:32 +0200] [11232] [DEBUG] GET /
[2018-06-05 14:56:57 +0200] [11232] [DEBUG] Closing connection. 
[2018-06-05 14:57:16 +0200] [11232] [DEBUG] GET /
[2018-06-05 14:57:47 +0200] [11229] [CRITICAL] WORKER TIMEOUT (pid:11232)
[2018-06-05 14:57:47 +0200] [11232] [INFO] Worker exiting (pid: 11232)
[2018-06-05 14:57:47 +0200] [11324] [INFO] Booting worker with pid: 11324

Can you please clearly explain why I get this error and whether it is expected in this example? How do I fix it, or, if this is expected behavior, why is it logged as a critical error?

sandeepsign commented 4 years ago

--timeout=5

This is the most common cause of this issue.

lc-lingliang commented 4 years ago

I hope my solutions can help you. I ran into this critical worker timeout problem a few days ago, tried a few solutions, and it now works well.

Here is my understanding, and my solutions:

  1. Try preload in gunicorn

Gunicorn fails to boot the workers when the application needs more time than the timeout allows to load its packages (such as a TensorFlow backend) and start the service. So when you are experiencing slow app boot times, try enabling the preload option in gunicorn (see https://devcenter.heroku.com/articles/python-gunicorn#advanced-configuration).

gunicorn hello:app --preload

  2. Try to increase the timeout for gunicorn

The default timeout is 30s. If your application really needs more time than that to finish an API call, increase the timeout.

gunicorn hello:app --timeout 60

However, from my perspective, it doesn't make sense for an API call to need more than a minute to finish. If it does, try to optimize your code instead. (A config-file sketch covering the first two options follows after this list.)

  3. If you are using k8s, you can also set a timeoutSeconds for your container/image in the YAML.
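For reference, here is a minimal gunicorn.conf.py sketch combining the first two options. This is a hypothetical example, not something from this thread; the module name hello and every value are placeholders to adjust for your own app.

# gunicorn.conf.py -- hypothetical sketch; run with: gunicorn -c gunicorn.conf.py hello:app
preload_app = True   # import the application once in the master before forking workers
timeout = 60         # seconds a worker may stay silent before the arbiter kills it
workers = 2          # more than one sync worker also helps with queued requests

A config file keeps these settings next to the code instead of spread across shell commands or Procfiles.
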
rubyfin commented 4 years ago

I faced the same issue today. In my case the API was taking about a minute to calculate data and return it to the client, which resulted in CRITICAL WORKER TIMEOUT errors. I solved it by increasing gunicorn's timeout flag to more than a minute; it worked and the issue did not come back. Hope this helps. I am using uvicorn.workers.UvicornWorker.

alpinechicken commented 4 years ago

I fixed this by adding extra workers to gunicorn:

web: gunicorn --workers=3 BlocAPI:app --log-file -

No idea why.

bobf commented 4 years ago

Maybe you had a deadlock? Does your app make requests to itself?


alpinechicken commented 4 years ago

Yep one route calls another - is that bad?

bobf commented 4 years ago

It means that you need at least two workers, otherwise your server will deadlock. The first request will wait until the server responds to the second request (which is queued behind it).

You get one concurrent request per worker.
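To make that concrete, here is a hypothetical minimal app (not the poster's code) that deadlocks itself when run with a single sync worker; it assumes the requests library is installed and that gunicorn is bound to 127.0.0.1:8000:

# app.py -- run with: gunicorn --workers=1 -b 127.0.0.1:8000 app:app
import requests
from flask import Flask

app = Flask(__name__)

@app.route('/inner')
def inner():
    return 'inner'

@app.route('/outer')
def outer():
    # With one sync worker this nested request is queued behind /outer,
    # which is still busy waiting for it, so the worker hits WORKER TIMEOUT.
    return requests.get('http://127.0.0.1:8000/inner', timeout=60).text

With --workers=2 (or a threaded/async worker) the nested request can be picked up by another worker and /outer completes normally.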


alpinechicken commented 4 years ago

Ah that makes sense. Thanks!


sambit9238 commented 4 years ago


I was able to resolve this issue by matching the number of workers and the number of threads.

I had set workers = (2 * cpu_count) + 1 and did not set threads.

Once I changed threads = workers, everything started working fine. Posting just in case it helps someone.

This is how it looks now:

import multiprocessing

from gunicorn.app.wsgiapp import WSGIApplication

# `app` is the Flask application object defined elsewhere in this module.

def run(host='0.0.0.0', port=8080, workers=1 + (multiprocessing.cpu_count() * 2)):
    """Run the app with Gunicorn."""

    if app.debug:
        app.run(host, int(port), use_reloader=False)
    else:
        gunicorn = WSGIApplication()
        gunicorn.load_wsgiapp = lambda: app
        gunicorn.cfg.set('bind', '%s:%s' % (host, port))
        gunicorn.cfg.set('workers', workers)
        gunicorn.cfg.set('threads', workers)
        gunicorn.cfg.set('pidfile', None)
        gunicorn.cfg.set('worker_class', 'sync')
        gunicorn.cfg.set('keepalive', 10)
        gunicorn.cfg.set('accesslog', '-')
        gunicorn.cfg.set('errorlog', '-')
        gunicorn.cfg.set('reload', True)
        gunicorn.chdir()
        gunicorn.run()

As per the gunicorn docs, it changes the worker class from sync to gthread if more than one thread is specified: "If you try to use the sync worker type and set the threads setting to more than 1, the gthread worker type will be used instead."

CharlesHehe commented 4 years ago

My case:

Environment: Ubuntu 18.04 + gunicorn + nginx + Flask

I ran pip install gunicorn[gevent] in my virtual environment.

Changed gunicorn -b localhost:8000 -w 4 web:app to gunicorn -b localhost:8000 -k gevent web:app.

It works.

tilgovi commented 4 years ago

Thank you to everyone here who has done so much to help one another resolve their issues. Please continue to post to this issue if it seems appropriate.

However, I am closing this issue because I don't think there is any bug in Gunicorn here and I don't think there is any action to take, although I will happily help review PRs that try to add documentation for this somehow or improve log messages.

Please do not misunderstand my intention. If you suspect a bug in Gunicorn and want to continue discussing, please do so. Preferably, open a new ticket with an example application that reproduces your issue. However, at this point, there are too many different problems, resolutions, and conversations in this issue for it to be very legible.

If you run Gunicorn without a buffering reverse proxy in front of it you will get timeouts with the default sync worker for any number of reasons. Common ones include slow or idle clients and application code that blocks the worker for longer than the timeout.

You can switch to asynchronous or threaded worker types, or you can put Gunicorn behind a buffering reverse proxy. If you know that your timeouts are due to your own code making slow calls to external APIs or doing significant work that you expect, you may increase the --timeout option.
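As one illustration of the threaded-worker route, here is a minimal gunicorn.conf.py sketch; the worker class and setting names are real Gunicorn options, but the numbers are placeholders rather than recommendations from this comment:

# gunicorn.conf.py -- threaded workers instead of the default sync worker
worker_class = 'gthread'   # each request is handled in its own thread
workers = 2
threads = 4                # 2 workers x 4 threads = 8 concurrent requests
timeout = 120              # only raise this for work you know is legitimately slow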

abkeble commented 4 years ago

It means that you need at least two workers, otherwise your server will deadlock. The first request will wait until the server responds to the second request (which is queued behind it). You get one concurrent request per worker.

Is this the case when calling the 'redirect' function as the return value for a route?

tilgovi commented 4 years ago

Is this the case when calling the 'redirect' function as the return value for a route?

No. A Flask redirect responds with an HTTP redirect and the worker is then free to accept a new request. The client makes another request when it sees this response, and whenever a worker is ready it will receive that request.

Shane-Neeley commented 4 years ago

I fixed this by adding extra workers to gunicorn:

web: gunicorn --workers=3 BlocAPI:app --log-file -

No idea why.

Is this related to @anilpai's comment earlier where he set workers=1 + (multiprocessing.cpu_count() * 2)?

Justice4Joffrey commented 4 years ago

I had a similar issue to this. It turns out I had an error in my entrypoint to the application. From debugging, it seemed that I was essentially launching the Flask development server from inside gunicorn, whose workers subsequently entered an infinite connection loop that timed out every 30s.

I'm sure that this doesn't affect all users above, but may well affect some.

In my module/wsgi.py file, which I'm running with gunicorn module.wsgi, I had:

application = my_create_app_function()
application.run(host="0.0.0.0")

Whereas I should have had:

application = my_create_app_function()
if __name__ == "__main__":
    application.run(host="0.0.0.0")

Essentially, you don't want to call application.run() when using gunicorn. Under gunicorn __name__ won't be "__main__", but it will be when you run the module directly, so you can still debug locally.

I couldn't find a reference to this in the gunicorn docs, but could imagine it being a common error case, so maybe some warning is necessary.

JurajMa commented 4 years ago

This is still occurring. Adding --preload to the Gunicorn call fixed the issue for me.

leonbrag commented 4 years ago

Is this bug still not fixed? I am observing this exact behavior.

Gunicorn starts like this in systemd:

[Service]
PIDFile = /run/gunicorn.pid
WorkingDirectory = /home/pi/pyTest
ExecStart=/usr/local/bin/gunicorn  app:app  -b 0.0.0.0:80 --pid /run/gunicorn.pid
RuntimeDirectory=/home/pi/pyTest
Restart=always
KillSignal=SIGQUIT
Type=notify
StandardError=syslog
NotifyAccess=all
User=root
Group=root
ExecReload = /bin/kill -s HUP $MAINPID
ExecStop = /bin/kill -s TERM $MAINPID
ExecStopPost = /bin/rm -rf /run/gunicorn
PrivateTmp = true

Worker process constantly times out and restarts:

Jul 10 15:19:20 raspberryVM gunicorn[10941]: [2020-07-10 15:19:20 -0700] [10941] [CRITICAL] WORKER TIMEOUT (pid:10944)
Jul 10 15:19:20 raspberryVM gunicorn[10941]: [2020-07-10 15:19:20 -0700] [10944] [INFO] Worker exiting (pid: 10944)
Jul 10 15:20:15 raspberryVM gunicorn[10941]: [2020-07-10 15:20:15 -0700] [10985] [INFO] Booting worker with pid: 10985

app.py is a trivial Flask app.

Is this issue closed as Won't Fix?

midhuntp commented 4 years ago

I was also having the same issue.

But after debugging I was able to find that when gunicorn starts the Django app, one of its dependencies (in my case an external DB connection) was taking longer than expected, which caused the gunicorn worker to time out.

When I resolved the connection issue, the timeout issue was resolved as well.

leonbrag commented 4 years ago

That would not be my case. I tested with a "Hello, World" type of app with no dependencies. So I am still puzzled by this, but it seems it's not possible to run Gunicorn with a long-running thread. The worker process restarts and therefore kills the long-running thread.

asnisarenko commented 4 years ago

@leonbrag This is likely NOT a gunicorn bug. See my comment above in the thread. It's a side effect of browsers opening empty "predicted" TCP connections, combined with running gunicorn with only a few sync workers and no protection against such empty connections.

leonbrag commented 4 years ago

Is there a reference architecture/design that shows a proper way to set up a Gunicorn Flask app with a long-lived (permanent) worker thread?

If this is not a bug, then it seems to be an artifact or a limitation of the Gunicorn architecture/design.

Why wouldn't a sync worker run forever and accept client connections? Such a worker would close sockets as needed, yet continue to run without exiting (and therefore its worker thread would continue to run).

asnisarenko commented 4 years ago

@leonbrag You should be more specific about what problem you are trying to solve.

The problem discussed in this thread happens in dev environments, and the easiest solution is either to add more sync workers or to use threaded workers.

If you want to avoid this issue in a production setup, you can use gevent workers, or you can put nginx in front of gunicorn. Some PaaS providers already put nginx in front of your Docker container, so you don't have to worry about it. Again, the solution depends on the context and the details.

This is a good read: https://www.brianstorti.com/the-role-of-a-reverse-proxy-to-protect-your-application-against-slow-clients/
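As a sketch of the gevent option mentioned above (hypothetical values; assumes gevent support is installed, e.g. pip install gunicorn[gevent]):

# gunicorn.conf.py -- cooperative workers that tolerate slow or idle client connections
worker_class = 'gevent'
workers = 3
worker_connections = 1000   # simultaneous connections each gevent worker may hold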

benoitc commented 4 years ago

You can check the design page of the documentation. Async workers are one way to run long tasks.


AlejandroRodriguezP commented 4 years ago

web: gunicorn --workers=3 app:app --timeout 200 --log-file -

I fixed my problem by increasing the --timeout.

ivictbor commented 3 years ago

See also #1388 for Docker related tmpfs issues.

Oh, thanks a lot Randall, I forgot to add --worker-tmp-dir /dev/shm to gunicorn arguments when I was running gunicorn in Docker.

BTW, will 64 MB be enough for the gunicorn cache?

attajutt commented 3 years ago

gunicorn app:app --timeout 1000 or gunicorn app:app --preload

Worked for me... I prefer the timeout option.

ivictbor commented 3 years ago

Strange, I added --worker-tmp-dir /dev/shm but still receiving:

[2020-11-27 21:01:42 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:17)

To make sure /dev/shm is RAM-backed, I benchmarked it:

[image: /dev/shm benchmark results]

The parameters are as follows:

    command: /bin/bash -c "cd /code/ && pipenv run gunicorn --worker-tmp-dir /dev/shm conf.wsgi:application --bind 0.0.0.0:8022 --workers 5 --worker-connections=1000"

PS: I am using PyPy

@attajutt a longer timeout is nice, but you are risking that the gunicorn master process will only detect a hang in your worker process after 1000 seconds, and you will miss a lot of requests. It will also be hard to notice if only one of several workers hangs. I would not use 1000, at least.

attajutt commented 3 years ago

@ivictbor thanks for letting me know. 1000 is just for reference. Nevertheless, I got the app rolling; once it's loaded, it runs perfectly fine.

Subrata15 commented 3 years ago

I got this error too, and after several attempts I found that the problem is probably caused by:

  1. Nginx configuration
  2. Gunicorn/Uwsgi

If you deploy your app in a cloud like GAE, it will not surface any hint of the error. You can try to surface the error using the approach from this case: https://stackoverflow.com/questions/38012797/google-app-engine-502-bad-gateway-with-nodejs

If a 502 Bad Gateway is raised, there are probably two possibilities:

  1. gunicorn isn't running
  2. gunicorn timed out

A complete solution is explained here: https://www.datadoghq.com/blog/nginx-502-bad-gateway-errors-gunicorn/

Hope that helps anyone getting the [CRITICAL] WORKER TIMEOUT error.

coltonbh commented 3 years ago

Adding another possibility for those who find this thread...

This can also be caused by Docker-imposed resource constraints that are too low for your web application. For example, I had the following constraints:

services:
  web_app:
    image: blah-blah
    deploy:
      resources:
        limits:
          cpus: "0.25"
          memory: 128M

and these were evidently too low for gunicorn so I constantly got the [CRITICAL] WORKER TIMEOUT error until I removed the constraints.

benoitc commented 3 years ago

For gunicorn itself these resources are perfectly fine, but you do need to plan for the number of workers and the resources your application needs. 128M and 0.25 CPU seem really low for a web application written in Python... generally speaking you need at least 1 core/vCPU and 512MB of RAM as a bare minimum.


dplutcho commented 3 years ago

--timeout=1000 worked for me. The issue was a GCP machine with low CPU resources. It worked fine on my local machine with the default timeout.

ghost commented 3 years ago

gunicorn app:app --timeout 1000

You're great. That was the solution for me. Thanks very much.

ashwath007 commented 8 months ago

gunicorn app:app --timeout 3000 Worked for me ✌️