benoitc / gunicorn

gunicorn 'Green Unicorn' is a WSGI HTTP Server for UNIX, fast clients and sleepy applications.
http://www.gunicorn.org

gunicorn worker unexpected time out when streaming #1186

Closed atarkowska closed 4 years ago

atarkowska commented 8 years ago

I have a Django app using StreamingHttpResponse that fails when the gunicorn worker unexpectedly times out. For testing purposes I deployed it on CentOS 7 behind nginx (Django-1.8.8, gevent-1.0.2, greenlet-0.4.9) with gevent async workers (proxy_buffering off):

gunicorn -D -p /path/to/django.pid --bind 127.0.0.1:8000 --workers 2 -k gevent --worker-connections 100 myapp.wsgi:application
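
The failing view is essentially a chunked file download; a minimal sketch of the StreamingHttpResponse pattern (hypothetical names and paths, not the actual OMERO.web code):

import os
from django.http import StreamingHttpResponse

def download_archived_file(request, image_id):
    # Hypothetical sketch of the streaming pattern involved, not the real OMERO.web view.
    path = '/data/archived/%s.tif' % image_id  # placeholder lookup

    def file_chunks(chunk_size=1024 * 1024):
        # Yield the file piece by piece so the whole download never sits in memory.
        with open(path, 'rb') as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                yield chunk

    response = StreamingHttpResponse(file_chunks(), content_type='application/octet-stream')
    response['Content-Length'] = os.path.getsize(path)
    response['Content-Disposition'] = 'attachment; filename="%s.tif"' % image_id
    return response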

gunicorn log:

[2016-01-21 15:12:34 +0000] [30333] [INFO] Listening at: ...
[2016-01-21 15:12:34 +0000] [30333] [INFO] Using worker: gevent
...
[2016-01-21 15:14:22 +0000] [30338] [DEBUG] GET /webclient/keepalive_ping/
[2016-01-21 15:14:22 +0000] [30338] [DEBUG] Closing connection. 
[2016-01-21 15:14:24 +0000] [30343] [DEBUG] GET /webgateway/archived_files/download/1/
...
[2016-01-21 15:15:43 +0000] [30333] [CRITICAL] WORKER TIMEOUT (pid:30343)
[2016-01-21 15:15:43 +0000] [30343] [DEBUG] Closing connection. 
[2016-01-21 15:15:43 +0000] [30343] [INFO] Worker exiting (pid: 30343)
[2016-01-21 15:15:43 +0000] [31203] [INFO] Booting worker with pid: 31203

nginx log

2016/01/21 15:15:43 [error] 23160#0: *10849 upstream prematurely closed connection while reading upstream...

There are no other errors

benoitc commented 8 years ago

how can I reproduce it?

atarkowska commented 8 years ago

I am deploying OMERO.web from https://www.openmicroscopy.org/site/support/omero5.2/sysadmins/unix/server-centos7.html; this will start the server and the web application.

Instead of going to the Running OMERO section, try the following:

$ cd /home/omero/OMERO.server
$ virtualenv omero-virtualenv --system-site-packages
$ source /home/omero/omero-virtualenv/bin/activate 
(omero-virtualenv) $ pip install --upgrade -r share/web/requirements-py27-nginx.txt 
(omero-virtualenv) $ pip install gevent
(omero-virtualenv) $ bin/omero admin start

# add public user
(omero-virtualenv) $ bin/omero login -u root -w ROOTPASSWORD
(omero-virtualenv) $ bin/omero group add test --perms='rwr---'  
Server: [localhost:4064]
Created session 2013d8ef-7d67-4206-939e-1c5a4e7d1e5c (root@localhost:4064). Idle timeout: 10 min. Current group: system
Added group 'test' (id=53) with permissions rwr---   # <-- GROUP_ID 53
(omero-virtualenv) $ bin/omero user add public Public User GROUP_ID --userpassword secret
(omero-virtualenv) $ bin/omero logout

#test if public can login
(omero-virtualenv) $ bin/omero login -u public -w secret
Server: [localhost:4064]
Created session 568b0065-a4f6-4f2d-bf3e-4b3089775474 (public@localhost:4064). Idle timeout: 10 min. Current group: test
(omero-virtualenv) $ bin/omero logout

(omero-virtualenv) $ bin/omero config set omero.web.public.enabled True
(omero-virtualenv) $ bin/omero config set omero.web.public.password secret
(omero-virtualenv) $ bin/omero config set omero.web.public.server_id 1
(omero-virtualenv) $ bin/omero config set omero.web.public.user public

# start web
(omero-virtualenv) $ bin/omero web start --workers 2 --worker-connections 100 --wsgi-args ' --timeout 60 --worker-class gevent --log-level=DEBUG --error-logfile=/home/omero/OMERO.server/var/log/gunicorn.log '

if you wish to start gunicorn directly:

(omero-virtualenv) $ export PYTHONPATH=/home/omero/OMERO.server/lib/python:/home/omero/OMERO.server/lib/python/omeroweb:$PYTHONPATH
(omero-virtualenv) $ gunicorn -D -p /home/omero/OMERO.server/var/django.pid --bind 127.0.0.1:4080 --workers 5 --worker-connections 200 --max-requests 0 --timeout 30 --worker-class gevent --log-level=DEBUG --error-logfile=/home/omero/OMERO.server/var/log/gunicorn.log omeroweb.wsgi:application
$ ps aux | grep gunicorn
omero     26410  0.0  0.0 122280 12172 ?        S    14:48   0:00 /home/omero/omero-virtualenv/bin/python /home/omero/omero-virtualenv/bin/gunicorn -D -p /home/omero/OMERO.server/var/django.pid --bind 127.0.0.1:4080 --workers 5 --worker-connections 200 --max-requests 0 --timeout 60 --worker-class gevent --log-level=DEBUG --error-logfile=/home/omero/OMERO.server/var/log/gunicorn.log omeroweb.wsgi:application
... and 5 more

Then you would need to import a few larger image files, see http://www.openmicroscopy.org/site/support/omero5.2/users/cli/import.html#command-line-importer:

./importer-cli -s localhost -u public -w secret image.tif (I tested with 3GB files)

You can browse through the files at http://host/webclient/ (once you click on an image, the right-hand panel shows a download icon); this will trigger http://host/webgateway/archived_files/download/IMAGE_ID/

Let me know if you need any help with the deployment or the UI. I hope it won't be painful :-)

My impression is that this is exactly the same problem as https://github.com/benoitc/gunicorn/issues/588

atarkowska commented 8 years ago

My impression was that asynchronous workers have the benefit of using Connection: keep-alive when serving requests, and that a request should be able to be served indefinitely.

tilgovi commented 8 years ago

The timeout happens after the response is successfully sent?

atarkowska commented 8 years ago

Yes, the download bar is moving forward in the web browser.

atarkowska commented 8 years ago

Just to clarify, I have been deploying the same code with FastCGI workers without any issues. I am not really sure how to debug this, as there are no errors anywhere.

benoitc commented 8 years ago

@aleksandra-tarkowska I will try your example. So far it seems there is a blocking operation somewhere that prevents the gunicorn arbiter from being notified. Will post an update on Monday.

atarkowska commented 8 years ago

@benoitc I have been playing with this over the weekend and it turns out that I can only reproduce it at download speeds above 20-30 MB/s when downloading 3-4 GB files. Is that a coincidence?

To simplify testing I am only using 2 workers and 100 worker connections.

omero       281  2.0  0.0 126380 12212 ?        S    11:12   0:00 /usr/bin/python /usr/bin/gunicorn -D -p /tmp/django.pid --bind 127.0.0.1:4080 --workers 2 --worker-connections 100 --max-requests 0 --timeout 60 --worker-class gevent --log-level=DEBUG --error-logfile=/tmp/gunicorn.log omeroweb.wsgi:application

I have been trying to use strace but I don't see much there. It looks like both workers are working very hard and then they are just killed.

10:30:11 sendto(22, "\177[\213\177J\246epe\212]\253qo\212\206\225s\240\221W\251``\313\254\212jN\276r\203"..., 2621440, 0, NULL, 0) = 2621440
10:30:11 sendto(50, "IceP\1\0\1\0\0\0\373\0\0\0\214\0\0\0:46e571cc-8c4a"..., 251, 0, NULL, 0) = 251
10:30:11 sendto(22, "\212\253HQ\233\215ejf~l\230\225\200\205\201\262\232\216\237\230\260\223m\201z\255\222\301l~\220"..., 2621440, 0, NULL, 0) = 2621440
10:30:11 sendto(50, "IceP\1\0\1\0\0\0\373\0\0\0\215\0\0\0:46e571cc-8c4a"..., 251, 0, NULL, 0) = 251

The replacement worker only boots almost a minute later:

[2016-01-25 10:28:56 +0000] [18299] [DEBUG] GET /webgateway/archived_files/download/30652/
[2016-01-25 10:28:58 +0000] [18299] [DEBUG] GET /webgateway/archived_files/download/30654/
[2016-01-25 10:28:58 +0000] [32685] [DEBUG] GET /webgateway/archived_files/download/30653/
[2016-01-25 10:28:59 +0000] [18299] [DEBUG] GET /webgateway/archived_files/download/30655/
[2016-01-25 10:29:00 +0000] [18299] [DEBUG] GET /webgateway/archived_files/download/30656/
[2016-01-25 10:29:45 +0000] [32685] [DEBUG] GET /webclient/keepalive_ping/
[2016-01-25 10:29:45 +0000] [32685] [DEBUG] Closing connection. 
[2016-01-25 10:30:11 +0000] [18291] [CRITICAL] WORKER TIMEOUT (pid:18299)
[2016-01-25 10:30:11 +0000] [18299] [DEBUG] Closing connection. 
[2016-01-25 10:30:12 +0000] [32685] [DEBUG] GET /webgateway/archived_files/download/30655/
[2016-01-25 10:30:55 +0000] [1769] [INFO] Booting worker with pid: 1769

Can I also ask if there is any way to prevent an async worker from being killed while it is handling multiple requests? Basically, if a download is in progress and another request handled by the same worker takes longer than the timeout, the whole worker is killed along with all of its requests (including the download). My feeling is that the download is not killed because of anything it does itself, but because of newly arriving requests, which is why it is easier to reproduce at high download speeds. If a worker processes multiple requests at the same time and one of them times out (because some resource is slow), the entire worker is killed, even though the other requests are alive and pinging the arbiter.
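
One blunt workaround (an assumption here, not something the maintainers suggested in this thread) would be to raise, or disable, the arbiter timeout in a gunicorn config file passed with -c; a sketch with placeholder values:

# gunicorn_conf.py (hypothetical name)
bind = '127.0.0.1:4080'
worker_class = 'gevent'
workers = 2
worker_connections = 100
# Seconds a worker may go without notifying the arbiter before it is killed;
# 0 disables the check entirely.
timeout = 600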

sorry, I updated my post

benoitc commented 8 years ago

An async worker is able to handle any number of connections. But if an operation takes more than 30 seconds and blocks the gevent scheduler, it prevents the worker from notifying the arbiter that it is still alive, which I guess is the issue here.
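
To illustrate (a minimal sketch assuming standard gevent behaviour, not code from this app): the worker's heartbeat runs in its own greenlet, so anything that keeps the hub from running also keeps the arbiter from being notified.

import gevent

def blocks_the_hub():
    # CPU-bound work (or a C-level call gevent cannot monkey-patch) monopolizes
    # the hub: no other greenlet runs, the worker cannot notify the arbiter,
    # and it is reaped once the timeout elapses.
    total = 0
    for i in range(10 ** 8):
        total += i
    return total

def yields_to_the_hub():
    # Cooperative code gives the hub a turn between units of work, so the
    # heartbeat greenlet keeps pinging the arbiter.
    total = 0
    for i in range(10 ** 8):
        total += i
        if i % 10 ** 6 == 0:
            gevent.sleep(0)
    return total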

I will try the example this afternoon and have a closer look at the logs anyway.

atarkowska commented 8 years ago

@benoitc after yet more testing, it looks like reducing the ICE message size from 2.5 MB to 1 MB may help on a higher-speed network. When the network is slow, the network itself is the limiting factor; on a fast network, the gateway becomes the limiting factor. Could you point me to which buffer is responsible for that?

Note: proxy_buffering is off

tilgovi commented 8 years ago

Can you try with --no-sendfile, please?

I'm wondering if we're not setting the socket back to non-blocking for the syscall and maybe we should when we patch the sendfile call to retry on partial sends.

Maybe large sendfile is locking the worker when the network is fast enough.
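
For reference, a rough sketch of passing that flag (it can also be set as sendfile = False in a config file):

gunicorn --no-sendfile -D --bind 127.0.0.1:4080 --workers 2 --worker-connections 100 --timeout 60 --worker-class gevent omeroweb.wsgi:application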

atarkowska commented 8 years ago

@tilgovi thanks for the suggestion. I am afraid there is no difference.

85% [==============================================>         ] 2,634,547,200 57.5MB/s   in 51s    
2016-01-29 08:07:46 (49.5 MB/s) - Connection closed at byte 2634547200. Retrying.
[2016-01-29 08:07:46 +0000] [19686] [CRITICAL] WORKER TIMEOUT (pid:19691)
[2016-01-29 08:07:46 +0000] [19691] [DEBUG] Closing connection. 
...
[2016-01-29 08:08:17 +0000] [19691] [WARNING] Worker graceful timeout (pid:19691)
[2016-01-29 08:08:18 +0000] [19691] [DEBUG] Closing connection. 
[2016-01-29 08:08:18 +0000] [19691] [DEBUG] Closing connection. 
[2016-01-29 08:08:18 +0000] [19691] [INFO] Worker exiting (pid: 19691)
[2016-01-29 08:08:18 +0000] [21541] [INFO] Booting worker with pid: 21541

atarkowska commented 8 years ago

On the other hand, how should workers and worker processes be calculated when dealing with streaming requests? I see that a worker spawns 4 processes for each download request. Is that configurable somehow as well?

benoitc commented 8 years ago

What? Spawning 4 processes? What do you mean? A worker doesn't spawn any processes; only the arbiter does, within the limit given via the command line or the config.

atarkowska commented 8 years ago

Sorry, I meant threads:

  19982 `- python gunicorn -D -p /var/django.pid --bind 127.0.0.1
  19989 |  `- python gunicorn -D -p /var/django.pid --bind 127.0.0.1...
  20065 |  |  `- python gunicorn -D -p /var/django.pid --bind 127.0.0.1...
  20064 |  |  `- python gunicorn -D -p /var/django.pid --bind 127.0.0.1...
  20063 |  |  `- python gunicorn -D -p /var/django.pid --bind 127.0.0.1...
  20062 |  |  `- python gunicorn -D -p /var/django.pid --bind 127.0.0.1...
  19988 |  `- python gunicorn -D -p /var/django.pid --bind 127.0.0.1...
  20069 |     `- python gunicorn -D -p /var/django.pid --bind 127.0.0.1...
  20068 |     `- python gunicorn -D -p /var/django.pid --bind 127.0.0.1...
  20067 |     `- python gunicorn -D -p /var/django.pid --bind 127.0.0.1...
  20066 |     `- python gunicorn -D -p /var/django.pid --bind 127.0.0.1...

benoitc commented 8 years ago

Well, we don't use threads either. I guess gevent is doing something behind the scenes. I will check when I'm back from FOSDEM.

jamadden commented 8 years ago

Gevent spawns threads for its thread pool, which can be used explicitly and which is used automatically by default for DNS lookups. DNS lookups are the most common reason to see unexpected threads.
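
To make that concrete, a small sketch (version-dependent details hedged) showing where the threads come from and how the resolver can be switched so they do not appear:

import gevent
from gevent import socket

hub = gevent.get_hub()
print(hub.resolver)          # the default resolver delegates getaddrinfo() to a thread pool
print(hub.threadpool.size)   # pool threads started so far

# Any DNS lookup goes through the resolver, and hence (by default) the thread pool:
socket.getaddrinfo('localhost', 80)
print(hub.threadpool.size)   # may now be greater than zero

# Selecting a non-threaded resolver avoids the pool entirely, e.g.:
#   GEVENT_RESOLVER=ares gunicorn -k gevent myapp.wsgi:application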

atarkowska commented 8 years ago

I don't see it explained in the docs how --max-requests works with async workers. Will the worker restart automatically? I am not setting it, but could it potentially kill all threads?

tilgovi commented 8 years ago

Worker should restart after --max-requests requests.
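
As a sketch of the relevant settings in a Python config file passed with -c (the default of 0 means the automatic restart never happens):

max_requests = 1000          # recycle a worker after it has handled this many requests
max_requests_jitter = 50     # stagger the restarts so all workers don't recycle at once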

niubell commented 8 years ago

I have a Flask app deployed with gunicorn and supervisor and I got errors like this:

26120 [2016-04-05 23:56:38 +0000] [1914] [INFO] Using worker: sync
26121 [2016-04-05 23:56:38 +0000] [1919] [INFO] Booting worker with pid: 1919
26122 [2016-04-05 23:56:38 +0000] [1920] [INFO] Booting worker with pid: 1920
26123 [2016-04-05 23:56:38 +0000] [1925] [INFO] Booting worker with pid: 1925
26124 [2016-04-05 23:56:38 +0000] [1926] [INFO] Booting worker with pid: 1926
26125 [2016-04-05 23:56:38 +0000] [1929] [INFO] Booting worker with pid: 1929
26126 [2016-04-05 23:56:38 +0000] [1914] [DEBUG] 5 workers
... here are HTTP requests, some of which take several seconds
26158 [2016-04-06 00:16:39 +0000] [1914] [CRITICAL] WORKER TIMEOUT (pid:1920)
26159 [2016-04-06 00:16:39 +0000] [1920] [INFO] Worker exiting (pid: 1920)
26160 [2016-04-06 00:16:39 +0000] [1914] [CRITICAL] WORKER TIMEOUT (pid:1919)
26161 [2016-04-06 00:16:39 +0000] [1919] [INFO] Worker exiting (pid: 1919)
26162 [2016-04-06 00:16:39 +0000] [3546] [INFO] Booting worker with pid: 3546
26163 [2016-04-06 00:16:39 +0000] [3547] [INFO] Booting worker with pid: 3547
26164 [2016-04-06 00:16:40 +0000] [1914] [CRITICAL] WORKER TIMEOUT (pid:1926)
26165 [2016-04-06 00:16:40 +0000] [1926] [INFO] Worker exiting (pid: 1926)
26166 [2016-04-06 00:16:40 +0000] [3554] [INFO] Booting worker with pid: 3554
... here are HTTP requests, some of which take several seconds
26172 [2016-04-06 00:24:20 +0000] [1914] [INFO] Handling signal: term
26173 [2016-04-06 00:24:39 +0000] [4536] [DEBUG] Current configuration:
... here is the configuration information
[2016-04-06 00:24:39 +0000] [4536] [INFO] Starting gunicorn 19.4.5
[2016-04-06 00:24:39 +0000] [4536] [ERROR] Connection in use: ('0.0.0.0', 38008)
[2016-04-06 00:24:39 +0000] [4536] [ERROR] Retrying in 1 second.
[2016-04-06 00:24:40 +0000] [4536] [ERROR] Connection in use: ('0.0.0.0', 38008)

This lasts for 10 minutes until we manually kill the processes and restart gunicorn with supervisorctl; during this period we cannot handle any HTTP requests.

the gunicorn configuration:

import multiprocessing

bind = '0.0.0.0:38008'
backlog = 2048
workers = multiprocessing.cpu_count() * 2 + 1
worker_class = 'sync'
timeout = 10
loglevel = 'debug'
reload = False

the supervisorctl configuration:

[program:server_name]
command=gunicorn -c op/server.conf server:app
autostart=true
autorestart=true
directory=/data/server/
redirect_stderr=true

the software version:

gunicorn: 19.4.5
Flask: 0.10.1

atarkowska commented 8 years ago

@bigpyer isn't that expected with sync workers? The request just takes longer than the timeout. The problem described in this issue is about gevent async workers, which shouldn't time out.

benoitc commented 8 years ago

@bigpyer how long does a request take? How can we reproduce it?

RyPeck commented 8 years ago

@bigpyer is there any phase in the request handling where it is performing a blocking operation (the worker would be "silent") that could be taking more than 10 seconds?

niubell commented 8 years ago

@benoitc some take longer than 10 seconds. I can reproduce the WORKER TIMEOUT with time.sleep(10) in a test environment, but there the restart by supervisorctl works correctly, so I don't have a way to reproduce the original problem. If you need more logs or other information, please tell me. Many thanks!

niubell commented 8 years ago

@RyPeck yes, an HTTP request blocks it; some take longer than 10 seconds.

niubell commented 8 years ago

@aleksandra-tarkowska it is with "sync" workers; a worker that doesn't notify the master for more than 10 seconds will be killed.

RyPeck commented 8 years ago

@bigpyer that HTTP request is probably what is causing the worker to be killed by the CRITICAL TIMEOUT. You may want to set a timeout on the outgoing HTTP request or increase the timeout setting in your configuration.
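
For example (a sketch with a hypothetical URL, assuming the requests library), a per-request timeout keeps one slow upstream call from silently consuming the worker's whole gunicorn timeout budget:

import requests

try:
    # Fail fast instead of blocking the sync worker past gunicorn's --timeout.
    resp = requests.post('http://upstream.example/api', json={'q': 1}, timeout=5)
    resp.raise_for_status()
except requests.Timeout:
    # Handle the slow upstream explicitly (retry, return an error response, etc.).
    resp = None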

niubell commented 8 years ago

@RyPeck Yes, I have set a timeout on the outgoing requests POST and I handle the timeout exception normally.

benoitc commented 8 years ago

Hrm, what's the status of this? Is it still an issue with the latest release? Anything that could help to reproduce it?

nihao commented 7 years ago

from flask import Flask, Response

app = Flask(__name__)

@app.route('/large.csv')
def generate_large_csv():
    # Stream a very large CSV so the response keeps transferring for a long time.
    def generate():
        for row in range(5000000):
            line = []
            for col in range(500):
                line.append(str(col))
            yield ','.join(line) + '\n'
    return Response(generate(), mimetype='text/csv')

@benoitc maybe you can use this code to reproduce the issue.

It's not reasonable for gunicorn to time out the connection while the response is still transferring.
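
A possible way to drive that snippet (assuming it is saved as app.py and gunicorn's default bind address of 127.0.0.1:8000):

# in one shell
gunicorn --timeout 30 app:app
# in another shell; with the default sync worker the transfer is cut off once the worker times out
curl -s -o /dev/null http://127.0.0.1:8000/large.csv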

tilgovi commented 7 years ago

@season89 maybe you can open a new issue. This issue starts with a discussion about problems possibly due to gevent and using gevent workers, not sync workers.

hscspring commented 6 years ago

I hit the same issue when loading large word vectors (about 2.5 GB).

[2018-07-06 15:17:18 +0800] [2820] [INFO] Starting gunicorn 19.7.1
[2018-07-06 15:17:18 +0800] [2820] [INFO] Listening at: http://127.0.0.1:4477 (2820)
[2018-07-06 15:17:18 +0800] [2820] [INFO] Using worker: sync
[2018-07-06 15:17:18 +0800] [2838] [INFO] Booting worker with pid: 2838
[2018-07-06 15:17:18 +0800] [2845] [INFO] Booting worker with pid: 2845
Begin to load vocabularies...
Begin to load vocabularies...
Vocabularies have been loaded...
Vocabularies time:  2.74869704246521
Begin to load WORD2VEC...
Vocabularies have been loaded...
Vocabularies time:  2.758488893508911
Begin to load WORD2VEC...
[2018-07-06 15:17:48 +0800] [2820] [CRITICAL] WORKER TIMEOUT (pid:2838)
[2018-07-06 15:17:48 +0800] [2820] [CRITICAL] WORKER TIMEOUT (pid:2845)
[2018-07-06 15:17:48 +0800] [2845] [INFO] Worker exiting (pid: 2845)
[2018-07-06 15:17:48 +0800] [2838] [INFO] Worker exiting (pid: 2838)
[2018-07-06 15:17:48 +0800] [3309] [INFO] Booting worker with pid: 3309
[2018-07-06 15:17:49 +0800] [3317] [INFO] Booting worker with pid: 3317
Begin to load vocabularies...
Begin to load vocabularies...
Vocabularies have been loaded...
Vocabularies time:  2.5433778762817383
Begin to load WORD2VEC...
Vocabularies have been loaded...
Vocabularies time:  2.576946973800659
Begin to load WORD2VEC...

benoitc commented 6 years ago

It's not reasonable for gunicorn to time out the connection while the response is still transferring.

The sync worker is not expected to stay busy for a long time; if it were, it would prevent other connections from being accepted. By design, it times out if it is kept busy for too long. You will need to use an async worker to achieve this, or the threaded worker with a larger timeout. Another kind of worker will allow that soon as well, but it's not ready yet.
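
For example (a sketch; the worker and thread counts are placeholders to tune per application), the threaded worker with a larger timeout would look like:

gunicorn --worker-class gthread --workers 4 --threads 8 --timeout 120 myapp.wsgi:application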

kodeine commented 5 years ago

Any updates on this? I am facing the same issue.

Update: I've fixed the issue. It seems to have been a RAM issue: running more workers than the available RAM supports makes gunicorn lose connections, so adjust the number of workers and worker connections.

benoitc commented 4 years ago

Closing the discussion since there has been no activity for a while. Also, sendfile handling has been improved in 20.0.1. If you can still reproduce the issue, please open a new ticket :)

fabswt commented 7 months ago

Just a quick note that I was experiencing something similar, with gevent causing [WARNING] Worker graceful timeout in gunicorn. See https://github.com/abourget/gevent-socketio/issues/147 for a workaround.