Closed. atarkowska closed this issue 4 years ago.
how can I reproduce it?
I am deploying OMERO.web following https://www.openmicroscopy.org/site/support/omero5.2/sysadmins/unix/server-centos7.html; this starts both the server and the web application.
Instead of going to the Running OMERO section, try the following:
$ cd /home/omero/OMERO.server
$ virtualenv omero-virtualenv --system-site-packages
$ source /home/omero/omero-virtualenv/bin/activate
(omero-virtualenv) $ pip install --upgrade -r share/web/requirements-py27-nginx.txt
(omero-virtualenv) $ pip install gevent
(omero-virtualenv) $ bin/omero admin start
# add public user
(omero-virtualenv) $ bin/omero login -u root -w ROOTPASSWORD
(omero-virtualenv) $ bin/omero group add test --perms='rwr---'
Server: [localhost:4064]
Created session 2013d8ef-7d67-4206-939e-1c5a4e7d1e5c (root@localhost:4064). Idle timeout: 10 min. Current group: system
Added group 'test' (id=53) with permissions rwr--- # <-- GROUP_ID 53
(omero-virtualenv) $ bin/omero user add public Public User GROUP_ID --userpassword secret
(omero-virtualenv) $ bin/omero logout
#test if public can login
(omero-virtualenv) $ bin/omero login -u public -w secret
Server: [localhost:4064]
Created session 568b0065-a4f6-4f2d-bf3e-4b3089775474 (public@localhost:4064). Idle timeout: 10 min. Current group: test
(omero-virtualenv) $ bin/omero logout
(omero-virtualenv) $ bin/omero config set omero.web.public.enabled True
(omero-virtualenv) $ bin/omero config set omero.web.public.password secret
(omero-virtualenv) $ bin/omero config set omero.web.public.server_id 1
(omero-virtualenv) $ bin/omero config set omero.web.public.user public
# start web
(omero-virtualenv) $ bin/omero web start --workers 2 --worker-connections 100 --wsgi-args ' --timeout 60 --worker-class gevent --log-level=DEBUG --error-logfile=/home/omero/OMERO.server/var/log/gunicorn.log '
if you wish to start gunicorn directly:
(omero-virtualenv) $ export PYTHONPATH=/home/omero/OMERO.server/lib/python:/home/omero/OMERO.server/lib/python/omeroweb:$PYTHONPATH
(omero-virtualenv) $ gunicorn -D -p /home/omero/OMERO.server/var/django.pid --bind 127.0.0.1:4080 --workers 5 --worker-connections 200 --max-requests 0 --timeout 30 --worker-class gevent --log-level=DEBUG --error-logfile=/home/omero/OMERO.server/var/log/gunicorn.log omeroweb.wsgi:application
$ ps aux | grep gunicorn
omero 26410 0.0 0.0 122280 12172 ? S 14:48 0:00 /home/omero/omero-virtualenv/bin/python /home/omero/omero-virtualenv/bin/gunicorn -D -p /home/omero/OMERO.server/var/django.pid --bind 127.0.0.1:4080 --workers 5 --worker-connections 200 --max-requests 0 --timeout 60 --worker-class gevent --log-level=DEBUG --error-logfile=/home/omero/OMERO.server/var/log/gunicorn.log omeroweb.wsgi:application
... and 5 more
Then you would need to import a few larger image files (http://www.openmicroscopy.org/site/support/omero5.2/users/cli/import.html#command-line-importer):
./importer-cli -s localhost -u public -w secret image.tif
(I tested with 3GB files)
You can browse through the files at http://host/webclient/ (once you click on an image in the right-hand panel you will see a download icon).
This triggers http://host/webgateway/archived_files/download/IMAGE_ID/
Let me know if you need any help with the deployment or the UI. I hope that won't be painful :-)
My impression is that this is exactly the same problem as https://github.com/benoitc/gunicorn/issues/588
My impression was that asynchronous workers have the benefit of using Connection: keep-alive when serving requests, so a request should be able to be served indefinitely.
The timeout happens after the response is successfully sent?
yes, download bar is moving forward in the webbrowser
Just to clarify, I have been deploying the same code using fastcgi workers without any issues. I am not really sure how to debug this, as there are no errors anywhere.
@aleksandra-tarkowska I will try your example. So far it seems there is a blocking operation somewhere that prevents the gunicorn arbiter from being notified. Will post an update on Monday.
@benoitc I have been playing with this over the weekend and it turned out that I can only reproduce it at download speeds above 20-30MB/s when downloading 3-4GB files. Is that a coincidence?
To simplify testing I configured only 2 workers and 100 worker connections.
omero 281 2.0 0.0 126380 12212 ? S 11:12 0:00 /usr/bin/python /usr/bin/gunicorn -D -p /tmp/django.pid --bind 127.0.0.1:4080 --workers 2 --worker-connections 100 --max-requests 0 --timeout 60 --worker-class gevent --log-level=DEBUG --error-logfile=/tmp/gunicorn.log omeroweb.wsgi:application
I have been trying strace but I don't see much there. It looks like both workers are working very hard and then they are just killed.
10:30:11 sendto(22, "\177[\213\177J\246epe\212]\253qo\212\206\225s\240\221W\251``\313\254\212jN\276r\203"..., 2621440, 0, NULL, 0) = 2621440
10:30:11 sendto(50, "IceP\1\0\1\0\0\0\373\0\0\0\214\0\0\0:46e571cc-8c4a"..., 251, 0, NULL, 0) = 251
10:30:11 sendto(22, "\212\253HQ\233\215ejf~l\230\225\200\205\201\262\232\216\237\230\260\223m\201z\255\222\301l~\220"..., 2621440, 0, NULL, 0) = 2621440
10:30:11 sendto(50, "IceP\1\0\1\0\0\0\373\0\0\0\215\0\0\0:46e571cc-8c4a"..., 251, 0, NULL, 0) = 251
The worker then reboots almost a minute later:
[2016-01-25 10:28:56 +0000] [18299] [DEBUG] GET /webgateway/archived_files/download/30652/
[2016-01-25 10:28:58 +0000] [18299] [DEBUG] GET /webgateway/archived_files/download/30654/
[2016-01-25 10:28:58 +0000] [32685] [DEBUG] GET /webgateway/archived_files/download/30653/
[2016-01-25 10:28:59 +0000] [18299] [DEBUG] GET /webgateway/archived_files/download/30655/
[2016-01-25 10:29:00 +0000] [18299] [DEBUG] GET /webgateway/archived_files/download/30656/
[2016-01-25 10:29:45 +0000] [32685] [DEBUG] GET /webclient/keepalive_ping/
[2016-01-25 10:29:45 +0000] [32685] [DEBUG] Closing connection.
[2016-01-25 10:30:11 +0000] [18291] [CRITICAL] WORKER TIMEOUT (pid:18299)
[2016-01-25 10:30:11 +0000] [18299] [DEBUG] Closing connection.
[2016-01-25 10:30:12 +0000] [32685] [DEBUG] GET /webgateway/archived_files/download/30655/
[2016-01-25 10:30:55 +0000] [1769] [INFO] Booting worker with pid: 1769
Can I also ask if there is any way to prevent an async worker from being killed while it handles multiple requests? If a download is in progress and another request handled by the same worker takes longer than the timeout, the arbiter kills the worker and all of its requests (including the download). My feeling is that the download is not killed by itself but by newly arriving requests, which is why a high-speed download makes this easier to reproduce. If a worker processes multiple requests at the same time and one of them times out (because a resource is slow), the entire worker is killed, even though the other requests are alive and pinging the arbiter.
sorry, I updated my post
An async worker is able to handle any number of connections. If an operation takes more than 30 seconds and blocks the gevent scheduler, it prevents the worker from notifying the arbiter that it's still alive, which I guess is the issue here.
I will try the example this afternoon and have a closer look on the logs anyway.
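The failure mode described above can be sketched with stdlib asyncio, which is cooperatively scheduled the same way gevent is. The heartbeat task here is an illustrative stand-in for gunicorn's worker heartbeat, not its actual implementation:

```python
import asyncio
import time

async def heartbeat(beats, start):
    # Stand-in for the worker's periodic "I'm alive" notification to the arbiter.
    for _ in range(4):
        await asyncio.sleep(0.05)
        beats.append(time.monotonic() - start)

async def blocking_handler():
    # A blocking call starves the event loop, just as a blocking syscall
    # starves the gevent hub: no other task runs until it returns.
    time.sleep(0.3)

async def cooperative_handler():
    # A cooperative sleep yields control, like gevent's monkey-patched time.sleep.
    await asyncio.sleep(0.3)

def run(handler):
    beats = []
    async def main():
        start = time.monotonic()
        hb = asyncio.create_task(heartbeat(beats, start))
        await handler()
        await hb
    asyncio.run(main())
    return beats

blocked = run(blocking_handler)        # first heartbeat only after the block ends
cooperative = run(cooperative_handler) # heartbeats fire on schedule
```

In the blocking case every heartbeat is delayed past the 0.3s block; with a real 30s+ block and gunicorn's timeout, the arbiter concludes the worker is dead and kills it.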
@benoitc After yet more testing, it looks like reducing the Ice message size from 2.5MB to 1MB may help on a higher-speed network. When the network is slow, the network itself is the limiting factor; on a fast network, the gateway becomes the limit. I was wondering if you could point me to which buffer is responsible for that?
Note: proxy_buffering is off
Can you try with --no-sendfile, please?
I'm wondering if we're not setting the socket back to non-blocking for the syscall, and maybe we should when we patch the sendfile call to retry on partial sends.
Maybe a large sendfile is locking the worker when the network is fast enough.
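The hypothesis above can be illustrated with a small stdlib sketch. This is not gunicorn's code, just the general shape of a partial-send retry loop over a non-blocking socket, where each EAGAIN is a natural point for a cooperative worker to yield instead of sitting in one long syscall:

```python
import os
import socket
import tempfile
import threading

CHUNK = 64 * 1024

def sendfile_all(sock, fd, size):
    """Send `size` bytes of file `fd` over a non-blocking socket,
    retrying on partial sends and EAGAIN."""
    offset = 0
    while offset < size:
        try:
            sent = os.sendfile(sock.fileno(), fd, offset, min(CHUNK, size - offset))
            offset += sent
        except BlockingIOError:
            # Socket buffer full; a gevent worker would wait for
            # writability here (yielding to the hub) instead of spinning.
            pass
    return offset

# Demo: push 1 MiB through a socketpair while a reader thread drains it.
data = os.urandom(1024 * 1024)
with tempfile.TemporaryFile() as f:
    f.write(data)
    f.flush()
    a, b = socket.socketpair()
    a.setblocking(False)
    received = bytearray()

    def drain():
        while True:
            part = b.recv(CHUNK)
            if not part:
                break
            received.extend(part)

    t = threading.Thread(target=drain)
    t.start()
    total = sendfile_all(a, f.fileno(), len(data))
    a.close()
    t.join()
    b.close()
```

With a blocking socket and one huge count, the same sendfile call could instead sit in the kernel for the whole transfer, which is the "locking the worker" scenario suggested above.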
@tilgovi thanks for the suggestion. I am afraid there is no difference.
85% [==============================================> ] 2,634,547,200 57.5MB/s in 51s
2016-01-29 08:07:46 (49.5 MB/s) - Connection closed at byte 2634547200. Retrying.
[2016-01-29 08:07:46 +0000] [19686] [CRITICAL] WORKER TIMEOUT (pid:19691)
[2016-01-29 08:07:46 +0000] [19691] [DEBUG] Closing connection.
...
[2016-01-29 08:08:17 +0000] [19691] [WARNING] Worker graceful timeout (pid:19691)
[2016-01-29 08:08:18 +0000] [19691] [DEBUG] Closing connection.
[2016-01-29 08:08:18 +0000] [19691] [DEBUG] Closing connection.
[2016-01-29 08:08:18 +0000] [19691] [INFO] Worker exiting (pid: 19691)
[2016-01-29 08:08:18 +0000] [21541] [INFO] Booting worker with pid: 21541
On the other hand, how should one calculate workers and worker processes when dealing with streaming requests? I see that the worker spawns 4 processes for each download request. Is that somehow configurable as well?
What? Spawning 4 processes? How do you mean? A worker doesn't spawn any processes. Only the arbiter does, within the limit given via the command line or config.
Sorry, I meant threads:
19982 `- python gunicorn -D -p /var/django.pid --bind 127.0.0.1
19989 | `- python gunicorn -D -p /var/django.pid --bind 127.0.0.1...
20065 | | `- python gunicorn -D -p /var/django.pid --bind 127.0.0.1...
20064 | | `- python gunicorn -D -p /var/django.pid --bind 127.0.0.1...
20063 | | `- python gunicorn -D -p /var/django.pid --bind 127.0.0.1...
20062 | | `- python gunicorn -D -p /var/django.pid --bind 127.0.0.1...
19988 | `- python gunicorn -D -p /var/django.pid --bind 127.0.0.1...
20069 | `- python gunicorn -D -p /var/django.pid --bind 127.0.0.1...
20068 | `- python gunicorn -D -p /var/django.pid --bind 127.0.0.1...
20067 | `- python gunicorn -D -p /var/django.pid --bind 127.0.0.1...
20066 | `- python gunicorn -D -p /var/django.pid --bind 127.0.0.1...
Well, we don't use threads either. I guess gevent is doing something behind the scenes. I will check when I'm back from FOSDEM.
Gevent spawns threads for its thread pool, which can be used explicitly and which is used automatically by default for DNS lookups. DNS lookups are the most common reason to see unexpected threads.
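The effect is easy to observe with a stdlib stand-in for that pool. Name resolution is a blocking C call, which is why cooperative schedulers (gevent's hub, asyncio's default resolver) push it onto worker threads; gevent keeps a similar threadpool on its hub, and this sketch only shows the pattern in miniature:

```python
import socket
import threading
from concurrent.futures import ThreadPoolExecutor

# Count threads before and while blocking getaddrinfo() calls are
# offloaded to a pool -- the extra threads are what show up in `ps`.
before = threading.active_count()
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(socket.getaddrinfo, "localhost", 80) for _ in range(4)]
    results = [f.result() for f in futures]
    during = threading.active_count()  # the pool's threads are still alive here
```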
I don't see it explained in the docs how --max-requests works with async workers. Will the worker restart automatically? I am not setting it, but could it potentially kill all threads?
The worker should restart after --max-requests requests.
[2016-04-05 23:56:38 +0000] [1914] [INFO] Using worker: sync
[2016-04-05 23:56:38 +0000] [1919] [INFO] Booting worker with pid: 1919
[2016-04-05 23:56:38 +0000] [1920] [INFO] Booting worker with pid: 1920
[2016-04-05 23:56:38 +0000] [1925] [INFO] Booting worker with pid: 1925
[2016-04-05 23:56:38 +0000] [1926] [INFO] Booting worker with pid: 1926
[2016-04-05 23:56:38 +0000] [1929] [INFO] Booting worker with pid: 1929
[2016-04-05 23:56:38 +0000] [1914] [DEBUG] 5 workers
... HTTP requests here; some take several seconds ...
[2016-04-06 00:16:39 +0000] [1914] [CRITICAL] WORKER TIMEOUT (pid:1920)
[2016-04-06 00:16:39 +0000] [1920] [INFO] Worker exiting (pid: 1920)
[2016-04-06 00:16:39 +0000] [1914] [CRITICAL] WORKER TIMEOUT (pid:1919)
[2016-04-06 00:16:39 +0000] [1919] [INFO] Worker exiting (pid: 1919)
[2016-04-06 00:16:39 +0000] [3546] [INFO] Booting worker with pid: 3546
[2016-04-06 00:16:39 +0000] [3547] [INFO] Booting worker with pid: 3547
[2016-04-06 00:16:40 +0000] [1914] [CRITICAL] WORKER TIMEOUT (pid:1926)
[2016-04-06 00:16:40 +0000] [1926] [INFO] Worker exiting (pid: 1926)
[2016-04-06 00:16:40 +0000] [3554] [INFO] Booting worker with pid: 3554
... HTTP requests here; some take several seconds ...
[2016-04-06 00:24:20 +0000] [1914] [INFO] Handling signal: term
[2016-04-06 00:24:39 +0000] [4536] [DEBUG] Current configuration:
... (configuration information elided) ...
[2016-04-06 00:24:39 +0000] [4536] [INFO] Starting gunicorn 19.4.5
[2016-04-06 00:24:39 +0000] [4536] [ERROR] Connection in use: ('0.0.0.0', 38008)
[2016-04-06 00:24:39 +0000] [4536] [ERROR] Retrying in 1 second.
[2016-04-06 00:24:40 +0000] [4536] [ERROR] Connection in use: ('0.0.0.0', 38008)
This lasts for 10 minutes until we manually kill them and restart gunicorn via supervisorctl; during this period we can't handle any HTTP requests.
import multiprocessing

bind = '0.0.0.0:38008'
backlog = 2048
workers = multiprocessing.cpu_count() * 2 + 1
worker_class = 'sync'
timeout = 10
loglevel = 'debug'
reload = False
[program:server_name]
command=gunicorn -c op/server.conf server:app
autorstart=true
autorestart=true
directory=/data/server/
redirect_stderr=true
gunicorn: 19.4.5 Flask: 0.10.1
@bigpyer Is that not expected with sync workers? The request simply takes longer than the timeout. The problem described in this issue is about gevent async workers, which shouldn't time out.
@bigpyer How long does a request take? How can we reproduce it?
@bigpyer is there any phase in the request handling where it is performing a blocking operation (the worker would be "silent") that could be taking more than 10 seconds?
@benoitc Some take longer than 10 seconds. I reproduced it with time.sleep(10) in a test environment and got WORKER TIMEOUT, but the restart by supervisorctl worked correctly there, so I don't have a way to reproduce the stuck state. If more logs or other information would help, please tell me, many thanks!!
@RyPeck Yes, an HTTP request blocks it; some take longer than 10 seconds.
@aleksandra-tarkowska It is with "sync" workers; a worker that doesn't notify the master for more than 10 seconds will be killed.
@bigpyer That HTTP request is probably what is causing the worker to be killed by the CRITICAL TIMEOUT. You may want to set a timeout on the outgoing HTTP request or raise the timeout setting in your configuration.
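For example, the outbound budget must stay smaller than gunicorn's --timeout so the handler returns before the arbiter gives up. The values and function name here are illustrative, not part of the reporter's code:

```python
import urllib.request

def fetch_upstream(url, budget=5):
    # With a sync worker and --timeout 10, capping the outbound call at 5s
    # leaves the handler time to finish before the worker is killed.
    with urllib.request.urlopen(url, timeout=budget) as resp:
        return resp.read()
```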
@RyPeck Yes, I have set a timeout on the outgoing POST requests and I handle the timeout exception normally.
Hrm, what's the status of this? Is it still true with the latest release? Anything that could help reproduce it?
from flask import Flask, Response

app = Flask(__name__)

@app.route('/large.csv')
def generate_large_csv():
    def generate():
        for row in range(5000000):
            line = []
            for col in range(500):
                line.append(str(col))
            yield ','.join(line) + '\n'
    return Response(generate(), mimetype='text/csv')
@benoitc maybe you can use this code to reproduce the issue.
It's not reasonable for gunicorn to time out the connection when the response is still transferring.
@season89 maybe you can open a new issue. This issue starts with a discussion about problems possibly due to gevent and using gevent workers, not sync workers.
I met the same issue when I was loading large word vectors (about 2.5GB).
[2018-07-06 15:17:18 +0800] [2820] [INFO] Starting gunicorn 19.7.1
[2018-07-06 15:17:18 +0800] [2820] [INFO] Listening at: http://127.0.0.1:4477 (2820)
[2018-07-06 15:17:18 +0800] [2820] [INFO] Using worker: sync
[2018-07-06 15:17:18 +0800] [2838] [INFO] Booting worker with pid: 2838
[2018-07-06 15:17:18 +0800] [2845] [INFO] Booting worker with pid: 2845
Begin to load vocabularies...
Begin to load vocabularies...
Vocabularies have been loaded...
Vocabularies time: 2.74869704246521
Begin to load WORD2VEC...
Vocabularies have been loaded...
Vocabularies time: 2.758488893508911
Begin to load WORD2VEC...
[2018-07-06 15:17:48 +0800] [2820] [CRITICAL] WORKER TIMEOUT (pid:2838)
[2018-07-06 15:17:48 +0800] [2820] [CRITICAL] WORKER TIMEOUT (pid:2845)
[2018-07-06 15:17:48 +0800] [2845] [INFO] Worker exiting (pid: 2845)
[2018-07-06 15:17:48 +0800] [2838] [INFO] Worker exiting (pid: 2838)
[2018-07-06 15:17:48 +0800] [3309] [INFO] Booting worker with pid: 3309
[2018-07-06 15:17:49 +0800] [3317] [INFO] Booting worker with pid: 3317
Begin to load vocabularies...
Begin to load vocabularies...
Vocabularies have been loaded...
Vocabularies time: 2.5433778762817383
Begin to load WORD2VEC...
Vocabularies have been loaded...
Vocabularies time: 2.576946973800659
Begin to load WORD2VEC...
It's not reasonable that gunicorn times out the connection when the response is still transferring.
The sync worker is not expected to stay busy for a long time; if it were, it would prevent other connections from being accepted. By design, it times out if it's kept busy too long. You will need to use an async worker, or the threaded worker with a larger timeout, to achieve this. Another kind of worker will allow that soon as well, but it's not ready yet.
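The threaded alternative mentioned above is selected with the gthread worker class; a minimal config sketch might look like this (values are illustrative, not tuned recommendations):

```python
# gunicorn.conf.py -- illustrative values only
worker_class = "gthread"   # threaded worker: requests run in a thread pool
workers = 2                # worker processes
threads = 8                # concurrent requests per process
timeout = 120              # the worker heartbeats from its main loop, so a
                           # slow request is less likely to trip the arbiter
```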
Any updates on this? I am facing the same issue.
Update: I've fixed the issue; it seems to have been a RAM issue. Running more workers than the available RAM supports makes gunicorn lose connections, so adjust the workers and worker connections.
Closing the discussion since there has been no activity for a while. Also, sendfile has been improved in 20.0.1. If you can still reproduce the issue, please open a new ticket :)
Just a quick note that I was experiencing something similar, with gevent causing [WARNING] Worker graceful timeout in gunicorn. See https://github.com/abourget/gevent-socketio/issues/147 for a workaround.
I have a Django app using StreamingHttpResponse that fails when the gunicorn worker unexpectedly times out. For testing purposes I deployed it on CentOS 7 using nginx (Django-1.8.8, gevent-1.0.2, greenlet-0.4.9) and gevent async workers (proxy_buffering off).
gunicorn log:
nginx log:
There are no other errors.