The proposed solution will not help in the case of keep-alive sockets;
see the proof in this comment: https://github.com/benoitc/gunicorn/issues/1236#issuecomment-554555524
To shut down a keep-alive socket gracefully, we should do either one of the following (see the sketch after this list):
- first respond to the request with Connection: close (downgrade keep-alive to a closing connection), then close the socket
or
- catch a window in which the client has not yet sent its next request (the close has to land before the client fires that request, otherwise the request fails, which is bad)
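A minimal sketch of the first option, assuming a bare raw-socket HTTP handler; respond_and_close and shutting_down are names I made up, this is not gunicorn's API:

import socket

def respond_and_close(conn: socket.socket, body: bytes, shutting_down: bool) -> bool:
    """Send one HTTP/1.1 response; return True if the connection stays open."""
    keep_alive = not shutting_down
    headers = (
        b"HTTP/1.1 200 OK\r\n"
        b"Content-Length: " + str(len(body)).encode("ascii") + b"\r\n"
        b"Connection: " + (b"keep-alive" if keep_alive else b"close") + b"\r\n"
        b"\r\n"
    )
    conn.sendall(headers + body)
    if not keep_alive:
        conn.close()  # the client was just told the connection will not be reused
    return keep_alive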
P.S. The original issue #1236 also affects us without any tornado, just plain gunicorn; not sure why it was closed in favor of this one.
I don't think the spec expects such a socket to be closed gracefully if the request has not started. The client should normally expect a timeout. Do you have any example of a client that breaks with the current behavior?
How to reproduce the problem (in fact, the problem can easily be observed on a busy production system during graceful shutdown):
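The commands below use t:app, the "trivial app returning some data"; the original module is not shown in the report, so here is a minimal placeholder of my own:

# t.py - placeholder WSGI module standing in for the trivial app (assumed, not from the report)
def app(environ, start_response):
    body = b"hello\n"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]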
Run gunicorn with keep-alive (trivial app returning some data):
$ gunicorn --max-requests 512 --keep-alive 2 --threads 20 --workers 4 t:app
Run Apache Benchmark:
$ ab -n 10000 -c 20 -s 1 -k -r 127.0.0.1:8000/
...
Concurrency Level:      20
Time taken for tests:   2.693 seconds
Complete requests:      10000
Failed requests:        435
Note the more than 4% failed requests (435 of 10000), caused purely by restarted workers (in this case restarted by max-requests).
Run gunicorn without keep-alive (same app) and repeat the benchmark:
$ gunicorn --max-requests 512 --keep-alive 0 --threads 20 --workers 4 t:app
$ ab -n 10000 -c 20 -s 1 -k -r 127.0.0.1:8000/
...
Complete requests:      10000
Failed requests:        0
This time there are no failed requests.
The errors in the first case are caused by the server closing a keep-alive socket (due to max-requests) while the next request is in progress. gunicorn 20.0.4.
As I answered in https://github.com/benoitc/gunicorn/issues/1236#issuecomment-766283684, the design is to send "Connection: close" once the server is shutting down and to wait for the graceful timeout before forcing connections closed. I'm able to reproduce what you see. Based on the distribution of request times, it seems impossible that connections are getting closed during the keep-alive timeout; instead they must be getting closed by the shutdown before they receive "Connection: close".
I'm not able to reproduce this with gevent, but I can reproduce it with eventlet and the threaded worker. I'll see if I can narrow down a bug in either of those.
I've narrowed down the issue in eventlet to a failing getsockname() call during graceful shutdown. I'll file a separate bug for that tomorrow and tackle it. I haven't isolated the issue for threaded workers yet, but believe there may be a bug. Thanks for your report @vgrebenschikov.
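(For context, a generic illustration of that failure mode, my own and not taken from eventlet: once a listener socket has been closed, its name can no longer be queried.)

import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(("127.0.0.1", 0))
s.listen()
print(s.getsockname())   # works: ('127.0.0.1', <ephemeral port>)
s.close()
try:
    s.getsockname()      # raises OSError (bad file descriptor)
except OSError as exc:
    print("getsockname() after close():", exc)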
Alright, I've isolated the bug for the threaded worker, as well. I'll file issues or PRs to fix both tomorrow.
Here's the gist if you want to check my work and try it out before I get a chance to look at it with eyes that have slept:
diff --git a/gunicorn/workers/gthread.py b/gunicorn/workers/gthread.py
index d531811..ed70dcb 100644
--- a/gunicorn/workers/gthread.py
+++ b/gunicorn/workers/gthread.py
@@ -220,13 +220,37 @@ class ThreadWorker(base.Worker):
             # handle keepalive timeouts
             self.murder_keepalived()
 
-        self.tpool.shutdown(False)
-        self.poller.close()
-
+        # stop accepting new connections
         for s in self.sockets:
+            self.poller.unregister(s)
             s.close()
 
-        futures.wait(self.futures, timeout=self.cfg.graceful_timeout)
+        # handle existing connections until the graceful timeout
+        deadline = time.time() + self.cfg.graceful_timeout
+        while self.nr_conns:
+            # notify the arbiter we are alive
+            self.notify()
+
+            timeout = deadline - time.time()
+            if timeout <= 0:
+                break
+
+            # clean up finished requests
+            result = futures.wait(self.futures, timeout=0)
+            for fut in result.done:
+                self.futures.remove(fut)
+
+            # handle keepalive requests
+            events = self.poller.select(1.0)
+            for key, _ in events:
+                callback = key.data
+                callback(key.fileobj)
+
+            # handle keepalive timeouts
+            self.murder_keepalived()
+
+        self.tpool.shutdown(False)
+        self.poller.close()
 
     def finish_request(self, fs):
         if fs.cancelled():
@@ -238,7 +262,7 @@ class ThreadWorker(base.Worker):
             (keepalive, conn) = fs.result()
             # if the connection should be kept alived add it
             # to the eventloop and record it
-            if keepalive and self.alive:
+            if keepalive:
                 # flag the socket as non blocked
                 conn.sock.setblocking(False)
The first change is to ensure that the worker continues to handle keepalive requests in progress while shutting down. The second chunk fixes a bug preventing keepalive requests that started before the shutdown began from handling their next request and receiving the "Connection: close" that they need to shut down gracefully.
Following from #1236, the tornado worker should behave like the other workers.
On graceful shutdown, the worker should stop accepting new connections, answer requests on existing keep-alive connections with "Connection: close", and wait for the graceful timeout before forcing connections closed.
I think that the call to server.stop() is already stopping it from accepting, but I don't know if it closes the listener socket. I don't see any graceful timeout handling. We should also double-check that calling server.stop() prevents open connections from being reused for keep-alive requests.
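For reference, a rough sketch of how those pieces could fit together, assuming Tornado >= 5 on the asyncio event loop; graceful_stop, server, and graceful_timeout are stand-ins of mine, not the actual gunicorn tornado worker code:

import asyncio

async def graceful_stop(server, graceful_timeout):
    # server is assumed to be a tornado.httpserver.HTTPServer
    server.stop()  # stop listening; in-progress and idle keep-alive connections are untouched
    try:
        # close_all_connections() closes each open connection and waits for its
        # handler to finish; bound that wait by the graceful timeout
        await asyncio.wait_for(server.close_all_connections(), graceful_timeout)
    except asyncio.TimeoutError:
        # graceful timeout expired with connections still open
        pass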