benoitc / gunicorn

gunicorn 'Green Unicorn' is a WSGI HTTP Server for UNIX, fast clients and sleepy applications.
http://www.gunicorn.org

Graceful shutdown for tornado worker #2317

Closed tilgovi closed 1 month ago

tilgovi commented 4 years ago

Following from #1236, the tornado worker should behave like the other workers.

On graceful shutdown:

I think that the call to server.stop() is already stopping it from accepting, but I don't know if it closes the listener socket. I don't see any graceful timeout handling.

We should also double-check that calling server.stop() prevents open connections from being reused for keep-alive requests.
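For reference, a rough sketch of the sequence being asked for, using Tornado's HTTPServer.stop() and close_all_connections(); the function name and timeout value are illustrative, not the worker's actual code:

import asyncio

import tornado.httpserver

# Illustrative only: stop accepting, then drain open connections up to a
# graceful timeout, which is how the other workers behave on graceful shutdown.
async def graceful_shutdown(server: tornado.httpserver.HTTPServer,
                            graceful_timeout: float = 30.0) -> None:
    server.stop()  # stop accepting new connections
    try:
        # wait for in-flight (and keep-alive) connections to finish,
        # but no longer than the graceful timeout
        await asyncio.wait_for(server.close_all_connections(), graceful_timeout)
    except asyncio.TimeoutError:
        pass  # graceful timeout expired; remaining connections are dropped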

vgrebenschikov commented 3 years ago

The proposed solution will not help in the case of keep-alive sockets:

see proof in comment: https://github.com/benoitc/gunicorn/issues/1236#issuecomment-554555524

To shut down a keep-alive socket gracefully, we should do either one of the following (option 1 is sketched below):

  1. first respond to the next request with "Connection: close" (downgrade keep-alive to a close connection), then close the socket

or

  2. catch a window in which the client has not yet sent its next request (the socket has to be closed before the client fires the next request, otherwise that request will fail, which is bad)
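A minimal sketch of option 1 at the raw-socket level, outside gunicorn; the handler name and response payload are made up for illustration:

import socket

def downgrade_and_close(conn: socket.socket) -> None:
    # read the client's next request on the kept-alive connection (simplified)
    conn.recv(65536)
    body = b"bye\n"
    response = (
        b"HTTP/1.1 200 OK\r\n"
        b"Content-Length: " + str(len(body)).encode("ascii") + b"\r\n"
        b"Connection: close\r\n"  # tell the client not to reuse this socket
        b"\r\n" + body
    )
    conn.sendall(response)
    conn.close()  # closing now cannot race with a new request from the client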

p.s. also, the original issue #1236 affects us without any tornado, just plain gunicorn ... not sure why it was closed in favor of this one

benoitc commented 3 years ago

I don't think the spec expects such a socket to be closed gracefully if the request has not started. The client should normally expect a timeout. Do you have any example of a client that breaks with the current behavior?


vgrebenschikov commented 3 years ago

> I don't think the spec expects such a socket to be closed gracefully if the request has not started. The client should normally expect a timeout. Do you have any example of a client that breaks with the current behavior?

How to reproduce the problem (in fact, it can easily be observed on a busy production server during graceful shutdown):

Run gunicorn with keep-alive (trivial app returning some data):

$ gunicorn --max-requests 512 --keep-alive 2 --threads 20 --workers 4 t:app

Then run Apache Benchmark:

$ ab -n 10000 -c 20 -s 1 -k -r 127.0.0.1:8000/

...
Concurrency Level:      20
Time taken for tests:   2.693 seconds
Complete requests:      10000
Failed requests:        435

Note the > 4% failed requests, caused purely by restarted workers (in this case triggered by max-requests).

Run gunicorn without keep-alive (same app):

$ gunicorn --max-requests 512 --keep-alive 0 --threads 20 --workers 4 t:app
$ ab -n 10000 -c 20 -s 1 -k -r 127.0.0.1:8000/

...
Complete requests:      10000
Failed requests:        0

No failed requests.

The errors in the first case are caused by the server closing the keep-alive socket (due to max-requests) while the next request is in progress.

gunicorn 20.0.4
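For completeness, a guess at what the trivial t:app above might look like (the actual app is not shown in the report); any minimal WSGI app returning some data behaves the same way:

# t.py — hypothetical stand-in for the "trivial app returning some data"
def app(environ, start_response):
    body = b"hello world\n"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]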

tilgovi commented 3 years ago

As I answered in https://github.com/benoitc/gunicorn/issues/1236#issuecomment-766283684, the design is to send "Connection: close" once the server is shutting down and to wait for the graceful timeout before forcing connections closed. I'm able to reproduce what you see. Based on the distribution of request times, it seems impossible that connections are getting closed during the keep-alive timeout; instead they must be getting closed by the shutdown before they receive "Connection: close".

I'm not able to reproduce this with gevent, but I can reproduce it with eventlet and the threaded worker. I'll see if I can narrow down a bug in either of those.

tilgovi commented 3 years ago

I've narrowed down the issue in eventlet to a failing getsockname() call during graceful shutdown. I'll file a separate bug for that tomorrow and tackle it. I haven't isolated the issue for the threaded worker yet, but I believe there may be a bug. Thanks for your report @vgrebenschikov.
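As a rough illustration of that failure mode (not the exact eventlet code path): once a listening socket has been closed during shutdown, getsockname() raises OSError, which can break handling of connections that are still being served:

import socket

s = socket.socket()
s.bind(("127.0.0.1", 0))
s.listen()
s.close()  # the listener gets closed during graceful shutdown
try:
    s.getsockname()
except OSError as exc:
    print("getsockname() after close:", exc)  # [Errno 9] Bad file descriptor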

tilgovi commented 3 years ago

Alright, I've isolated the bug for the threaded worker, as well. I'll file issues or PRs to fix both tomorrow.

tilgovi commented 3 years ago

Here's the gist if you want to check my work and try it out before I get a chance to look at it with eyes that have slept:

diff --git a/gunicorn/workers/gthread.py b/gunicorn/workers/gthread.py
index d531811..ed70dcb 100644
--- a/gunicorn/workers/gthread.py
+++ b/gunicorn/workers/gthread.py
@@ -220,13 +220,37 @@ class ThreadWorker(base.Worker):
             # handle keepalive timeouts
             self.murder_keepalived()

-        self.tpool.shutdown(False)
-        self.poller.close()
-
+        # stop accepting new connections
         for s in self.sockets:
+            self.poller.unregister(s)
             s.close()

-        futures.wait(self.futures, timeout=self.cfg.graceful_timeout)
+        # handle existing connections until the graceful timeout
+        deadline = time.time() + self.cfg.graceful_timeout
+        while self.nr_conns:
+            # notify the arbiter we are alive
+            self.notify()
+
+            timeout = deadline - time.time()
+            if timeout <= 0:
+                break
+
+            # clean up finished requests
+            result = futures.wait(self.futures, timeout=0)
+            for fut in result.done:
+                self.futures.remove(fut)
+
+            # handle keepalive requests
+            events = self.poller.select(1.0)
+            for key, _ in events:
+                callback = key.data
+                callback(key.fileobj)
+
+            # handle keepalive timeouts
+            self.murder_keepalived()
+
+        self.tpool.shutdown(False)
+        self.poller.close()

     def finish_request(self, fs):
         if fs.cancelled():
@@ -238,7 +262,7 @@ class ThreadWorker(base.Worker):
             (keepalive, conn) = fs.result()
             # if the connection should be kept alived add it
             # to the eventloop and record it
-            if keepalive and self.alive:
+            if keepalive:
                 # flag the socket as non blocked
                 conn.sock.setblocking(False)

The first change ensures that the worker continues to handle in-progress keep-alive requests while shutting down. The second chunk fixes a bug that prevented keep-alive connections opened before the shutdown began from handling their next request and receiving the "Connection: close" they need to shut down gracefully.