Make graceful shut-down keep-alive behavior consistent

tilgovi commented 8 years ago

Following on from #922, the handling of keep-alive connections during graceful shutdown is not really specified anywhere and may not be consistent among workers.

[x] Describe the intended behavior
[x] Take stock of current behavior and make issues for each worker
[x] Ship it

benoitc commented 8 years ago

@tilgovi do you need anything from me there?

tilgovi commented 8 years ago

If you want to "describe the intended behavior" that would be helpful, otherwise I'll propose it.

benoitc commented 8 years ago

Sorry I missed your answer.

Graceful shutdown only means we let a time for the requests to finish.

1) when the signal is received, the workers stops to accept any new connections 2) At the graceful time, all still running client connections are closed (sockets are closed), kept alived or not.

Speaking of keepalive connections I think we should stop the request loop when the signal is received instead of accepting any new requests. Thoughts?

tuukkamustonen commented 7 years ago

At the graceful time, all still running client connections are closed (sockets are closed), kept alived or not.

@benoitc Considering a sync worker, without threads, I believe no other connections are aborted/lost than the current request in-process, because connections are queued at the master process and not at the worker process level?

If so, can you clarify what you mean by "all still running client connections are closed" - I assume you refer to threaded/async workers here (where multiple requests may be processed concurrently, compared to sync worker without threads)?

benoitc commented 6 years ago

@tuukkamustonen master doesn’t queue any connection. each workers is responsible to accept a connection. afaik connections are queued at the system level. When the master receive the hup signal it notify the worker about it and they stop to accept new connections. Then running connections (those already accepted) will have the graceful time to finish or be forcefully closed.

tuukkamustonen commented 6 years ago

afaik connections are queued at the system level

Ah, I wonder how that works / where it's instructed... well, no need to go that deep :)

When the master receive the HUP signal it notify the worker about it and they stop to accept new connections. Then running connections (those already accepted) will have the graceful time to finish or be forcefully closed.

Ok. This summarizes it nicely.

benoitc commented 6 years ago

@tilgovi we probably should close that issue?

tilgovi commented 6 years ago

I would like to keep this one open. I'm not convinced we have consistent behavior here yet.

vgrebenschikov commented 4 years ago

How to reproduce problem: (in fact problem easily can be observed on busy production during graceful shutdown)

Run gunicorn with keepalive (trivial app returning some data) $ gunicorn --max-requests 512 --keep-alive 2 --threads 20 --workers 4 t:app

run apache benchmark:

$ ab -n 10000 -c 20 -s 1 -k -r 127.0.0.1:8000/

... Concurrency Level: 20 Time taken for tests: 2.693 seconds Complete requests: 10000 Failed requests: 435

See > 4% failed requests just due to restarted workers (in this case by max-requests)

Run gunicorn without keepalive (same app) $ gunicorn --max-requests 512 --keep-alive 0 --threads 20 --workers 4 t:app

$ ab -n 10000 -c 20 -s 1 -k -r 127.0.0.1:8000/

... Complete requests: 10000 Failed requests: 0

See no failed requests

tried on gunicorn up to 20.0.0

vgrebenschikov commented 4 years ago

Probably it worth resolve problem in a following way:

In case of graceful shutdown on keep-alive connection try to serve one more request after graceful shutdown request and send Connection: close in response to force sender not use this socket any more for next request, if no request arrived in reasonable timeframe (i.e. 1s) just close connection.

Yes, there is small possibility for race (when server decides to close when client sends request), but this will completely close window for problems on heavy load when requests are followed one by one.

benoitc commented 4 years ago

cc @tilgovi ^^

In fact there is two schools there imo

we consider the connection is already living so we could accept one or more quest inside during the graceful time (what you're describing). Ie we keep open the connections as long as we receive a request on them and we are in the graceful time.
if HUP or USR2 is sent to gunicorn, it means we want to change as fast as possible the configuration or the application version. In such case the gracefultime is here so we make sure we terminate the current requests cleanly (finishing a transaction, etc...). But we don't want any new request on the old version. In such case it make sense to also not accept new requests on keepalived connections and close them once the current request terminate.

I'm in favour of 2 which may be more safe. Thoughts?