django / daphne

Django Channels HTTP/WebSocket server
BSD 3-Clause "New" or "Revised" License

Implement Graceful Shutdown / Connection Draining #177

Open agronick opened 6 years ago

agronick commented 6 years ago

This was present in the old version of Channels. The changelog says:

0.9.4 (2016-03-08)

This is no longer the case.

With the new architecture in Channels 2, this ability will need to move into Daphne. Daphne cannot simply stop running; it will need some kind of API to load new code while continuing to service existing connections on the old processes.

andrewgodwin commented 6 years ago

To clarify a bit more - this ticket will just be for graceful shutdown (connection draining), as restarting/reloading is much more complicated and will require us to do things with separate processes, which I am not keen to take on at the moment.

agronick commented 6 years ago

Yeah, I'm not sure what benefit that provides, though. Once you stop accepting connections you need something to take its place. The only way you can do that is with a proxy in front of Daphne. If the proxy is routing connections to another instance, connection draining would prevent something that wouldn't happen anyway.

Unless I'm missing something and there is a way to bind two processes to a port or socket or something.

andrewgodwin commented 6 years ago

Graceful shutdown is mostly so you can prevent new connections while you close out old ones, which is especially useful for WebSockets, which are more stateful than HTTP.

New Linux kernels do in fact allow you to bind two processes to a port using SO_REUSEPORT (https://lwn.net/Articles/542629/) - it would probably be nice to add support for this into Daphne to get full switchover without a separate loadbalancer.
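A minimal sketch of what SO_REUSEPORT enables (illustrative only, not Daphne code): on Linux 3.9+, two independent processes can each hold a listener on the same port, so a new server generation can start accepting connections while the old one drains out.

```python
import socket

# Sketch of SO_REUSEPORT on Linux 3.9+: two independent sockets bound
# to the same port, the way an old and a new server process could
# share a listener during a switchover.
def make_listener(port):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind(("127.0.0.1", port))
    s.listen(128)
    return s

if __name__ == "__main__":
    old = make_listener(0)            # "old" server grabs a free port
    port = old.getsockname()[1]
    new = make_listener(port)         # "new" server binds the same port
    old.close()                       # old process drains out and exits
    new.close()
```

Without SO_REUSEPORT set on both sockets, the second `bind()` would fail with "Address already in use", which is why a separate load balancer is otherwise needed for switchover.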

agronick commented 6 years ago

Oh, that's awesome.

andrewgodwin commented 6 years ago

As discussed on #182, SO_REUSEPORT is unfortunately not going to be easy in the short term, so instead we'll have to rely on people using the --fd option with process managers.

As for how to restart without losing connections in general, the best way right now is to use a load balancer (e.g. HAProxy) or a process manager that supports graceful restarts itself, and swap servers in and out as you change them over. Not ideal, I know: it only really works at large scale with automation. Hopefully I'll have time for a proper graceful restart soon.
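The `--fd` approach mentioned above works because the process manager, not the server, owns the listening socket, so the port is never released between restarts. A sketch of the handoff, simulated in one process (this is an illustration, not how Daphne or any particular process manager is implemented):

```python
import os
import socket

def inherit_listener(manager_sock):
    """Wrap a duplicated fd the way a child handed `--fd N` would."""
    fd = os.dup(manager_sock.fileno())     # a real manager passes this via fork/exec
    return socket.socket(fileno=fd)

if __name__ == "__main__":
    # The process manager owns the listening socket for the lifetime
    # of the service, across any number of server generations.
    manager = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    manager.bind(("127.0.0.1", 0))
    manager.listen(16)

    child = inherit_listener(manager)      # new server generation, same port
    assert child.getsockname() == manager.getsockname()
    child.close()
    manager.close()
```

Circus and systemd socket activation both follow this pattern: the manager binds once and each worker generation accepts on the inherited descriptor.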

agronick commented 6 years ago

So will Daphne handle SIGINT by exiting after all connections terminate with the current codebase?

andrewgodwin commented 6 years ago

It won't until I implement it, which is why this ticket is still open. Right now it will just hard-exit.
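For reference, a graceful SIGINT handler along these lines could look like the following asyncio sketch (illustrative only, not Daphne's implementation): stop accepting new connections, let in-flight handlers finish, then exit.

```python
import asyncio
import signal

# Tasks for currently-open connections, so shutdown can wait on them.
active = set()

def track(coro):
    """Register a connection handler so shutdown can drain it."""
    task = asyncio.ensure_future(coro)
    active.add(task)
    task.add_done_callback(active.discard)
    return task

async def handle(reader, writer):
    data = await reader.read(1024)    # one trivial echo "request"
    writer.write(data)
    await writer.drain()
    writer.close()
    await writer.wait_closed()

async def serve_until_sigint():
    server = await asyncio.start_server(
        lambda r, w: track(handle(r, w)), "127.0.0.1", 0)
    stop = asyncio.Event()
    asyncio.get_running_loop().add_signal_handler(signal.SIGINT, stop.set)
    await stop.wait()
    server.close()                    # refuse new connections
    await server.wait_closed()
    if active:                        # drain: wait for existing handlers
        await asyncio.gather(*active)
```

The key ordering is close-then-drain: the listener is closed first so the load balancer stops sending traffic, and only then does the process wait for the remaining connections.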

agronick commented 6 years ago

If it did that it seems Circus would work fine. The file descriptor feature appears to work well with Circus.

Edit: After spending some more time with this, the best solution I found was to put HAProxy after Nginx. It's heavier than I would have hoped, but it allows me to set up multiple instances and put them into "drain" mode one by one. It has a web UI, and once an instance is drained I can load the new code and the users don't notice anything.
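The drain setup described might look roughly like the following `haproxy.cfg` fragment (all names, ports, and timeouts are placeholders, not taken from this thread):

```
# Illustrative fragment; names and ports are placeholders.
defaults
    mode http
    timeout connect 5s
    timeout client  1h      # long timeouts keep WebSockets alive
    timeout server  1h

backend daphne
    balance leastconn
    server daphne1 127.0.0.1:8001 check
    server daphne2 127.0.0.1:8002 check

listen stats
    bind 127.0.0.1:8404
    mode http
    stats enable
    stats uri /stats
    stats admin if TRUE     # enables the drain/maint controls in the UI
```

Setting a server to "drain" (via the stats UI or the admin socket) stops new connections to it while existing WebSockets stay open, which is the behavior described above.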

karolyi commented 6 years ago

+1, subscribing for notifications

acu192 commented 5 years ago

To solve this problem, I started using Uvicorn & Gunicorn (those names make me laugh every time I write them...). Gunicorn can deploy your new code by spinning up new workers for you, then gracefully shutting down your old workers, so that you have no downtime. See the HUP signal here. Uvicorn implements ASGI and has a plugin for Gunicorn. See here. I was able to use those as a drop-in replacement for Daphne (no changes needed to my channels code).
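The setup described above might be launched along these lines (the module path, PID file, and flags are placeholders, not from this thread):

```shell
# Placeholders: adjust module path, worker count, and bind address.
pip install gunicorn uvicorn
gunicorn myproject.asgi:application \
    --worker-class uvicorn.workers.UvicornWorker \
    --workers 4 \
    --bind 127.0.0.1:8000 \
    --pid /tmp/gunicorn.pid

# Code reload: HUP tells the Gunicorn master to start fresh workers
# and gracefully retire the old ones.
kill -HUP "$(cat /tmp/gunicorn.pid)"
```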

Turns out it has a nice side effect too... a very nice side effect... it's like 10x faster (at least for my deployment; of course your mileage may vary). By "faster" I mean my server's CPU usage is much lower now. My server used to sit at ~20% CPU when I ran my "pummel the server" script. Same script, new interface server, CPU barely hits 2%. I rolled back to Daphne just to double-check it! It holds.

I'm using Nginx as a proxy in front of Gunicorn. One weird thing I ran into is that if I had Nginx proxy to Gunicorn over a Unix socket, I would get a weird exception somewhere deep inside channels (at request time). If I proxy from Nginx to Gunicorn over TCP, it all works great. So that's where I left it. I didn't look into it further -- just something to be aware of if you try it out.
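A TCP (rather than Unix-socket) Nginx proxy block with the upgrade headers WebSockets need might look like this (the upstream address is a placeholder):

```nginx
# Illustrative fragment; upstream address is a placeholder.
location / {
    proxy_pass http://127.0.0.1:8000;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_set_header Host $host;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
}
```

`proxy_http_version 1.1` plus the `Upgrade`/`Connection` headers are what let the WebSocket handshake pass through Nginx.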

agronick commented 5 years ago

@acu192 Does the HUP handling actually work for you? I've tried it myself, but the HUP signal causes it to reload immediately and drop all of its websocket connections.

acu192 commented 5 years ago

Yeah, it will drop websocket connections, but any "normal" HTTP connections should be drained gracefully before the old workers are shut down (I haven't tested it super-well, but it does seem to work based on some basic experimentation I've done -- I've only had this setup for a few days now). I don't know of a way to not have the websocket connections drop... since it's a long-lived TCP connection, if the connection-holding process dies it will have to drop. The only solution I know of would be to let those old workers live a long time to hold open those old websocket connections (I don't want to do that). Or do something like channels 1 did, where it had an entirely separate interface server (as its own process) which communicated with the workers via redis (or whatever channel layer). I was never a fan of that though -- channels 2 is way better in my opinion by having the workers be the interface servers as well.

In my case I don't mind if the websocket connections drop. They'll quickly reconnect and the user will never know. As long as the "normal" HTTP connections are all served (i.e. no one sees an error message when loading the page for the first time), then I'm happy in my case.

agronick commented 5 years ago

In my case I really need the websocket connections to drain. I have some pages where it wouldn't matter, but we are doing things like web-based SSH sessions. HAProxy is the only way I've found to drain websocket connections.


karolyi commented 5 years ago

@agronick,

I can understand your problem with having the websockets disconnected, but as I've been told, websocket client connections should be built to withstand disconnections and reconnect/resync gracefully, without letting the user know (being practically stateless). For the most part, this is what many websocket clients do. I've built several that aren't even browser-based, and every time they reconnect, they either exchange synchronization information with the server or assume everything continues as it was before. YMMV, but this should be the case most of the time.

Maybe you want to put some extra connection handler into your client/server logic to handle disconnects.
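The reconnect-and-resync pattern described above can be sketched like this (the `connect` and `resync` callables are placeholders for whatever client library and sync protocol you use; this is not code from the thread):

```python
import random
import time

def run_with_reconnect(connect, resync, max_attempts=5, base_delay=0.01):
    """Retry `connect` with jittered exponential backoff, then resync."""
    delay = base_delay
    for _ in range(max_attempts):
        try:
            conn = connect()
        except ConnectionError:
            # jittered backoff avoids a thundering herd on server restart
            time.sleep(delay * (0.5 + random.random() / 2))
            delay = min(delay * 2, 1.0)
            continue
        resync(conn)   # exchange sync info so the user sees no gap
        return conn
    raise ConnectionError("gave up after %d attempts" % max_attempts)
```

The resync step is what makes server restarts invisible: the client replays or requests whatever state accumulated while it was disconnected.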


agronick commented 5 years ago

We're not talking about the user disconnecting and reconnecting. We're talking about the process dying and a new one rebuilding the previous process's state in memory. Some things just can't be serialized and persisted. Other things aren't worth an exponentially larger development effort when connection draining solves the problem fine. Especially sockets: I don't know if there is even a way to hand a socket off from a process that is shutting down to a new process.

andrewgodwin commented 5 years ago

If you find uvicorn works better for you, then please use it! Daphne is a reference server and doesn't see as much active development, so it will likely never beat uvicorn.

acu192 commented 5 years ago

@andrewgodwin Thank you for working so hard to build channels! Btw, channels 2 is wonderful. All the changes are well worth breaking the interface from channels 1. It's great to see other projects (like uvicorn) adopting the ASGI standard as well. Very well done.

Ken4scholars commented 4 years ago

Seems this ticket was left behind. @andrewgodwin @carltongibson, any plans in the near future to fix this? Thank you.

carltongibson commented 4 years ago

@Ken4scholars No immediate plans, no. The next priority is getting fully ready for Django 3.1 — which mostly involves making Channels ASGI v3 ready, and updating the documentation there.

If you would like to contribute then here is an opportunity!

ben-xo commented 1 year ago

This is something I'd be interested in as well.

cbeaujoin-stellar commented 5 months ago

Any updates?