envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0

RFC: Hot restart across hot restart versions with SO_REUSEPORT #3804

Open bplotnick opened 6 years ago

bplotnick commented 6 years ago

Title: Hot restart across hot restart versions with SO_REUSEPORT

Description: I'd like to collect some feedback on a way to do hot restarting across hot restart versions.

Problem

There is no standard way to upgrade when the hot restart version changes (or if there is, I am not aware of it). The current recommendation is for "operations to cope with this and do a full restart".

Barring somehow making hot restart data-structures backwards compatible forever, the only way to do this would be something like a prolonged cluster drain for upgrades. This takes a considerable amount of effort and time and may be impractical for cases where you do not have elastic infrastructure.

Proposed solution

For systems that support it, we can use SO_REUSEPORT on the listener sockets for a "less hot restart". We'd have a second Envoy process start up using a different shared memory region (base-id). We'd lose stats, but this would be no different than the current solution of doing a full restart.
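To make the idea concrete, here is a minimal sketch (not Envoy code; `make_reuseport_listener` is a hypothetical helper, and SO_REUSEPORT support, e.g. Linux >= 3.9, is assumed) of the kind of listener a second process could open on the same port:

```cpp
// Minimal sketch: a second process can bind the same address/port as an
// existing listener, provided both sockets set SO_REUSEPORT before bind().
#include <cstdint>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int make_reuseport_listener(uint16_t port) {
  int fd = socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0) return -1;

  // With SO_REUSEPORT, the kernel load-balances incoming connections across
  // the accept queues of every socket bound to this port, old and new alike.
  int one = 1;
  if (setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one)) < 0) {
    close(fd);
    return -1;
  }

  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_ANY);
  addr.sin_port = htons(port);
  if (bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) < 0 ||
      listen(fd, SOMAXCONN) < 0) {
    close(fd);
    return -1;
  }
  return fd;
}
```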

We would also need some way to shut down the parent process, which is done via shared memory/RPC right now. This could be done either with a wrapper coordinating the shutdown (e.g. having the hot restart wrapper take on a more active role in the restart process) or by telling the new process to shutdown the old process.

In the latter case, this can't be done the current way, since that relies on the RPC mechanism. One option would be to have a simplified core RPC protocol that never changes to enable this. Another option would be to pass the PID in and have Envoy send a signal when it is ready to shut down the parent process.
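A minimal sketch of the signal-based option (assuming a hypothetical `--parent-pid` flag that hands the old process's PID to the new one; SIGTERM stands in for whatever drain-and-exit signal the old process would be taught to honor):

```cpp
#include <signal.h>
#include <sys/types.h>

// Called by the new process once its listeners are up and serving.
// No shared memory or versioned RPC is involved, only a PID and a signal.
bool request_parent_shutdown(pid_t parent_pid) {
  return kill(parent_pid, SIGTERM) == 0;
}
```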

In either case, we'd have to enable/disable this behavior depending on the availability of SO_REUSEPORT.

Problem with SO_REUSEPORT

(This is mostly a rehash of issues discussed here: https://www.haproxy.com/blog/truly-seamless-reloads-with-haproxy-no-more-hacks/)

There is a problem with SO_REUSEPORT: race conditions exist that may cause traffic to be dropped. Specifically, there is a race where a connection is placed in the accept queue of the old process just before it calls close, and that connection is then reset when the listener closes.

We can either accept that these cases are infrequent and that some traffic may be dropped when these reloads happen, or we can implement one of a few different mitigating solutions, such as those discussed in the HAProxy post linked above; a sketch of one such mitigation follows the note below.

Note: This problem is apparently Linux-specific. I believe systems like OS X will send new connections to the last-bound socket, which will be the newest Envoy instance, so the fact that most of these solutions are Linux-specific is probably not an issue.
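As one illustration (my own sketch, not something from the linked post): the old process can drain whatever the kernel has already queued before closing its listener. This narrows the window but does not fully close the race, since a connection can still complete between the final accept and close:

```cpp
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

// Drain pending connections from the accept queue, then close the listener.
// Connections already queued are handed to the normal connection path
// instead of being reset when the listening socket is closed.
void drain_and_close(int listen_fd, void (*handle_connection)(int)) {
  fcntl(listen_fd, F_SETFL, fcntl(listen_fd, F_GETFL) | O_NONBLOCK);
  for (;;) {
    int conn = accept(listen_fd, nullptr, nullptr);
    if (conn < 0) break; // EAGAIN/EWOULDBLOCK: queue drained (or error)
    handle_connection(conn);
  }
  close(listen_fd);
}
```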

Alternative solutions

There are alternatives. One possibility, as I alluded to above, is to freeze the API for socket fd passing forever. We would have some simplified IPC mechanism just for FD passing that never changes and doesn't depend on things like the stats area size. I'm not sure of the feasibility of this, but it definitely feels like it would be the simplest to implement.
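For reference, the kernel already provides a stable primitive for this: SCM_RIGHTS over a Unix domain socket. A sketch of what a frozen, FD-passing-only message could look like (an illustration of the idea, not Envoy's actual hot restart RPC):

```cpp
#include <cstring>
#include <sys/socket.h>
#include <sys/uio.h>

// Send one listener fd over an already-connected Unix domain socket.
// The fd itself travels in the SCM_RIGHTS control message; the kernel
// duplicates it into the receiving process.
bool send_listener_fd(int uds_fd, int listener_fd) {
  char payload = 'F'; // one-byte payload; the real data is the control message
  iovec iov{&payload, sizeof(payload)};

  alignas(cmsghdr) char ctrl[CMSG_SPACE(sizeof(int))] = {};
  msghdr msg{};
  msg.msg_iov = &iov;
  msg.msg_iovlen = 1;
  msg.msg_control = ctrl;
  msg.msg_controllen = sizeof(ctrl);

  cmsghdr* cmsg = CMSG_FIRSTHDR(&msg);
  cmsg->cmsg_level = SOL_SOCKET;
  cmsg->cmsg_type = SCM_RIGHTS;
  cmsg->cmsg_len = CMSG_LEN(sizeof(int));
  std::memcpy(CMSG_DATA(cmsg), &listener_fd, sizeof(int));

  return sendmsg(uds_fd, &msg, 0) == 1;
}
```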

Another possibility is to use a "socket server" as described in this post. The downside is that it is another external process to coordinate. The upside is that it provides a level of separation of concerns.


So what do people think? Are there any solutions that aren't discussed here?

htuch commented 6 years ago

I would probably prefer that we switch hot restart IPC over to proto3 and use standard proto backwards compat rules, as done elsewhere in the data plane APIs. This should eliminate incompatible upgrades (or make them very rare). Hot restart is already pretty complicated (and not used by some large Envoy deployments such as at Google), so anything we can do to simplify and limit the amount of special cases is +1 from our perspective.

mattklein123 commented 6 years ago

The IPC protocol isn't the main issue IMO, as I don't think that has changed in probably 12-18 months and might never need to change again. There are many other things that can affect compatibility, such as the shared memory layout. I'm not sure how much proto3 will really help here and whether it's worth the churn (though it can't hurt).

The first thing that I would say is that I'm reticent to make any "forever" guarantees about anything right now. I just don't think that's reasonable given where we are in Envoy's lifetime. (See also the related discussion about filter API back compat here https://github.com/envoyproxy/envoy/issues/3390).

As @htuch mentioned, hot restart is already very complicated, and the thought of introducing a "light" version doesn't thrill me either. With that said, I understand the concern, and of the options you mention, doing a "light" version that just does socket passing, potentially using a proto3 API, seems the most reasonable to me. Obviously lots of details to think through.

ggreenway commented 6 years ago

I think this is an interesting idea. I also like the "light" hot-restart option the best. I don't know that we'd have to guarantee the protocol is stable forever; we'd just need to implement both the old and new protocols for some period of time when the protocol changes. And as noted, it may never change. I think this is straightforward enough; it's much simpler than trying to make data structures in shared memory backwards-compatible. And this approach wouldn't be limited to Linux, or suffer from any Linux-only weird behaviors.

alyssawilk commented 6 years ago

I have no attachment to our internal design, but if a data point is helpful, we've not found the SO_REUSEPORT race problematic. When we hot-swap servers, both listen for some time, then the old one stops listening and closes the listening socket (but continues to handle existing connections), and eventually shuts down. This has been good enough for Google-level reliability, so I think it is plausible if we don't want to do fd handoff.

That said, fast listener handoff (either explicit fd handoff, or serverB starting listening as close to serverA stopping listening as possible) is really the only viable option for QUIC, because the listening socket is unfortunately the same as the data-bearing socket. With QUIC, our strategy has been for the old server to maintain the listening socket but send GOAWAYs on all active connections to attempt to frontend-drain QUIC connections to other proxies. Then, at the end of the lame duck period, A stops listening, B starts listening, and whatever packets are lost in transition are minimal compared to packet loss on the internet (but any lingering connections to A are functionally force-killed).

ggreenway commented 6 years ago

This is just a brainstorming thought, and may not be a good idea, but:

We could use connected UDP sockets to ensure that packets on old "connections" go to the old envoy, and new ones go to the new envoy.
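Roughly, the trick would look like this (a sketch under the assumption that the kernel prefers a connected UDP socket over an unconnected one for a matching 4-tuple, as Linux generally does; `pin_udp_peer` is a hypothetical helper):

```cpp
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

// Open a second UDP socket on the same local address/port and connect() it
// to a specific peer. Datagrams from that peer should then be delivered to
// this socket (i.e. to the old process), while packets from new peers land
// on the new process's unconnected wildcard socket.
int pin_udp_peer(const sockaddr_in& local, const sockaddr_in& peer) {
  int fd = socket(AF_INET, SOCK_DGRAM, 0);
  if (fd < 0) return -1;
  int one = 1;
  setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
  if (bind(fd, reinterpret_cast<const sockaddr*>(&local), sizeof(local)) < 0 ||
      connect(fd, reinterpret_cast<const sockaddr*>(&peer), sizeof(peer)) < 0) {
    close(fd);
    return -1;
  }
  return fd;
}
```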

alyssawilk commented 6 years ago

You can try, but it turns out that pretty much no one uses connected sockets at scale - the two kernel devs who did a majority of our UDP CPU improvements strongly advised we stay the heck away from them :-(

stale[bot] commented 6 years ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.