cc @rshriram
Yeah I think we can probably add listener drain configuration to better determine what happens when a listener enters drain mode. E.g., one option could be to just start refusing connections immediately when draining starts.
In Approach 2 above, typically a different signal is used for a graceful stop, such as SIGHUP. SIGTERM's well-known behavior should not be altered.
Note also, in the observations for approach 2 above, that keeping the listener alive after SIGTERM is clearly erroneous, a bug in its own right; that is unrelated to the desire for TCP drain functionality.
That may be a formatting oversight in the issue. I'm pretty sure Envoy just quits on SIGTERM. Let me fix that in the issue description.
Now, does Envoy support graceful shutdown?
To sum up my research to date: any of the three pathways above ought to work, but none will, due to the underlying libevent listener implementation in 2.1.8, released Feb of '17.
It appears deliberate that disabling a listener simply removes the socket from the poll queue, collecting connections in the backlog until the listener is re-enabled.
Work on the listener implementation has been merged to the 2.2.0 dev master, so I am working on a minimal repro case to discuss as a libevent ticket, solicit workarounds effective for 2.0/2.1/2.2, and determine whether this represents a defect or an enhancement.
Can you elaborate on what issues you're having with libevent?
I'm working on implementing something similar right now. I'm trying to solve a slightly different problem, but it has a lot of overlap. I do have Envoy closing its listener, though (as reported by netstat).
See #3307. Would that resolve this issue for you?
@ggreenway Yes, thank you.
In the future, we may want a means for certain listeners to opt-out of this behavior. For example, in a sidecar proxy scenario, during the shutdown & drain time, we'd want to close the ingress listeners but leave any egress listeners open until the very end. But we can wait on that enhancement. For now, #3307 would be great.
@ggreenway @rosenhouse note that we already support selective per-listener drain options via https://github.com/envoyproxy/envoy/blob/master/api/envoy/api/v2/lds.proto#L121 (and use it for same reasons at Lyft). @ggreenway please also consider ^ when you are looking into the final solution.
@rosenhouse For clarity, in your ideal scenario, can you write a timeline (or ordered list) of when ingress and egress listeners would be drained and closed?
@emalm does that look right?
Thanks
Thanks, @rosenhouse, that sequence looks right to me!
Not sure if this is important to you or not, but 1 caveat to keep in mind: if you have no way to tell your clients to stop connecting before you close listeners (to cause TCP clients to see connection refused), the listener may have some connections in the listen() backlog when the listener is closed. There is no race-free way to avoid this with the socket API. Those connections will get RST, but they may have already sent some request data.
@ggreenway interesting, I wasn't aware of that. Do you have any references you could point to that describe this?
The documentation is scarce, and results may be OS-dependent. I did testing on linux, using a simple test client and server application. The server calls listen() but never calls accept(). The client connects (using blocking socket calls), then sends some bytes. tcpdump shows the client SYN, server SYN/ACK, client ACK, and the client sending data, and the server ACK'ing the data.
The reason is that, at least on Linux, accept() will always return a fully-established connection socket, not a half-open socket. So to meet that requirement, the server OS must tell the client that it is connected. Once it is connected, the client can send at least one TCP window's worth of data.
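For anyone who wants to see this for themselves, here is a minimal sketch of the test described above (plain C sockets, assuming Linux and loopback; the port number and payload are my own illustrative choices, not anything from Envoy): the server listens but never accepts, yet the client's connect() and write() both succeed.

```c
/* Illustrative only: a client completes the handshake and sends data even
 * though the server never calls accept(); the kernel finishes the handshake
 * and queues the connection in the listen() backlog. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(9090);                  /* arbitrary port */
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

    /* "Server": listen() but never accept(). */
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;
    setsockopt(lfd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
    bind(lfd, (struct sockaddr *)&addr, sizeof(addr));
    listen(lfd, 128);

    /* "Client": connect() succeeds and write() is ACKed by the kernel,
     * even though no application ever accepted the connection. */
    int cfd = socket(AF_INET, SOCK_STREAM, 0);
    if (connect(cfd, (struct sockaddr *)&addr, sizeof(addr)) == 0) {
        ssize_t n = write(cfd, "GET / HTTP/1.1\r\n", 16);
        printf("client connected and wrote %zd bytes with no accept()\n", n);
    }
    close(cfd);
    close(lfd);
    return 0;
}
```

If you then close the listening socket, that queued-but-never-accepted connection is the one that gets RST, which is the race being described.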
@ggreenway I see we've been mulling the same issue.
I'm also convinced libevent is wrong; from http://www.wangafu.net/~nickm/libevent-book/Ref8_listener.html, "Enabling and disabling an evconnlistener": `int evconnlistener_disable(struct evconnlistener *lev);` `int evconnlistener_enable(struct evconnlistener *lev);` "These functions temporarily disable or reenable listening for new connections."
They do no such thing, of course; the listener remains open, accepting connections, while the event loop stops accepting them during the disabled period.
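A minimal sketch of that behavior (assuming libevent 2.x on Linux; the port and variable names are my own, purely illustrative, not Envoy code): after `evconnlistener_disable()` the fd is simply no longer polled, but the socket created by `evconnlistener_new_bind()` is still bound and listening, so the kernel keeps completing handshakes into the backlog.

```c
/* Hypothetical repro sketch showing that evconnlistener_disable() only stops
 * the event loop from accepting; the listening socket itself stays open. */
#include <arpa/inet.h>
#include <event2/event.h>
#include <event2/listener.h>
#include <event2/util.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void accept_cb(struct evconnlistener *lev, evutil_socket_t fd,
                      struct sockaddr *addr, int socklen, void *ctx) {
    /* Never reached once the listener is disabled. */
    printf("accepted fd %d\n", (int)fd);
    evutil_closesocket(fd);
}

int main(void) {
    struct event_base *base = event_base_new();

    struct sockaddr_in sin;
    memset(&sin, 0, sizeof(sin));
    sin.sin_family = AF_INET;
    sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    sin.sin_port = htons(9090);                /* arbitrary port */

    struct evconnlistener *lev = evconnlistener_new_bind(
        base, accept_cb, NULL,
        LEV_OPT_CLOSE_ON_FREE | LEV_OPT_REUSEABLE, -1,
        (struct sockaddr *)&sin, sizeof(sin));

    /* "Drain": remove the listener from the poll set. The socket is still
     * bound and listening, so the kernel keeps completing handshakes and
     * queues them in the backlog; clients never see a refusal. */
    evconnlistener_disable(lev);

    sleep(60);   /* observe with netstat / nc while this sleeps */

    /* Only this actually closes the listening socket and refuses clients. */
    evconnlistener_free(lev);
    event_base_free(base);
    return 0;
}
```

While the sketch sleeps, the port still shows as LISTEN and new clients can complete connect(); only freeing the listener (closing the fd) makes them see a refusal.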
@rosenhouse Greg's answer above is good; James Briggs calls out how this can be accomplished with iptables: http://www.jebriggs.com/blog/2016/10/linux-graceful-service-shutdown-techniques/
Stevens' Unix Network Programming, vol. 1, section 4.5 goes into quite a bit of detail across the variety of network stack implementations, and section 30.7 covers the two queue lengths. Linux does not implement changing the backlog queue length to 0 (queueing incoming SYNs to later be dropped).
I'm beginning to look at whether packet filtering might help us here, giving us the choice of either dropping the SYN on the floor or answering the SYN with an RST.
@wrowe technically I understand why all of this is being discussed, but practically, it's really hard for me to understand why this makes operational sense for any deployment. This is exactly why Envoy supports health check draining. In what situation are you not able to drain traffic via failing health checks of some type?
I understand your point about libevent. I imagine that was written from the point of view of a library that provides events for things (and also happens to manage sockets sometimes). I think it could be rephrased more correctly as "These functions temporarily disable or reenable listening for events for new connections." But I think that's beside the point, because the socket API doesn't provide for what you want to do (at least on linux). As you pointed out, you'll probably need to do some kind of packet filtering first.
If you're going down the road of packet filtering, you may not need any change to envoy. You can add a rule to drop all SYNs, wait until connections close or you've waited as long as you want to, then kill envoy.
@mattklein123 front-end health check failing doesn't really work with TcpProxy, does it?
But in the absence of `/healthcheck/fail` (like for TcpProxy), you need some way external to Envoy to drain traffic. iptables is one such way.
> @mattklein123 front-end health check failing doesn't really work with TcpProxy, does it?
In every deployment I've ever done it does. Basically, HTTP health check can be used to fail an Envoy out of some rotation, whether that be a load balancer, DNS, BGP, etc. Then after some period of time the process can be terminated/restarted/etc. I've just never seen a need for this type of control so it's kind of surprising to me.
Yeah, I see what you mean. I meant specifically `/healthcheck/fail`. But yes, there must be some other mechanism for doing the same thing. I think you and I are in agreement here.
I'm having trouble following this thread starting from this comment. Up to that comment, the conversation has been around making the current Linux socket API drain in-flight connections (i.e. ones that have finished the 3-way handshake but aren't accepted yet by the application). @mattklein123 can you explain what you mean by this comment and this one?
@jvshahid My understanding from the conversation is that this feature request (OP) cannot be achieved in a non-racy (i.e. correct) way using sockets API alone.
However, as you said on our call earlier, the race condition may be unlikely enough that it meets our SLOs, especially when combined with our service discovery deregistration mechanism (NATS deregistration message to Cloud Foundry ingress router). End-users would see failures only when the NATS deregistration message is lost and when this TCP race condition is hit.
For that reason, I think we'd still find this feature valuable.
The alternative would be for us to do active healthchecking from the ingress router to the backend Envoy. That would be a lot of work in gorouter (and elsewhere in cloud foundry), and I'd have concerns about how it scales when a single ingress router is proxying to >10^5 backends (as we have in large Cloud Foundry deployments).
But I would also like to better understand the health-checking approach that @mattklein123 and @ggreenway seem to be discussing, in light of the scale targets we have in mind for CF.
    ------------------          ----------------------------------
    | ingress router |  ---->   | sidecar Envoy ---> application |
    ------------------          ----------------------------------
        N of these                          M of these
Our setup is N ingress proxies, each forwarding to M sidecar'd application instances. We constrain the drain duration to T seconds. With active health-checking from each ingress to each sidecar, that means each ingress router needs to average M/T health-checks per-second.
A large Cloud Foundry deployment has N = 20 and M = 200,000, with T = 10 seconds.
So each ingress router is averaging 20k health-checks per second.
Is this realistic? Are we thinking about this wrong?
cc @emalm @shalako
@rosenhouse without knowing all the details of your architecture it's hard to comment, but the way I would likely approach this would be to utilize a workflow in which you:
1) Know that an application is being destroyed (however you would invoke the close-listener functionality).
2) Indicate via EDS response that the backend in question is not healthy, so no new requests/connections will be sent to it.
3) Wait some period of time.
4) Kill the application/remove from EDS.
~~Thanks, this sounds like what we already do in Cloud Foundry.~~
EDIT: We don't quite do this in Cloud Foundry
Coming back to this ticket (sorry for the long pause), I think we want to take approach 3, the `GET` request to `/healthcheck/fail`, out of consideration for this behavior change.
While cases 1 and 2 strongly suggest closing most listening sockets immediately, which I'll get into in a follow-up post, case 3 doesn't fit this pattern.
If we consider the Note behind this documented feature, the listening endpoint would no longer respond with the healthcheck failure status header that it promises to deliver; see:
[Edit to add] the endpoint is further unable to respond to the ok filter ("The /healthcheck/ok admin endpoint reverses this behavior."). Given these contradictions, the entire healthcheck facility should be considered "advisory" and have no effect on polling the listener.
My follow-ups will focus on the first two cases, removing a configured listener or taking the process out of service.
I came here from #7841.
I built an internal tool at my company for handling graceful Envoy shutdown along with hot restarts. It was written in Rust and does the following:
- `/healthcheck/fail` to cause Envoy to fail healthchecks. Configure an Envoy health check endpoint in your Envoy configuration that your load-balancer will use to determine Envoy health.
- `SIGTERM` to all child processes with a kill timeout: if processes don't exit within the kill timeout, `SIGKILL` and exit 1.

We provide the following knobs:

- how long to wait after `SIGTERM` before sending `SIGKILL`.

We're using `systemd`, so we set `TimeoutStopSec=` to the maximum sum of all periods to give the service ample time to shut down and drain.
Do note that we are not doing anything with TCP, rather we are relying on load-balancer health checks to ensure draining occurs. We set the eviction timeout to the maximum allowed request duration, but this could be set more aggressively.
It would be possible for us to add a hook into this system in order to use `iptables` to stop accepting new TCP connections, but we aren't doing that presently.
I'm not sure if I will be able to open source this effort, but wanted to provide some breadcrumbs for others if they need a solution to this.
We're implementing lame duck mode for our project, so I may be taking this issue on. Not sure yet though, still scoping.
Given that Envoy sets REUSE_PORT by default now, it seems there is another reason to stop accepting connections in draining listeners: this would help shift traffic to another instance of envoy bound to the same port. We're thinking specifically of outbound connections that sidecar Envoy intercepts (hence, client "lame duck mode" does not help). We're aware of the inherent race in Linux TCP accept queue implementation as @ggreenway mentioned.
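To make that concrete, here is a small sketch of the SO_REUSEPORT behavior being relied on (plain C sockets, assuming Linux >= 3.9; the port 10000 and the "draining/surviving instance" naming are my own illustrative choices, not Envoy code): while both sockets are open the kernel spreads new connections across them, and closing the draining instance's listening socket steers all new connections to the survivor.

```c
/* Illustrative sketch: two listeners share a port via SO_REUSEPORT, as two
 * Envoy instances bound to the same port would. Closing one listening socket
 * shifts all new connections to the other. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

static int make_listener(unsigned short port) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;
    setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));
    listen(fd, 128);
    return fd;
}

int main(void) {
    /* Two "Envoy instances" listening on the same (arbitrary) port 10000. */
    int draining_instance = make_listener(10000);
    int surviving_instance = make_listener(10000);

    /* If the draining instance merely stops calling accept(), the kernel can
     * still hand it new connections, which pile up in its backlog. Closing
     * its listening socket instead makes the kernel steer every new
     * connection to the surviving instance. */
    close(draining_instance);

    pause();                  /* keep the surviving listener around */
    close(surviving_instance);
    return 0;
}
```

The backlog caveat from earlier in the thread still applies: anything already queued on the closed socket's backlog gets reset.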
I stumbled across this feature request while working on the shutdown behavior of Envoy used as a sidecar proxy for ingress traffic, and was surprised to not see any mention of the `/drain_listeners?graceful` operation. It has worked wonders for me, so I'll document the setup for feedback/comments.
We have gRPC service endpoints fronted by Envoy sidecars running on Kubernetes. During application shutdown, we somewhat control the sequence of POSIX signals sent to both Envoy and gRPC but are bound to terminate within a 30-second window. The current implementation uses active health checks and the `/healthcheck/fail` API to make clients evict the endpoint, but this introduces challenges given the number of clients we have (who all perform active health checks towards all the endpoints they connect to). When using the `/drain_listeners?graceful` call, Envoy does continue to listen on its socket but starts sending GOAWAY frames on every active connection while continuing to serve traffic. After the drain time period (configured on startup with `--drain-time-s`), Envoy closes the listening socket but keeps serving traffic on active connections. By setting a short `--drain-time-s` (e.g. 0), Envoy effectively starts a draining sequence that prevents new connections from being accepted while allowing current connections enough time to drain. After the application drain time period has expired, a SIGTERM can be sent to Envoy. All active connections are then reset, but since we are past the grace period, that is the expected behavior of the system.
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.
Is this still applicable given native sidecar support in Kubernetes?
update
We've edited this issue to be less prescriptive about the solution. It now presents 3 possible solutions that we can see.
summary
Given I've configured Envoy with LDS serving a TCP proxy listener on some port, and there are connections in flight, I would like a way to refuse subsequent TCP connections to that port while allowing currently established connections to drain.
We tried the following approaches, but none of them achieve our goals:

1. remove the listener (via an LDS update)
2. `SIGTERM`
3. `GET` request to `/healthcheck/fail`
steps to reproduce
1. write a `bootstrap.yaml` like:
2. write an `lds-current.yaml` file like:
3. launch envoy (I'm using v1.6.0)
4. confirm that the TCP proxy is working
Possible approach 1: remove the listener
update the LDS to return an empty set of listeners. this is a two-step process. first, write an empty LDS response file `lds-empty.yaml`:
second, move that file on top of the file being watched:
in the Envoy stdout logs you'll see a line
attempt to connect to the port where the listener used to be:
expected behavior
Would like to see all new TCP connections be refused immediately, as if a listener had never been added in the first place. Existing TCP connections should continue to be serviced.
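As a point of reference for the expected semantics, here is a small sketch in plain C sockets (assuming Linux; not Envoy code, and port 9090 plus the "ping" payload are arbitrary) showing that the socket API itself supports this: closing only the listening fd makes new connect() attempts fail with connection refused, while a connection that was already accepted keeps passing data.

```c
/* Hedged sketch of the requested behavior: close the listening socket,
 * keep serving already-accepted connections. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(9090);                  /* arbitrary port */
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;
    setsockopt(lfd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
    bind(lfd, (struct sockaddr *)&addr, sizeof(addr));
    listen(lfd, 128);

    /* An established connection that we want to drain gracefully. */
    int c1 = socket(AF_INET, SOCK_STREAM, 0);
    connect(c1, (struct sockaddr *)&addr, sizeof(addr));
    int s1 = accept(lfd, NULL, NULL);

    /* Begin "draining": close only the listening socket. */
    close(lfd);

    /* New connections are now refused immediately ... */
    int c2 = socket(AF_INET, SOCK_STREAM, 0);
    if (connect(c2, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        perror("new connect (expected: Connection refused)");

    /* ... but the already-accepted connection still passes data. */
    write(c1, "ping", 4);
    char buf[8];
    ssize_t n = read(s1, buf, sizeof(buf));
    printf("draining connection still readable: %zd bytes\n", n);

    close(c1); close(s1); close(c2);
    return 0;
}
```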
actual behavior
- the port is still open, even after the LDS update occurs
- clients can connect to the port, but the TCP proxying seems to hang (can't tell where)
- this state remains until `--drain-time-s` has elapsed (30 seconds in this example); at that point the port is finally closed, so you see:
Possible approach 2: `pkill -SIGTERM envoy`
If, instead of removing the listeners, we signal Envoy with `SIGTERM`, Envoy exits immediately without allowing current connections to drain.
EDITED to remove incorrect bit about listeners staying open after SIGTERM.
Possible approach 3: admin healthcheck fail
We could instead `GET /healthcheck/fail` to trigger this behavior. As above, we would expect that new TCP connections should be refused while existing TCP connections are serviced.
background
In Cloud Foundry, we have the following setup currently:
Each application instance has a sidecar Envoy which terminates TLS connections from the shared ingress router. Applications may not speak HTTP, so we use basic TCP connectivity checks from the shared Router to the Envoy in order to infer application health and determine if a client connection should be load-balanced to that Envoy. When the upstream Envoy accepts the TCP connection, the Router considers that upstream healthy. When the upstream refuses the TCP connection, the Router considers that upstream unhealthy.
During a graceful shutdown, the scheduler ought to be able to drain the Envoy before terminating the application. This would mean that the Envoy ought to service any in-flight TCP connections without accepting any new ones.
acknowledgements
h/t @jvshahid and @emalm for investigation and edits