cc @rshriram
Yeah I think we can probably add listener drain configuration to better determine what happens when a listener enters drain mode. E.g., one option could be to just start refusing connections immediately when draining starts.
In Approach 2 above, typically a different signal is used for a graceful stop, such as SIGHUP. SIGTERM's well-known behavior should not be altered.
Note also, in the observations for approach 2 above, that keeping the listener alive after SIGTERM is clearly erroneous, a bug in its own right; that is unrelated to the desire for TCP drain functionality.
That may be a formatting oversight in the issue. I'm pretty sure Envoy just quits on SIGTERM. Let me fix that in the issue description.
Now, does Envoy support graceful shutdown?
To sum up my research to date: any of the three pathways above ought to work, but none will, due to the underlying libevent listener implementation in 2.1.8, released Feb of '17.
It appears deliberate that disabling a listener simply removes the socket from the poll queue, collecting connections in the backlog until the listener is re-enabled.
Work on the listener implementation has been merged to the 2.2.0 dev master, so I am working on a minimal repro case to discuss as a libevent ticket, solicit workarounds effective for 2.0/2.1/2.2, and determine whether this represents a defect or an enhancement.
Can you elaborate on what issues you're having with libevent?
I'm working on implementing something similar right now. I'm trying to solve a slightly different problem, but it has a lot of overlap. I do have Envoy closing its listener, though (as reported by netstat).
See #3307. Would that resolve this issue for you?
@ggreenway Yes, thank you.
In the future, we may want a means for certain listeners to opt-out of this behavior. For example, in a sidecar proxy scenario, during the shutdown & drain time, we'd want to close the ingress listeners but leave any egress listeners open until the very end. But we can wait on that enhancement. For now, #3307 would be great.
@ggreenway @rosenhouse note that we already support selective per-listener drain options via https://github.com/envoyproxy/envoy/blob/master/api/envoy/api/v2/lds.proto#L121 (and use it for same reasons at Lyft). @ggreenway please also consider ^ when you are looking into the final solution.
@rosenhouse For clarity, in your ideal scenario, can you write a timeline (or ordered list) of when ingress and egress listeners would be drained and closed?
@emalm does that look right?
Thanks
Thanks, @rosenhouse, that sequence looks right to me!
Not sure if this is important to you or not, but 1 caveat to keep in mind: if you have no way to tell your clients to stop connecting before you close listeners (to cause TCP clients to see connection refused), the listener may have some connections in the listen() backlog when the listener is closed. There is no race-free way to avoid this with the socket API. Those connections will get RST, but they may have already sent some request data.
@ggreenway interesting, I wasn't aware of that. Do you have any references you could point to that describe this?
The documentation is scarce, and results may be OS-dependent. I did testing on linux, using a simple test client and server application. The server calls listen() but never calls accept(). The client connects (using blocking socket calls), then sends some bytes. tcpdump shows the client SYN, server SYN/ACK, client ACK, and the client sending data, and the server ACK'ing the data.
The reason is that, at least on Linux, accept() will always return a fully-established connection socket, not a half-open socket. So to meet that requirement, the server OS must tell the client that it is connected. Once it is connected, the client can send at least one TCP window's worth of data.
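For anyone who wants to see this for themselves, here is a minimal sketch of the test described above (plain C sockets, assuming Linux and loopback; the port number and payload are my own illustrative choices, not anything from Envoy): the server listens but never accepts, yet the client's connect() and write() both succeed.

```c
/* Illustrative only: a client completes the handshake and sends data even
 * though the server never calls accept(); the kernel finishes the handshake
 * and queues the connection in the listen() backlog. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(9090);                  /* arbitrary port */
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

    /* "Server": listen() but never accept(). */
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;
    setsockopt(lfd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
    bind(lfd, (struct sockaddr *)&addr, sizeof(addr));
    listen(lfd, 128);

    /* "Client": connect() succeeds and write() is ACKed by the kernel,
     * even though no application ever accepted the connection. */
    int cfd = socket(AF_INET, SOCK_STREAM, 0);
    if (connect(cfd, (struct sockaddr *)&addr, sizeof(addr)) == 0) {
        ssize_t n = write(cfd, "GET / HTTP/1.1\r\n", 16);
        printf("client connected and wrote %zd bytes with no accept()\n", n);
    }
    close(cfd);
    close(lfd);
    return 0;
}
```

If you then close the listening socket, that queued-but-never-accepted connection is the one that gets RST, which is the race being described.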
@ggreenway I see we've been mulling the same issue.
I'm also convinced libevent is wrong; from http://www.wangafu.net/~nickm/libevent-book/Ref8_listener.html, "Enabling and disabling an evconnlistener": `int evconnlistener_disable(struct evconnlistener *lev);` `int evconnlistener_enable(struct evconnlistener *lev);` "These functions temporarily disable or reenable listening for new connections."
They do no such thing, of course; the listener remains open, accepting connections, while the event loop stops accepting them during the disabled period.
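A minimal sketch of that behavior (assuming libevent 2.x on Linux; the port and variable names are my own, purely illustrative, not Envoy code): after `evconnlistener_disable()` the fd is simply no longer polled, but the socket created by `evconnlistener_new_bind()` is still bound and listening, so the kernel keeps completing handshakes into the backlog.

```c
/* Hypothetical repro sketch showing that evconnlistener_disable() only stops
 * the event loop from accepting; the listening socket itself stays open. */
#include <arpa/inet.h>
#include <event2/event.h>
#include <event2/listener.h>
#include <event2/util.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void accept_cb(struct evconnlistener *lev, evutil_socket_t fd,
                      struct sockaddr *addr, int socklen, void *ctx) {
    /* Never reached once the listener is disabled. */
    printf("accepted fd %d\n", (int)fd);
    evutil_closesocket(fd);
}

int main(void) {
    struct event_base *base = event_base_new();

    struct sockaddr_in sin;
    memset(&sin, 0, sizeof(sin));
    sin.sin_family = AF_INET;
    sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    sin.sin_port = htons(9090);                /* arbitrary port */

    struct evconnlistener *lev = evconnlistener_new_bind(
        base, accept_cb, NULL,
        LEV_OPT_CLOSE_ON_FREE | LEV_OPT_REUSEABLE, -1,
        (struct sockaddr *)&sin, sizeof(sin));

    /* "Drain": remove the listener from the poll set. The socket is still
     * bound and listening, so the kernel keeps completing handshakes and
     * queues them in the backlog; clients never see a refusal. */
    evconnlistener_disable(lev);

    sleep(60);   /* observe with netstat / nc while this sleeps */

    /* Only this actually closes the listening socket and refuses clients. */
    evconnlistener_free(lev);
    event_base_free(base);
    return 0;
}
```

While the sketch sleeps, the port still shows as LISTEN and new clients can complete connect(); only freeing the listener (closing the fd) makes them see a refusal.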
@rosenhouse Greg's answer above is good; James Briggs calls out how this can be accomplished with iptables: http://www.jebriggs.com/blog/2016/10/linux-graceful-service-shutdown-techniques/
Stevens' Unix Network Programming, vol. 1, section 4.5 goes into quite a bit of detail across the variety of network stack implementations, and section 30.7 covers the two queue lengths. Linux does not implement changing the backlog queue length to 0 (queueing incoming SYNs to later be dropped).
I'm beginning to look at whether packet filtering might help us here, giving us the choice of either dropping the SYN on the floor or answering the SYN with an RST.
@wrowe technically I understand why all of this is being discussed, but practically, it's really hard for me to understand why this makes operational sense for any deployment. This is exactly why Envoy supports health check draining. In what situation are you not able to drain traffic via failing health checks of some type?
I understand your point about libevent. I imagine that was written from the point of view of a library that provides events for things (and also happens to manage sockets sometimes). I think it could be rephrased more correctly as "These functions temporarily disable or reenable listening for events for new connections." But I think that's beside the point, because the socket API doesn't provide for what you want to do (at least on linux). As you pointed out, you'll probably need to do some kind of packet filtering first.
If you're going down the road of packet filtering, you may not need any change to envoy. You can add a rule to drop all SYNs, wait until connections close or you've waited as long as you want to, then kill envoy.
@mattklein123 front-end health check failing doesn't really work with TcpProxy, does it?
But in the absence of `/healthcheck/fail` (like for TcpProxy), you need some way external to Envoy to drain traffic. iptables is one such way.
> @mattklein123 front-end health check failing doesn't really work with TcpProxy, does it?
In every deployment I've ever done it does. Basically, HTTP health check can be used to fail an Envoy out of some rotation, whether that be a load balancer, DNS, BGP, etc. Then after some period of time the process can be terminated/restarted/etc. I've just never seen a need for this type of control so it's kind of surprising to me.
Yeah, I see what you mean. I meant specifically `/healthcheck/fail`. But yes, there must be some other mechanism for doing the same thing. I think you and I are in agreement here.
I'm having trouble following this thread starting from this comment. Up to that comment, the conversation has been around making the current Linux socket API drain in-flight connections (i.e. ones that have finished the 3-way handshake but aren't accepted yet by the application). @mattklein123 can you explain what you mean by this comment and this one?
@jvshahid My understanding from the conversation is that this feature request (OP) cannot be achieved in a non-racy (i.e. correct) way using sockets API alone.
However, as you said on our call earlier, the race condition may be unlikely enough that it meets our SLOs, especially when combined with our service discovery deregistration mechanism (NATS deregistration message to Cloud Foundry ingress router). End-users would see failures only when the NATS deregistration message is lost and when this TCP race condition is hit.
For that reason, I think we'd still find this feature valuable.
The alternative would be for us to do active healthchecking from the ingress router to the backend Envoy. That would be a lot of work in gorouter (and elsewhere in cloud foundry), and I'd have concerns about how it scales when a single ingress router is proxying to >10^5 backends (as we have in large Cloud Foundry deployments).
But I would also like to better understand the health-checking approach that @mattklein123 and @ggreenway seem to be discussing, in light of the scale targets we have in mind for CF.
    ------------------          ----------------------------------
    | ingress router |  ---->   | sidecar Envoy ---> application |
    ------------------          ----------------------------------
        N of these                          M of these
Our setup is N ingress proxies, each forwarding to M sidecar'd application instances. We constrain the drain duration to T seconds. With active health-checking from each ingress to each sidecar, that means each ingress router needs to average M/T health-checks per-second.
A large Cloud Foundry deployment has N = 20 and M = 200,000, with T = 10 seconds.
So each ingress router is averaging 20k health-checks per second.
Is this realistic? Are we thinking about this wrong?
cc @emalm @shalako
@rosenhouse without knowing all the details of your architecture it's hard to comment, but the way I would likely approach this would be to utilize a workflow in which you:
1) Know that an application is being destroyed (however you would invoke the close-listener functionality).
2) Indicate via EDS response that the backend in question is not healthy, so no new requests/connections will be sent to it.
3) Wait some period of time.
4) Kill the application/remove from EDS.
~~Thanks, this sounds like what we already do in Cloud Foundry.~~
EDIT: We don't quite do this in Cloud Foundry
Coming back to this ticket (sorry for the long pause), I think we want to take approach 3, the `GET` request to `/healthcheck/fail`, out of consideration for this behavior change.
While cases 1 and 2 strongly suggest closing most listening sockets immediately, which I'll get into in a follow-up post, case 3 doesn't fit this pattern.
If we consider the Note behind this documented feature, the listening endpoint would no longer respond with the healthcheck failure status header that it promises to deliver; see:
[Edit to add] the endpoint is further unable to respond to the ok filter ("The /healthcheck/ok admin endpoint reverses this behavior."). Given these contradictions, the entire healthcheck facility should be considered "advisory" and have no effect on polling the listener.
My follow-ups will focus on the first two cases, removing a configured listener or taking the process out of service.
I came here from #7841.
I built an internal tool at my company for handling graceful Envoy shutdown along with hot restarts. It was written in Rust and does the following:
- `/healthcheck/fail` to cause Envoy to fail healthchecks. Configure an Envoy health check endpoint in your Envoy configuration that your load-balancer will use to determine Envoy health.
- `SIGTERM` to all child processes with a kill timeout: if processes don't exit within the kill timeout, `SIGKILL` and exit 1.

We provide the following knobs:

- how long to wait after `SIGTERM` before sending `SIGKILL`.

We're using `systemd`, so we set `TimeoutStopSec=` to the maximum sum of all periods to give the service ample time to shut down and drain.
Do note that we are not doing anything with TCP, rather we are relying on load-balancer health checks to ensure draining occurs. We set the eviction timeout to the maximum allowed request duration, but this could be set more aggressively.
It would be possible for us to add a hook into this system in order to use `iptables` to stop accepting new TCP connections, but we aren't doing that presently.
I'm not sure if I will be able to open source this effort, but wanted to provide some breadcrumbs for others if they need a solution to this.
We're implementing lame duck mode for our project, so I may be taking this issue on. Not sure yet though, still scoping.
Given that Envoy sets REUSE_PORT by default now, it seems there is another reason to stop accepting connections in draining listeners: this would help shift traffic to another instance of envoy bound to the same port. We're thinking specifically of outbound connections that sidecar Envoy intercepts (hence, client "lame duck mode" does not help). We're aware of the inherent race in Linux TCP accept queue implementation as @ggreenway mentioned.
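To make that concrete, here is a small sketch of the SO_REUSEPORT behavior being relied on (plain C sockets, assuming Linux >= 3.9; the port 10000 and the "draining/surviving instance" naming are my own illustrative choices, not Envoy code): while both sockets are open the kernel spreads new connections across them, and closing the draining instance's listening socket steers all new connections to the survivor.

```c
/* Illustrative sketch: two listeners share a port via SO_REUSEPORT, as two
 * Envoy instances bound to the same port would. Closing one listening socket
 * shifts all new connections to the other. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

static int make_listener(unsigned short port) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;
    setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));
    listen(fd, 128);
    return fd;
}

int main(void) {
    /* Two "Envoy instances" listening on the same (arbitrary) port 10000. */
    int draining_instance = make_listener(10000);
    int surviving_instance = make_listener(10000);

    /* If the draining instance merely stops calling accept(), the kernel can
     * still hand it new connections, which pile up in its backlog. Closing
     * its listening socket instead makes the kernel steer every new
     * connection to the surviving instance. */
    close(draining_instance);

    pause();                  /* keep the surviving listener around */
    close(surviving_instance);
    return 0;
}
```

The backlog caveat from earlier in the thread still applies: anything already queued on the closed socket's backlog gets reset.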
I stumbled across this feature request while working on the shutdown behavior of Envoy used as a sidecar proxy for ingress traffic, and was surprised to not see any mention of the `/drain_listeners?graceful` operation. It has worked wonders for me, so I'll document the setup for feedback/comments.
We have gRPC service endpoints fronted by Envoy sidecars running on Kubernetes. During application shutdown, we somewhat control the sequence of POSIX signals sent to both Envoy and gRPC but are bound to terminate within a 30-second window. The current implementation uses active health checks and the `/healthcheck/fail` API to make clients evict the endpoint, but this introduces challenges given the number of clients we have (who all perform active health checks towards all the endpoints they connect to). When using the `/drain_listeners?graceful` call, Envoy does continue to listen on its socket but starts sending GOAWAY frames on every active connection while continuing to serve traffic. After the drain time period (configured on startup with `--drain-time-s`), Envoy closes the listening socket but keeps serving traffic on active connections. By setting a short `--drain-time-s` (e.g. 0), Envoy effectively starts a draining sequence that prevents new connections from being accepted while allowing current connections enough time to drain. After the application drain time period has expired, a SIGTERM can be sent to Envoy. All active connections are then reset, but since we are past the grace period, that is the expected behavior of the system.
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.
Is this still applicable given native sidecar support in Kubernetes?
update
We've edited this issue to be less prescriptive about the solution. It now presents 3 possible solutions that we can see.
summary
Given I've configured Envoy with LDS serving a TCP proxy listener on some port, and there are connections in flight, I would like a way to refuse subsequent TCP connections to that port while allowing currently established connections to drain.
We tried the following approaches, but none of them achieve our goals:

1. remove the listener (via an LDS update)
2. `SIGTERM`
3. `GET` request to `/healthcheck/fail`
steps to reproduce
1. write a `bootstrap.yaml` like:
2. write an `lds-current.yaml` file like:
3. launch envoy (I'm using v1.6.0)
4. confirm that the TCP proxy is working
Possible approach 1: remove the listener
update the LDS to return an empty set of listeners. this is a two-step process. first, write an empty LDS response file `lds-empty.yaml`:
second, move that file on top of the file being watched:
in the Envoy stdout logs you'll see a line
attempt to connect to the port where the listener used to be:
expected behavior
Would like to see all new TCP connections be refused immediately, as if a listener had never been added in the first place. Existing TCP connections should continue to be serviced.
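As a point of reference for the expected semantics, here is a small sketch in plain C sockets (assuming Linux; not Envoy code, and port 9090 plus the "ping" payload are arbitrary) showing that the socket API itself supports this: closing only the listening fd makes new connect() attempts fail with connection refused, while a connection that was already accepted keeps passing data.

```c
/* Hedged sketch of the requested behavior: close the listening socket,
 * keep serving already-accepted connections. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(9090);                  /* arbitrary port */
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;
    setsockopt(lfd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
    bind(lfd, (struct sockaddr *)&addr, sizeof(addr));
    listen(lfd, 128);

    /* An established connection that we want to drain gracefully. */
    int c1 = socket(AF_INET, SOCK_STREAM, 0);
    connect(c1, (struct sockaddr *)&addr, sizeof(addr));
    int s1 = accept(lfd, NULL, NULL);

    /* Begin "draining": close only the listening socket. */
    close(lfd);

    /* New connections are now refused immediately ... */
    int c2 = socket(AF_INET, SOCK_STREAM, 0);
    if (connect(c2, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        perror("new connect (expected: Connection refused)");

    /* ... but the already-accepted connection still passes data. */
    write(c1, "ping", 4);
    char buf[8];
    ssize_t n = read(s1, buf, sizeof(buf));
    printf("draining connection still readable: %zd bytes\n", n);

    close(c1); close(s1); close(c2);
    return 0;
}
```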
actual behavior
- the port is still open, even after the LDS update occurs
- clients can connect to the port, but the TCP proxying seems to hang (can't tell where)
- this state remains until `--drain-time-s` has elapsed (30 seconds in this example); at that point the port is finally closed, so you see:
Possible approach 2: `pkill -SIGTERM envoy`
If, instead of removing the listeners, we signal Envoy with `SIGTERM`, Envoy exits immediately without allowing current connections to drain.
EDITED to remove incorrect bit about listeners staying open after SIGTERM.
Possible approach 3: admin healthcheck fail
We could instead `GET /healthcheck/fail` to trigger this behavior. As above, we would expect that new TCP connections should be refused while existing TCP connections are serviced.
background
In Cloud Foundry, we have the following setup currently:
Each application instance has a sidecar Envoy which terminates TLS connections from the shared ingress router. Applications may not speak HTTP, so we use basic TCP connectivity checks from the shared Router to the Envoy in order to infer application health and determine if a client connection should be load-balanced to that Envoy. When the upstream Envoy accepts the TCP connection, the Router considers that upstream healthy. When the upstream refuses the TCP connection, the Router considers that upstream unhealthy.
During a graceful shutdown, the scheduler ought to be able to drain the Envoy before terminating the application. This would mean that the Envoy ought to service any in-flight TCP connections without accepting any new ones.
acknowledgements
h/t @jvshahid and @emalm for investigation and edits