envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0

Consider default TCP keep alive for xDS connections #6323

Open mattklein123 opened 5 years ago

mattklein123 commented 5 years ago

Related to https://github.com/envoyproxy/envoy/issues/5173. I've been thinking about this, and I think we should consider having default TCP keep alive settings for xDS connections, with maybe an option to fully disable if it's not wanted.
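For reference, here's a minimal sketch of what a user can already opt into per cluster today via `upstream_connection_options` (the cluster name, endpoint, and timing values are placeholders, not proposed defaults); the idea would be to apply something along these lines to xDS clusters by default:

```yaml
clusters:
- name: xds_cluster            # placeholder name for the cluster serving xDS
  connect_timeout: 1s
  type: STRICT_DNS
  http2_protocol_options: {}   # gRPC xDS needs HTTP/2
  upstream_connection_options:
    # Socket-level TCP keepalive; values are illustrative only.
    tcp_keepalive:
      keepalive_probes: 3      # drop the connection after 3 unanswered probes
      keepalive_time: 300      # seconds of idle time before the first probe (on Linux)
      keepalive_interval: 30   # seconds between probes (on Linux)
  load_assignment:
    cluster_name: xds_cluster
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            socket_address: { address: controlplane.example.com, port_value: 18000 }
```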

I think this issue is going to hit almost everyone, and with defaults we can provide a better out-of-the-box (OOB) experience.

Thoughts @suhailpatel @htuch @fredlas?

htuch commented 5 years ago

Yeah, seems reasonable. I wonder how this relates to gRPC keep-alive, https://github.com/grpc/grpc/blob/master/doc/keepalive.md. You'd want to configure that for the Google client...

sschepens commented 5 years ago

We were affected by this issue; we're about to add gRPC health checks to our control plane clusters.

By the way, this happens on both sides: if a client dies abruptly, the control plane will fail to notice, since the connections don't terminate normally. We fixed this with gRPC keepalives on the control plane, since grpc-java (at least) supports them.
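A rough sketch of that kind of server-side setup with grpc-java's Netty server builder (the port, timing values, and the commented-out service registration are placeholders, not our exact configuration):

```java
import java.util.concurrent.TimeUnit;

import io.grpc.Server;
import io.grpc.netty.NettyServerBuilder;

public class ControlPlaneServer {
  public static void main(String[] args) throws Exception {
    Server server = NettyServerBuilder.forPort(18000)
        // Send HTTP/2 PING frames on idle connections so dead clients get noticed.
        .keepAliveTime(30, TimeUnit.SECONDS)
        // Drop the connection if a PING ack doesn't come back in time.
        .keepAliveTimeout(10, TimeUnit.SECONDS)
        // Allow clients to send their own keepalive pings this often without the
        // server treating them as abusive (GOAWAY "too_many_pings").
        .permitKeepAliveTime(20, TimeUnit.SECONDS)
        .permitKeepAliveWithoutCalls(true)
        // .addService(adsServiceImpl)  // register the actual ADS/xDS service here
        .build()
        .start();
    server.awaitTermination();
  }
}
```

Server-side keepalive only covers the case of clients dying; Envoy-side keepalive would still be needed to detect a management server dying abruptly, which is what this issue is about.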

It would be great to add better out-of-the-box behaviour, and also to add support for gRPC keepalives in Envoy.

suhailpatel commented 5 years ago

I would be in favour of default TCP keep alive for xDS, and I don't think it'd be a breaking change.

gRPC Keep Alives are useful but ideally we’d have this for REST too to provide a consistent view for all xDS connections.

Want me to pick this up? If time permits, I can also look at gRPC Keep Alives as an addition to Envoy.

mattklein123 commented 5 years ago

> gRPC Keep Alives are useful but ideally we’d have this for REST too to provide a consistent view for all xDS connections.

Agreed, I think TCP keep alive should work for all cases, unless the user's environment does not support it. (Though is that even possible given how it's implemented?)

> Want me to pick this up?

That would be fantastic. I think we should probably define some sane defaults for xDS connections, and then allow the user to override them if they desire. This will take a little thought in terms of implementation, since today the cluster and all cluster settings are fully decoupled from the calling code, so we may need to build some mechanism to override a cluster's settings. @suhailpatel, if you do want to implement this, maybe take a look at the code first and we can discuss the implementation if you have questions, before you do too much work?

ramaraochavali commented 5 years ago

I am interested in how this comes together. Can we create a Google doc so we can discuss it there?

mattklein123 commented 5 years ago

@ramaraochavali which parts do you want to discuss in the doc? The settings we will use, the implementation details, or both?

htuch commented 5 years ago

> Agreed, I think TCP keep alive should work for all cases, unless the user's environment does not support it. (Though is that even possible given how it's implemented?)

This is true, but when the user configures the Google gRPC client, these TCP keep alive settings are not used.

ramaraochavali commented 5 years ago

> Which parts do you want to discuss in the doc? The settings we will use, the implementation details, or both?

Both, actually. On the implementation side, for example, how this would work for the Google gRPC client, as @htuch mentioned, plus a very high-level outline of the implementation. If a Google doc is too heavy, I am OK with putting the high-level approach (settings + implementation) in the issue here itself.

mattklein123 commented 5 years ago

Agreed with @ramaraochavali; let's do a writeup, either in a Google doc or here, of what we propose doing. There are a bunch of things to think about.

I'm wondering if we should have some type of config option, "disable_keep_alive", on a config source (so that keep alive defaults to on), and then somehow plumb this into both the Google client and the async client. It's actually possible we could do this with transport socket creation options vs. having to modify the cluster, but I would need to look through all the code. cc @klarose

klarose commented 5 years ago

Is there a simple way to determine whether a cluster is an xDS cluster from its cluster config? If so, we could probably just default the keepalives to true in parseClusterSocketOptions, which returns any options to be used for the upstream connection.

suhailpatel commented 5 years ago

I'll get started on a Google Doc soon (most likely this weekend) and share it around. I was thinking of checking in the xDS layer whether connections have some sort of health check enabled; if not, force one with some defaults, unless health checks are explicitly disabled at the xDS config layer. That was my naive thinking from the last time I looked at the xDS code. I'll have another rummage through the source and propose something.

On a slight tangent, is there any particular reason why the Google gRPC client doesn't do TCP Keep Alives? Whilst the gRPC health check achieves a similar end result, one operates at the transport layer vs. the protocol layer. Would it make sense to add TCP Keep Alive support to the Google gRPC client too? (Possibly not in this issue, to avoid scope creep.)

htuch commented 5 years ago

@suhailpatel I'm not sure what all the knobs are for Google gRPC C core keep-alive configuration; the main thing to keep in mind is that it doesn't use Envoy's TCP connection factories, and hence we don't get any of the tuning options that exist Envoy-side out of the box.

amitkatyal commented 2 years ago

We are using Envoy in a k8s environment, with go-control-plane over Envoy gRPC to configure the proxy. Recently we came across an issue where Envoy's netstat output shows the Envoy-to-control-plane TCP connection as established, but the control plane's netstat output doesn't show the corresponding TCP connection to Envoy. It seems the control plane service was restarted abruptly without terminating the existing gRPC (TCP) connection. In this case, shouldn't the Envoy gRPC client's TCP keepalive kick in after the default TCP idle timeout (i.e. 2 hours) and recover the connection? Is TCP keepalive on the Envoy gRPC client disabled by default, and is the recommendation to enable TCP keepalive under the control plane cluster configuration, as suggested in the Envoy documentation below?

https://www.envoyproxy.io/docs/envoy/latest/api-docs/xds_protocol.html
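For reference, the linked page recommends configuring either TCP keepalive on the control-plane cluster (via `upstream_connection_options.tcp_keepalive`, as sketched earlier in this thread) or HTTP/2 keepalive; neither is on by default. A minimal sketch of the HTTP/2 variant in v3 config, with placeholder cluster name, endpoint, and timing values:

```yaml
clusters:
- name: xds_cluster            # placeholder name for the control-plane cluster
  connect_timeout: 1s
  type: STRICT_DNS
  typed_extension_protocol_options:
    envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
      "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
      explicit_http_config:
        http2_protocol_options:
          # HTTP/2 PING based keepalive.
          connection_keepalive:
            interval: 30s
            timeout: 5s
  load_assignment:
    cluster_name: xds_cluster
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            socket_address: { address: controlplane.example.com, port_value: 18000 }
```

Per that doc, TCP keepalive is cheaper but may be inadequate if there is a TCP proxy between Envoy and the management server, whereas HTTP/2 keepalive is slightly more expensive but can detect failures through more types of intermediate proxies.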