envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0

Proposal: allow h2 ping optionally to reset idle timeouts #6464

Open goaway opened 5 years ago

goaway commented 5 years ago

Title: Allow h2 ping optionally to reset idle timeouts

Description: h2 ping is currently used by gRPC as a keep-alive mechanism. This seems like a useful notion in general, and exposing h2 pings at a higher layer is potentially useful for filters as well.
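
To make the shape of the proposal concrete, here is a hypothetical sketch against today's `HttpConnectionManager` fields; `ping_resets_idle_timeout` is an invented name for illustration, not a real Envoy option:

```yaml
# Hypothetical sketch only. common_http_protocol_options.idle_timeout is a
# real v3 field; the commented-out knob below is invented for illustration.
common_http_protocol_options:
  idle_timeout: 3600s
  # proposed (name invented here): treat received h2 PING frames as
  # connection activity so they reset the idle timeout
  # ping_resets_idle_timeout: true
```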

Relevant Links:

mattklein123 commented 5 years ago

+1 I think we should add this as an optional way to handle streaming keep-alive. This has also come up in the context of https://github.com/envoyproxy/envoy/issues/6323. cc @alyssawilk

ryanrhee commented 5 years ago

~JW - is this different from https://github.com/envoyproxy/envoy/issues/2086 ?~

https://github.com/envoyproxy/envoy/issues/2086#issuecomment-345885469 is insightful. AFAIK, this issue is not about forwarding HTTP/2 PINGs, but rather about allowing those frames to reset the idle timeouts; Envoy would still respond to the PINGs directly itself. Am I understanding this correctly?

Would this apply to both idle_timeout and stream_idle_timeout?
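
For context, both timers live on the `HttpConnectionManager` (field names per today's v3 API; durations below are examples, check the docs for your Envoy version):

```yaml
# Fragment of an HttpConnectionManager config showing both timers.
stream_idle_timeout: 300s   # per-stream: reset by activity on that stream
common_http_protocol_options:
  idle_timeout: 3600s       # per-connection: fires when no streams are
                            # active; h2 PINGs do not currently reset it
```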

mattklein123 commented 4 years ago

cc @ggreenway this is related to our recent discussions. This might be another way of handling your issue in terms of how it might be configured generally for an h2 connection.

vroldanbet commented 7 months ago

Hey folks! 👋🏻 We (@authzed) run fleets of Go services behind Envoy (managed by Contour) and have experienced issues with long-running streaming gRPC APIs. Independently of client-side configured KeepAlives, connections eventually time out, presumably because the idle timeout is not reset by the keepalive pings, as described in this issue.

I'm surprised this issue hasn't gotten more attention, which is what prompted me to comment; it makes me wonder how other folks are handling this.

What's the general guidance for keeping streaming gRPC connections alive for long periods when Envoy sits in the middle? Is having keep-alive pings reset idle timeouts something the Envoy maintainers would welcome as a contribution?

dethi commented 7 months ago

Hey @vroldanbet 👋🏻, we (Arista Networks) have the same kind of architecture. Our telemetry platform uses Envoy as a gateway, with long-running gRPC streams (hundreds of thousands of them) used to notify clients/apps of new updates; some streams may not see any updates for hours or days.

Our solution is to set the idle timeout on the listener to 1d (for trusted traffic) and to completely disable the stream idle timeout on upstream connections. Our clients are expected to retry/reconnect when a stream is reset or disconnected. At the beginning this was not ideal for us and didn't feel as robust as what we had before, when clients connected directly to the server. But two years later, my opinion has changed quite a bit.
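
A minimal sketch of that setup as an `HttpConnectionManager` fragment (v3 field names; the durations mirror the description above rather than any exact production config):

```yaml
common_http_protocol_options:
  idle_timeout: 86400s   # 1d connection idle timeout for trusted traffic
stream_idle_timeout: 0s  # 0s disables the per-stream idle timeout entirely
```

Note that for long-lived gRPC streams the per-route `timeout` in `RouteAction` (15s by default) usually also has to be set to `0s`, or the stream is cut off regardless of the idle timers.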

You want to protect your proxy layer from misbehaving clients (like a client that opens a stream but never reads from it), so I can see why h2 ping frames don't reset the idle timeouts. Also, as soon as you put any kind of cloud-vendor L7 LB in front of your Envoy instances, it's game over anyway, since you have almost no control there. Trying to keep any stream open indefinitely becomes an impossible task...

vroldanbet commented 7 months ago

Hey @dethi thanks for chiming in and sharing your use-case!

Agreed - no connection can last forever. Even if there is no timeout, any component in the chain may churn at any time (network issues, hardware failures, pods rolling in a Kubernetes cluster, a managed LB getting updated).

Disabling the idle timeout feels like a workaround for a gap in the protocol support, but I can see it as a pragmatic solution. And you make a good point: having keep-alive pings reset the idle timeout leads to the same leaking scenario as having no timeout at all, which is my primary concern with not having timeouts.

alyssawilk commented 7 months ago

My inclination is that we want to keep stream and connection idle timeouts separate: there should be a knob to make sure there's been some traffic on the connection (pings, TCP keepalive, whatever) and a knob to make sure there's some progress on the stream (if progress on the stream is expected - it may not be for your situation). Having a connection timeout affect a stream's timeout seems sub-optimal. If the stream isn't expected to send data at any point, why not just disable the stream idle timeout?
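
Sketching those two knobs with fields that already exist in the v3 API (the two fragments below live in different parts of the config, and the route/cluster names are made up):

```yaml
# Connection knob (Http2ProtocolOptions): Envoy sends its own h2 PINGs
# and drops the connection if they go unacknowledged.
http2_protocol_options:
  connection_keepalive:
    interval: 30s   # send a PING after 30s of quiet
    timeout: 10s    # close the connection if no ACK within 10s

# Stream knob (RouteAction): disable the stream idle timeout only for
# routes where no progress is expected.
routes:
- match: { prefix: "/telemetry.Stream/" }   # made-up route
  route:
    cluster: streaming_backend              # made-up cluster
    idle_timeout: 0s  # per-route override of stream_idle_timeout
```

Note that `connection_keepalive` makes Envoy originate pings rather than resetting timers on pings it receives, so it complements rather than resolves what this issue proposes.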