envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0

How to make sure legacy WebSocket connections keep working when listeners are updated without using Envoy hot restart? #35259

Closed wufanqqfsc closed 1 week ago

wufanqqfsc commented 1 month ago

Title: How to make sure legacy WebSocket connections keep working when listeners are updated without using Envoy hot restart?

Description:

We use file-system-based LDS for dynamic resource updates, and Envoy also acts as our WebSocket proxy. When an LDS change happens (IP, socket options, or TLS config), the existing listener is drained and a new listener is created. WebSocket connections on the old listener are broken during this update. Is there any method or solution that lets existing connections switch smoothly to the new listener?
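For context on the file-based setup described above: Envoy's filesystem xDS watcher expects the watched file to be replaced by an atomic move rather than edited in place. Below is a minimal sketch of such an update, assuming the LDS file is JSON; the function name `write_lds_atomically` and the payload layout are illustrative, not taken from the reporter's setup.

```python
import json
import os
import tempfile


def write_lds_atomically(lds_path: str, resources: dict) -> None:
    """Replace the LDS file via rename so a watcher never sees a partial write."""
    directory = os.path.dirname(os.path.abspath(lds_path))
    # The temp file lives in the same directory so the rename stays on one
    # filesystem, which is what makes os.replace atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(resources, f, indent=2)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, lds_path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Note that this only controls how the update is delivered. Whether the listener is updated in place or drained still depends on which fields changed.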

KBaichoo commented 1 month ago

AFAIK, LDS will update in place for some filter chain changes; otherwise we drain the existing listener, which drains the existing WebSocket connections, as you've seen. AFAIK there's no other mechanism to get around this.

wufanqqfsc commented 1 month ago

@KBaichoo if an Envoy listener can't do this, how does Envoy handle legacy and new connections and traffic smoothly during a control plane configuration update?

KBaichoo commented 1 month ago

See https://github.com/envoyproxy/envoy/blob/1abf5e106fd15d7636e306b02c08ca55ec4bbd27/source/common/listener_manager/listener_manager_impl.cc#L800 for how the in-place filter chain update works, and look at its callers to see the conditions under which it applies.

I don't think it's a good idea to expand those criteria to other fields such as IP, socket options, etc.

See also https://www.envoyproxy.io/docs/envoy/latest/operations/cli#cmdoption-drain-time-s if you want to increase your drain timeout so that draining WS connections live longer.
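As a rough pre-check before pushing an LDS update, the idea of a "filter-chain-only change" can be approximated by comparing the old and new listener configs with the filter chain fields removed. This is only a sketch under assumptions: the ignored field list below (`filter_chains`, `default_filter_chain`) and the helper name are my guess at the spirit of the check, and Envoy applies further conditions, so the linked source remains authoritative.

```python
import copy

# Assumption: fields ignored when deciding whether an update can be applied
# in place. Verify against the linked listener_manager_impl.cc.
FILTER_CHAIN_FIELDS = ("filter_chains", "default_filter_chain")


def is_filter_chain_only_change(old: dict, new: dict) -> bool:
    """Return True if the two listener configs differ only in filter chain fields."""

    def strip(listener: dict) -> dict:
        stripped = copy.deepcopy(listener)
        for field in FILTER_CHAIN_FIELDS:
            stripped.pop(field, None)
        return stripped

    return strip(old) == strip(new)
```

If this returns False (for example because the address or socket options changed), expect the old listener to be drained and plan the drain timeout accordingly.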

wufanqqfsc commented 1 month ago

@KBaichoo what happens if drain-time is set to -1? It seems the old listener version then no longer drains and the old connections remain usable, while the new listener is also bound to the workers.

So, after all the legacy connections in the old listener's filter chain are closed, will the old listener version continue draining or not?

KBaichoo commented 1 month ago

I think it'll set the value to uint32_t::max which will effectively disable draining.
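To put that in perspective: if the value really does end up as the maximum of a 32-bit unsigned integer (not verified here), the drain window in seconds is so long that the timer never fires in practice. A quick arithmetic sketch:

```python
UINT32_MAX = 2**32 - 1  # 4294967295
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def seconds_to_years(seconds: int) -> float:
    """Convert a duration in seconds to (Julian) years."""
    return seconds / SECONDS_PER_YEAR


# A drain time of UINT32_MAX seconds is well over a century.
years = seconds_to_years(UINT32_MAX)
print(f"{UINT32_MAX} s is about {years:.1f} years")
```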

wufanqqfsc commented 1 month ago

Yes, we have done some tests. uint32_t::max or a large value may work here, but our concern is what happens to the draining listener objects if we update listener resources many times. Is there a memory leak risk, since these objects may never be destroyed if the drain timer never fires?

KBaichoo commented 1 month ago

> Is there a memory leak risk, since these objects may never be destroyed if the drain timer never fires?

I'd think so, since you're preventing cleanup. You should measure it yourself to see whether it's appropriate for your use case. It's a tradeoff between the drain timeout and how long resource cleanup is delayed. A drain timeout of 1h, 3h, 6h, 12h, or 24h might be sufficient, as opposed to "never drain".
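One way to do that measurement is to sample Envoy's admin `/stats` output over repeated listener updates and watch the draining listener count alongside memory. Below is a minimal parsing sketch. It assumes the plain-text `name: value` stats format, and the stat names `listener_manager.total_listeners_draining` and `server.memory_allocated` should be confirmed against your Envoy version. Fetching the text (for example from an admin address such as `127.0.0.1:9901`, which is an assumption) is left out.

```python
# Assumed stat names; confirm them in your own /stats output.
WATCHED = (
    "listener_manager.total_listeners_draining",
    "server.memory_allocated",
)


def parse_stats(text: str) -> dict:
    """Parse 'name: value' lines, keeping only plain integer values."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(": ")
        value = value.strip()
        # Histogram lines and other non-integer values are skipped.
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


def draining_snapshot(text: str) -> dict:
    """Pick out the stats relevant to listener draining; missing ones map to None."""
    stats = parse_stats(text)
    return {name: stats.get(name) for name in WATCHED}
```

Taking a snapshot after each LDS update and comparing the values over time should show whether draining listeners accumulate and whether memory grows with them for a given `--drain-time-s`.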
