bmcalary-atlassian commented 3 months ago

Current situation

Envoy offers a few existing options around listener draining during hot-restart and process shutdown.

--drain-strategy allows a choice between immediate or gradual (i.e., spread over drain-time-s) draining for hot restarts.
--drain-time-s governs how long graceful draining (sending H2 GOAWAY or H1 Connection: Close) takes once triggered.
POST /drain_listeners?graceful&skip_exit&inboundonly allows an operator to tell Envoy to begin draining listeners while still accepting new connections.
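For illustration, the admin call above could be built like this. It is a minimal sketch: the admin address `127.0.0.1:9901` and the helper name `drain_listeners_request` are assumptions, not something taken from this issue; only the endpoint and its three query flags come from the description above.

```python
from urllib.parse import urlunsplit
from urllib.request import Request

# Assumption: the Envoy admin interface listens on this address.
ADMIN_HOST = "127.0.0.1:9901"


def drain_listeners_request(graceful=True, skip_exit=True,
                            inbound_only=True, host=ADMIN_HOST):
    """Build (but do not send) a POST request for /drain_listeners."""
    flags = []
    if graceful:
        flags.append("graceful")     # drain over the configured drain time
    if skip_exit:
        flags.append("skip_exit")    # do not exit once draining completes
    if inbound_only:
        flags.append("inboundonly")  # drain inbound listeners only
    url = urlunsplit(("http", host, "/drain_listeners", "&".join(flags), ""))
    # An empty body is enough; the flags travel in the query string.
    return Request(url, data=b"", method="POST")


if __name__ == "__main__":
    req = drain_listeners_request()
    print(req.get_method(), req.full_url)
```

To actually trigger the drain, the request would be passed to `urllib.request.urlopen(req)` against a running Envoy.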

Gap

We handle tens of millions of persistent connections, and we have competing goals for how long draining should take during a hot restart versus during Envoy shutdown or a /drain_listeners?graceful call.

For hot restart, where we update listener configs across thousands of Envoy nodes simultaneously, we want a long drain of old listeners (more than an hour) to prevent a storm of WebSocket re-handshakes.

For Envoy shutdown (via POST /drain_listeners?graceful), which happens one node at a time, we want a much steeper, shorter graceful drain of roughly 5-10 minutes (relevant to the EC2 Auto Scaling shutdown time).

In both cases, we want Envoy and its listeners to keep accepting connections while applying H2 GOAWAY or H1 Connection: Close to in-flight transactions. We don't want to immediately "close" listeners and reject connections or requests.
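To make the "slope" difference concrete, here is a deliberately simplified model of gradual draining. It assumes the chance that a given request is told to go away (GOAWAY or Connection: Close) grows linearly from 0 to 1 over the drain time. Envoy's real behaviour may differ in detail; the function name and the linear ramp are illustrative assumptions.

```python
def drain_probability(elapsed_s: float, drain_time_s: float) -> float:
    """Simplified model: fraction of requests discouraged at a point in time.

    Assumption: a linear ramp from 0 to 1 across the drain window.
    """
    if drain_time_s <= 0:
        return 1.0  # no window: behaves like an immediate drain
    return min(1.0, max(0.0, elapsed_s / drain_time_s))


if __name__ == "__main__":
    # Compare a one-hour hot-restart drain with a five-minute shutdown drain.
    for elapsed in (60, 150, 300):
        slow = drain_probability(elapsed, 3600)
        fast = drain_probability(elapsed, 300)
        print(f"t={elapsed:>4}s  1h drain: {slow:.3f}  5min drain: {fast:.3f}")
```

Under this model, a single shared drain time cannot serve both cases: a value long enough to spread reconnects across an hour leaves most connections untouched when the instance has to shut down within minutes.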

Solutions

We're thinking of three possible solutions:

Add a query string argument to POST /drain_listeners?graceful that lets the operator specify a shorter drain time, e.g. POST /drain_listeners?graceful&drain-time-s=300
Add a --drain-time-s-triggered flag, applied to POST /drain_listeners?graceful, that lets the operator specify a shorter drain time.
Allow a different choice of immediate vs gradual drain-strategy for hot restarts vs calls to /drain_listeners (this is probably acceptable, but less preferable).
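The first two options boil down to a precedence rule for which drain time applies. The sketch below is purely hypothetical: neither the `drain-time-s` query argument nor `--drain-time-s-triggered` exists in Envoy today, and the function and constant names are invented for illustration.

```python
from typing import Optional

HOT_RESTART = "hot_restart"   # drain triggered by a hot restart
ADMIN_DRAIN = "admin_drain"   # drain triggered by POST /drain_listeners?graceful


def effective_drain_time_s(trigger: str,
                           drain_time_s: int,
                           triggered_drain_time_s: Optional[int] = None,
                           query_override_s: Optional[int] = None) -> int:
    """Pick the drain time under the proposed (hypothetical) options.

    drain_time_s            existing --drain-time-s value
    triggered_drain_time_s  proposed --drain-time-s-triggered (option 2)
    query_override_s        proposed ?drain-time-s=N argument (option 1)
    """
    if trigger == ADMIN_DRAIN:
        if query_override_s is not None:
            return query_override_s        # per-call value wins
        if triggered_drain_time_s is not None:
            return triggered_drain_time_s  # then the dedicated flag
    return drain_time_s                    # hot restart keeps today's value


if __name__ == "__main__":
    print(effective_drain_time_s(HOT_RESTART, 3600, 600, 300))
    print(effective_drain_time_s(ADMIN_DRAIN, 3600, 600, 300))
```

Either option leaves hot-restart behaviour unchanged, so existing deployments that set only --drain-time-s would see no difference.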

Relevant docs:
https://www.envoyproxy.io/docs/envoy/latest/operations/cli#cmdoption-drain-strategy
https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/operations/draining
https://www.envoyproxy.io/docs/envoy/latest/operations/admin#operations-admin-interface-drain

adisuissa commented 3 months ago

cc @ravenblackx who has been working on hot-restart recently

github-actions[bot] commented 2 months ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

bmcalary-atlassian commented 2 months ago

Can we add no stalebot?

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

bmcalary-atlassian commented 1 month ago

Please mark no-stalebot.

github-actions[bot] commented 1 week ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

bmcalary-atlassian commented 1 week ago

Please mark no-stalebot.

envoyproxy / envoy

Different drain time for POST drain_listeners vs LDS replacement/update drain time #34500
