Open bmcalary-atlassian opened 3 months ago
cc @ravenblackx who has been working on hot-restart recently
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
Can we add no stalebot?
Please mark no-stalebot.
Current situation
Envoy offers a few existing options around listener draining during hot-restart and process shutdown.
- --drain-strategy allows a choice between immediate or gradual (i.e. over drain-time-s) draining for hot-restarts.
- --drain-time-s governs how long graceful draining (sending H2 GOAWAY or H1 Connection: Close) will take if triggered.
- POST /drain_listeners?graceful&skip_exit&inboundonly allows an operator to tell Envoy to begin draining listeners, but still accept new connections.
Gap
A problem emerges because we handle tens of millions of persistent connections and have competing goals for hot-restart drain length vs. Envoy shutdown (or /drain_listeners?graceful) drain length.
For hot restart, where we update listener configs across thousands of Envoy nodes simultaneously, we want a long drain (1+ hours) of old listeners to prevent a storm of websocket re-handshakes.
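To make the re-handshake concern concrete, here is a rough back-of-the-envelope sketch. It assumes connections are shed uniformly across the drain window, which only approximates gradual draining, and it uses a hypothetical fleet-wide total of 20 million connections; neither figure comes from a measurement.

```python
def avg_reconnect_rate(connections: int, drain_seconds: float) -> float:
    """Average re-handshakes per second if `connections` are shed
    uniformly over `drain_seconds` (a simplifying assumption)."""
    if drain_seconds <= 0:
        raise ValueError("drain_seconds must be positive")
    return connections / drain_seconds


# Hypothetical fleet-wide connection count, for illustration only.
FLEET_CONNECTIONS = 20_000_000

if __name__ == "__main__":
    for label, window_s in (("1 hour", 3600), ("10 minutes", 600)):
        rate = avg_reconnect_rate(FLEET_CONNECTIONS, window_s)
        print(f"{label} drain: ~{rate:,.0f} re-handshakes/s")
```

Under these assumptions, shrinking the fleet-wide drain window from one hour to ten minutes multiplies the average re-handshake rate by six, which is why a single short --drain-time-s does not suit the hot-restart case.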
For Envoy shutdown (via POST /drain_listeners?graceful), which happens 1 node at a time, we want a much steeper/shorter graceful draining slope of ~5-10 minutes (relevant to Auto Scaling EC2 shutdown time).
In both cases, we want Envoy and the listeners to continue accepting connections and applying H2 GOAWAY or H1 Connection: Close to transactions. We don't want to immediately "close" listeners and reject connections/requests.
Solutions
We're thinking of three possible solutions:
- A parameter on POST /drain_listeners?graceful which allows the operator to specify a shorter drain time, e.g. POST /drain_listeners?graceful&drain-time-s=300
- A --drain-time-s-triggered flag for POST /drain_listeners?graceful which allows the operator to specify a shorter drain time.
- Separate immediate vs gradual drain-strategy settings for hot-restarts vs calling /drain_listeners (this is probably acceptable, but less preferable).
Relevant docs:
https://www.envoyproxy.io/docs/envoy/latest/operations/cli#cmdoption-drain-strategy
https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/operations/draining
https://www.envoyproxy.io/docs/envoy/latest/operations/admin#operations-admin-interface-drain
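For illustration, a minimal client-side sketch of how the first proposal might be called. The admin address http://localhost:9901 and the helper names are assumptions, and the drain-time-s query parameter is only the one proposed in this issue, not something Envoy accepts today. The graceful, skip_exit and inboundonly parameters are the existing ones mentioned above.

```python
from typing import Optional
from urllib.request import Request, urlopen


def drain_listeners_url(
    admin: str = "http://localhost:9901",  # assumed admin address
    graceful: bool = True,
    skip_exit: bool = False,
    inboundonly: bool = False,
    proposed_drain_time_s: Optional[int] = None,
) -> str:
    """Build a /drain_listeners URL from the flags discussed in this issue."""
    params = []
    if graceful:
        params.append("graceful")
    if skip_exit:
        params.append("skip_exit")
    if inboundonly:
        params.append("inboundonly")
    if proposed_drain_time_s is not None:
        # PROPOSED in this issue; not an existing Envoy admin parameter.
        params.append(f"drain-time-s={int(proposed_drain_time_s)}")
    query = "&".join(params)
    return f"{admin}/drain_listeners" + (f"?{query}" if query else "")


def post_drain(url: str, timeout: float = 5.0) -> int:
    """POST to the admin endpoint and return the HTTP status code."""
    req = Request(url, data=b"", method="POST")
    with urlopen(req, timeout=timeout) as resp:
        return resp.status


if __name__ == "__main__":
    # Prints the request that a 5-minute node shutdown drain would send.
    print("POST", drain_listeners_url(proposed_drain_time_s=300))
```

With a per-request override like this, --drain-time-s could stay long for fleet-wide hot restarts while a node being terminated asks for a 300-second drain.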