In the context of Envoy Gateway, a reconciliation crash would have several undesired side effects:
- last-known-good xDS caches would be deleted and not recovered after a restart
- the infra manager would be disrupted during infra reconciliation, possibly leaving the infra in an inconsistent state where only some changes are applied

If a crash occurs during an upgrade, Envoy proxies may be replaced (e.g. because a new proxy version is being used) while the control plane provides no configuration, leading to a complete outage for users.
Envoy Gateway should consider recovering from panics by default, or allowing users to opt in to panic recovery. If implemented, metrics should be exposed so that operators can detect that xDS translation is broken.
Description: Currently, a panic in the reconciliation flow of Envoy Gateway (EG) crashes the whole process, as seen in #4291, #2661, #1830, and #2882.
Controller frameworks like controller-runtime and apimachinery provide the means to recover from panics.