envoyproxy / gateway

Manages Envoy Proxy as a Standalone or Kubernetes-based Application Gateway
https://gateway.envoyproxy.io
Apache License 2.0

Recover from reconciler panics #4332

Open guydc opened 3 days ago

guydc commented 3 days ago

Description: Currently, a panic in the reconciliation flow of Envoy Gateway (EG) crashes the whole process, as seen in #4291, #2661, #1830 and #2882.

Controller frameworks like controller-runtime and apimachinery provide the means to recover from panics.
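For illustration, the underlying mechanism is Go's `defer`/`recover`. The sketch below uses only the standard library and is not the API of either framework; `reconcileFunc` and `safeReconcile` are hypothetical names, not existing Envoy Gateway code. It shows a panic being converted into an ordinary error so the process survives and the next pass can still run.

```go
package main

import "fmt"

// reconcileFunc is a hypothetical stand-in for one reconciliation pass.
type reconcileFunc func() error

// safeReconcile runs fn and converts a panic into an ordinary error,
// so the caller can log it and retry instead of crashing the process.
func safeReconcile(fn reconcileFunc) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered from reconciler panic: %v", r)
		}
	}()
	return fn()
}

func main() {
	// Simulate a bug in the reconciliation flow.
	err := safeReconcile(func() error { panic("boom") })
	fmt.Println("first pass:", err)

	// The process is still alive and can handle the next pass.
	err = safeReconcile(func() error { return nil })
	fmt.Println("second pass:", err)
}
```

Note that `recover` only catches panics raised on the goroutine that runs the deferred function, so each goroutine involved in reconciliation would need its own wrapper.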

In the context of Envoy Gateway, a reconciliation crash would have several undesired side effects:

If a crash occurs during an upgrade, there is a risk that Envoy proxies are replaced (e.g. because a new proxy version is being used) while the control plane provides no configuration, leading to a complete outage for users.

Envoy Gateway should consider recovering from panics by default, or allowing users to opt in to panic recovery. If implemented, metrics should be exposed so that operators are made aware that xDS translation is broken.
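A rough sketch of what opt-in recovery with a counter could look like. Everything here is an assumption for discussion: `recoverPanics` stands in for whatever configuration knob would be chosen, `recoveredPanics` stands in for a real metrics counter (a plain atomic is used to keep the example self-contained), and `runTranslation` is a hypothetical wrapper, not an existing function.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// recoveredPanics is a hypothetical stand-in for a real metrics counter
// that operators could alert on.
var recoveredPanics atomic.Int64

// runTranslation runs translate. When recoverPanics is true, a panic is
// counted and returned as an error; when false, the panic propagates
// and crashes the process as it does today.
func runTranslation(recoverPanics bool, translate func() error) (err error) {
	if recoverPanics {
		defer func() {
			if r := recover(); r != nil {
				recoveredPanics.Add(1)
				err = fmt.Errorf("xDS translation panicked: %v", r)
			}
		}()
	}
	return translate()
}

func main() {
	err := runTranslation(true, func() error { panic("bad route") })
	fmt.Println("error:", err)
	fmt.Println("recovered panics:", recoveredPanics.Load())
}
```

With a counter like this, a non-zero value tells operators that the control plane is up but translation is failing, which would otherwise be invisible once the crash is suppressed.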