control plane fail-static backup: XDS fallback to low dependency config store when control plane is down

stevenzzzz commented 2 years ago

Title: control plane fail-static backup: XDS fallback to low dependency config store when control plane is down

Description: In outages such as a gRPC library error, a network issue, or a global control plane outage, we'd want the proxy server to keep working with its existing config. But that config is lost if a running Envoy restarts, in which case the data plane goes down in cascade with the control plane.

This is extremely bad if a control plane outage coincides with a data-plane update: we'd lose all our server capacity (zonal or even global).

While waiting for the control plane to come back, we'd like to have a fail-static backup config store that allows restarted Envoys to load [possibly degraded/stale] config and serve traffic as if the control plane were just slow in delivering config data.

Ideally this "fail static" backup config store would:

Automatically kick in on connection failures to management servers (once retries are exhausted).
Be reliable and highly available, most likely zonally available (i.e., available from within the same zone). Preferably "disk-like", so that it can survive outages caused by, say, a gRPC library bug or a network outage. (For cloud users, a highly available GCS server or blob store could be such a storage layer. For Envoy Mobile, it could be the local cache layer used while the control plane connection is unavailable.)
Support filtering: config data in the store would be sharded or mixed, so we would need some mechanism to "filter" the resource streams down to what a given Envoy needs.
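To make the three requirements above concrete, here is a minimal Python sketch of the intended behavior, not Envoy code. The names `load_with_fail_static`, `fetch_primary`, `read_backup`, and `wanted` are hypothetical, and the resource shape (a dict keyed by resource name) is an assumption made only for illustration.

```python
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical shape: resource name -> resource body.
Resources = Dict[str, dict]


def load_with_fail_static(
    fetch_primary: Callable[[], Resources],
    read_backup: Callable[[], Resources],
    wanted: Iterable[str],
    max_attempts: int = 3,
) -> Tuple[str, Resources]:
    """Return (source, resources): primary if reachable, else filtered backup."""
    for _ in range(max_attempts):
        try:
            return "primary", fetch_primary()
        except ConnectionError:
            continue  # keep retrying until the attempts are drained

    # Retries drained: fall back to the (possibly stale) backup store.
    wanted_set = set(wanted)
    backup = read_backup()
    # The store may be mixed or sharded, so keep only what this proxy asked for.
    return "backup", {name: body for name, body in backup.items() if name in wanted_set}
```

A caller would treat a `"backup"` result as degraded config and keep trying the management server in the background.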

@adisuissa @htuch

adisuissa commented 2 years ago

I agree that a fallback mechanism between different config sources should be added. For this we first need to define what exactly an outage is and how it is detected; then Envoy will be able to fall back to a different config source. There is an open question of how/when to try to switch back to the "primary" config source. I believe this is the "easy" part, and we can converge on a design for this pretty quickly.
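As one possible reading of "define an outage, detect it, and decide when to switch back", here is a small Python sketch. It assumes an outage means N consecutive failed attempts against the primary, and that recovery means M consecutive successful probes; both thresholds and the `SourceSelector` class are hypothetical and not part of any Envoy API.

```python
class SourceSelector:
    """Tracks which config source is active, based on primary health results."""

    def __init__(self, failure_threshold: int = 3, recovery_threshold: int = 2) -> None:
        self.failure_threshold = failure_threshold
        self.recovery_threshold = recovery_threshold
        self.active = "primary"
        self._failures = 0    # consecutive failures while on primary
        self._recoveries = 0  # consecutive successful probes while on fallback

    def record_primary_result(self, ok: bool) -> str:
        """Record one connection attempt (or probe) against the primary."""
        if ok:
            self._failures = 0
            if self.active == "fallback":
                self._recoveries += 1
                if self._recoveries >= self.recovery_threshold:
                    # Primary looks healthy again: switch back.
                    self.active = "primary"
                    self._recoveries = 0
        else:
            self._recoveries = 0
            if self.active == "primary":
                self._failures += 1
                if self._failures >= self.failure_threshold:
                    # Declared an outage: switch to the fallback source.
                    self.active = "fallback"
                    self._failures = 0
        return self.active
```

Requiring several consecutive successes before switching back is one way to avoid flapping between sources while the primary is only intermittently reachable.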

I think that the FileSystem subscription can be built upon for your scenario. It probably won't need to support delta-xDS, but will need to understand/emulate the functionality that is being provided by the gRPC subscriptions (such as support for a node's metadata and the xDSTP naming and protocol). We should avoid adding server-specific code (e.g., filtering) into this module, but I think we should support extensions where that kind of code can be added.

IMHO, a file system should not be the encouraged way to solve this problem for most use-cases. A better approach would be to have multiple gRPC config servers in the same zone that all serve the same config. To tackle the gRPC lib issue, the rollout of the server binary should be done gradually. And if there's a network outage in the zone, the highly available filesystem store might be unavailable as well.

That said, I do see some use-cases where a file system might be useful, specifically for Envoy Mobile scenarios. Note that in these use-cases it will also be beneficial if Envoy has a way to store the config to disk. This needs further discussion, as writing to disk on the main thread might incur a high penalty.

mattklein123 commented 2 years ago

This issue has been brought up again and again over the years. In general I'm strongly against building caching into Envoy, as I think it's working around more fundamental design issues wrt control plane resilience. Obviously new instances cannot get any config.

With that said, I think building fallback config sources is the way to solve this with the potential for FS fallback (as long as we don't build caching into Envoy directly), so in general this sgtm.

htuch commented 2 years ago

Yeah, I think the plan here would be to allow Envoy to read from a static config source with filesystem-like semantics, but it would never write. It would essentially be a static read-only source to be used while other config sources are not functioning. The actual caching and updating of this would take place completely outside of Envoy. We also wouldn't build in specific filesystems; this would be an extension providing some subset of POSIX-like file operations. The existing file xDS would be a good foundation, as @adisuissa suggests, with potential improvements for filtering down configurations with dynamic context parameters.
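A rough Python sketch of that idea, purely illustrative: a source that only ever reads, and that filters resources by context parameters. The `ReadOnlyConfigSource` class, the one-JSON-file-per-resource layout, and the `context_params` field are all assumptions made for this example; they are not the file xDS format.

```python
import json
from pathlib import Path
from typing import Dict, List


class ReadOnlyConfigSource:
    """Read-only view over a directory of JSON resource files (hypothetical layout)."""

    def __init__(self, root: Path) -> None:
        self._root = Path(root)

    def load(self, context: Dict[str, str]) -> List[dict]:
        """Return the resources whose declared context parameters match `context`."""
        selected = []
        for path in sorted(self._root.glob("*.json")):
            resource = json.loads(path.read_text())
            params = resource.get("context_params", {})
            # Keep a resource only if every parameter it declares matches
            # the proxy's context; resources with no parameters always match.
            if all(context.get(key) == value for key, value in params.items()):
                selected.append(resource)
        return selected
```

The class exposes no write path at all, so whatever keeps the directory up to date has to live outside the proxy, as described above.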

stevenzzzz commented 2 years ago

+1 to Harvey and All, all very good feedback.

This is not a caching layer in Envoy. It is more of a disaster recovery measure, in addition to the existing support for multiple API config sources (delta xDS is not supported yet), to defend against a global config server meltdown or a broken link to the config server.

Let's have a design doc and move discussion there.

adisuissa commented 2 years ago

cc @abeyad

alyssawilk commented 2 years ago

cc @alyssawilk :-P

dastbe commented 2 years ago

this would be especially useful for deployments of envoy where the lifecycle of envoy is independent of what it's serving traffic for, e.g. being able to bootstrap a host agent with a recent state regardless of control plane availability.

> Note that in these use-cases it will also be beneficial if Envoy has a way to store the config to disk. This needs further discussion, as writing to disk on the main thread might incur a high penalty.

it looks like you could build this yourself on top of the [config dump api|https://www.envoyproxy.io/docs/envoy/latest/api-v3/admin/v3/config_dump.proto#envoy-v3-api-msg-admin-v3-configdump] but having a way to get envoy to materialize a local backup on demand that was directly reloadable would be nice.
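a minimal sketch of the "build it yourself" side, assuming an external agent has already fetched the JSON from the admin `/config_dump` endpoint. `save_snapshot` and `load_snapshot` are hypothetical helper names, and the dump is stored as-is; turning it into something envoy can load directly would likely need an extra conversion step that is not shown here.

```python
import json
import os
import tempfile
from pathlib import Path


def save_snapshot(config_dump: dict, dest: Path) -> None:
    """Atomically write a config snapshot so readers never see a partial file."""
    dest = Path(dest)
    # Write to a temp file in the same directory, then rename over the target.
    fd, tmp_name = tempfile.mkstemp(dir=dest.parent, prefix=dest.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(config_dump, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the bytes reach the disk
        os.replace(tmp_name, dest)  # atomic when both paths are on one filesystem
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def load_snapshot(src: Path) -> dict:
    """Read back a snapshot written by save_snapshot."""
    return json.loads(Path(src).read_text())
```

the write-then-rename pattern matters because a crash mid-write would otherwise leave a truncated backup, which is the worst time to discover it.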

envoyproxy / envoy

control plane fail-static backup: XDS fallback to low dependency config store when control plane is down #21644