envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0

Filesystem xDS races on updates leading to incorrect rejections #31678

Open dimo414 opened 10 months ago

dimo414 commented 10 months ago

What issue is being seen?

I am observing occasional situations where my application unexpectedly fails to come online correctly, as requests to backends that are up, healthy, and registered with Envoy via the filesystem xDS are rejected. Here's some simplified code from the startup program:

while True:
    services_to_add = poll_for_new_services()
    if services_to_add:
        for service_name in services_to_add:
            cluster = f"service_{service_name}"
            # creates a config.cluster.v3.Cluster to be written to the CDS file
            envoy_cfg.add_cluster(cluster, service_name)
            # creates a config.route.v3.Route to be added to the main
            # config.listener.v3.Listener in the LDS file
            envoy_cfg.add_route(f"/{service_name}", cluster)
        # refresh the CDS and LDS files
        # note CDS is refreshed first because an LDS update will fail if it
        # mentions any unknown clusters
        envoy_cfg.refresh_cds()
        envoy_cfg.refresh_lds()
    time.sleep(POLLING_INTERVAL)

The note there might give it away, but I'm seeing fairly frequent lds.update_rejected events despite this ordering; i.e. Envoy sometimes processes an LDS file update before the earlier CDS update. Generally this has proven benign because a subsequent iteration's writes happen to be read in the correct order and succeed, but occasionally the final time new services are discovered the writes race, the LDS update is rejected, and then the CDS update is accepted, leaving the application broken even though the configs on disk are compatible with one another.

It's not shocking that these writes and reads could race, but it is very unfortunate that Envoy rejects LDS updates that are actually valid with no clear way to recover from this state. Ideally Envoy would fail more gracefully when given an incompatible route, or re-evaluate the LDS config it rejected when it observes a new CDS config.

I'm very much open to suggestions for other ways to implement this, but IMO it's a bug in Envoy that you can enter (and become stuck in) this state.

Repro steps:

Write new clusters and routes to a pair of CDS and LDS files repeatedly, CDS first, and observe rejected updates even though the file contents on-disk are never incompatible with one another.
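For concreteness, here is a sketch of the write side of the repro, assuming Envoy is configured with path-based CDS/LDS subscriptions reading JSON-formatted files. The file paths, cluster fields, and helper names are illustrative, and the LDS body is left to the caller; even with atomic renames, the two file-change events can still be delivered out of order:

```python
import json
import os
import tempfile

def atomic_write(path, doc):
    """Write a JSON document atomically (tmp file + rename), so the
    xDS file watcher never observes a half-written file."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(doc, f)
    os.replace(tmp, path)

def cds_doc(cluster_names):
    """A minimal DiscoveryResponse-shaped body for the CDS file
    (cluster fields here are illustrative, not a complete config)."""
    return {
        "version_info": str(len(cluster_names)),
        "resources": [
            {
                "@type": "type.googleapis.com/envoy.config.cluster.v3.Cluster",
                "name": name,
                "connect_timeout": "1s",
                "type": "STRICT_DNS",
            }
            for name in cluster_names
        ],
    }

def refresh(cds_path, lds_path, cluster_names, lds_doc):
    """One repro iteration: CDS is written first, then LDS, yet Envoy's
    watcher may still deliver the LDS change before the CDS one."""
    atomic_write(cds_path, cds_doc(cluster_names))
    atomic_write(lds_path, lds_doc)
```

Running `refresh` in a loop while the LDS document adds routes referencing the newly added clusters is enough to see occasional lds.update_rejected increments.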

phlax commented 10 months ago

cc @adisuissa

github-actions[bot] commented 9 months ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

adisuissa commented 9 months ago

I'm not sure about the LDS->CDS dependency as AFAIK it is not required for xDS-subscriptions (i.e., with a gRPC-server that handles ACKs/NACKs), so I'm not sure exactly what you are observing.

I'm guessing that the OS can decide to invoke the watcher callback at any time, and so a race may happen here. One way to mitigate this is to only update the LDS after the CDS update is successful. This can be done by probing the admin endpoint stats and looking at the value of update_time or version (see https://www.envoyproxy.io/docs/envoy/latest/configuration/overview/mgmt_server#xds-subscription-statistics).

dimo414 commented 9 months ago

I'm not sure about the LDS->CDS dependency as AFAIK it is not required for xDS-subscriptions (i.e., with a gRPC-server that handles ACKs/NACKs), so I'm not sure exactly what you are observing.

I have definitely observed an ordering constraint between the two files - if you add a route to an unknown cluster, the LDS update is rejected. Conversely, adding an unused cluster is fine. But maybe I'm misunderstanding what you're referring to.

I'm guessing that the OS can decide to invoke the watcher callback at any time, and so a race may happen here.

I assume that's what's happening, yeah. It's also possible that the race is in Envoy itself if the LDS and CDS callbacks can run concurrently and the CDS update simply takes longer to commit. Either way though I think Envoy shouldn't assume updates are always delivered in-order.

One way to mitigate this is to only update the LDS after the CDS update is successful.

That makes sense to me, or just re-read the LDS file on any CDS update (it's possible to make an LDS update without a paired CDS update, though I expect that's less common).

This issue has been automatically marked as stale because it has not had activity in the last 30 days.

Can we please turn off the stalebot for this issue?

adisuissa commented 9 months ago

One way to mitigate this is to only update the LDS after the CDS update is successful.

That makes sense to me, or just re-read the LDS file on any CDS update (it's possible to make an LDS update without a paired CDS update, though I expect that's less common).

The xDS protocol is an eventually consistent protocol, and rejection feedback is not supported for the filesystem transport. Also, there may be other dependencies, so all configs would need to be reloaded on a periodic basis, which does not sound like a very good option. The use-case you are describing was solved in the gRPC-based transport protocol by introducing ADS and ensuring that the server will only send an update after the previous updates were ACKed.

What I suggested was mimicking the same behavior in the script that writes the config to the filesystem, by probing Envoy for the latest update and ensuring it was successfully applied before moving to the next config.

dimo414 commented 9 months ago

You're saying anyone using filesystem xDS needs to implement their own probing, to work around the limitations in the protocol? That's pretty unfortunate. Why can't Envoy observe this invalid state and self-correct?