dimo414 opened this issue 10 months ago
cc @adisuissa
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
I'm not sure about the LDS->CDS dependency as AFAIK it is not required for xDS-subscriptions (i.e., with a gRPC-server that handles ACKs/NACKs), so I'm not sure exactly what you are observing.
I'm guessing that the OS can decide to invoke the watcher callback at any time, and so a race may happen here.
One way to mitigate this is to only update the LDS after the CDS update is successful. This can be done by probing the admin endpoint stats and looking at the value of `update_time` or `version` (see https://www.envoyproxy.io/docs/envoy/latest/configuration/overview/mgmt_server#xds-subscription-statistics).
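For concreteness, a minimal sketch of that kind of probe (the admin address below is an assumption; the stat names are the per-subscription statistics from the page above, under the `cluster_manager.cds.` prefix used for CDS):

```python
import urllib.request

ADMIN = "http://127.0.0.1:9901"  # assumed admin address/port

def cds_stats():
    """Fetch the CDS subscription stats (cluster_manager.cds.version,
    update_success, update_rejected, update_time, ...) as a dict."""
    url = ADMIN + "/stats?filter=cluster_manager.cds"
    body = urllib.request.urlopen(url).read().decode()
    stats = {}
    for line in body.splitlines():
        name, _, value = line.partition(": ")
        if value.strip().isdigit():  # skip histogram lines
            stats[name] = int(value)
    return stats
```

A writer script can record `cluster_manager.cds.version` (or `update_time`) before replacing the CDS file and wait for it to change before writing the LDS file.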
> I'm not sure about the LDS->CDS dependency as AFAIK it is not required for xDS-subscriptions (i.e., with a gRPC-server that handles ACKs/NACKs), so I'm not sure exactly what you are observing.
I have definitely observed an ordering constraint between the two files - if you add a route to an unknown cluster the LDS update is rejected. Conversely adding an unused cluster is fine. But maybe I'm misunderstanding what you're referring to.
> I'm guessing that the OS can decide to invoke the watcher callback at any time, and so a race may happen here.
I assume that's what's happening, yeah. It's also possible that the race is in Envoy itself if the LDS and CDS callbacks can run concurrently and the CDS update simply takes longer to commit. Either way though I think Envoy shouldn't assume updates are always delivered in-order.
> One way to mitigate this is to only update the LDS after the CDS update is successful.
That makes sense to me, or just re-read the LDS file on any CDS update (it's possible to make an LDS update without a paired CDS update, though I expect that's less common).
> This issue has been automatically marked as stale because it has not had activity in the last 30 days.
Can we please turn off the stalebot for this issue?
> > One way to mitigate this is to only update the LDS after the CDS update is successful.
>
> That makes sense to me, or just re-read the LDS file on any CDS update (it's possible to make an LDS update without a paired CDS update, though I expect that's less common).
The xDS protocol is an eventually consistent protocol, and there is no rejection (NACK) feedback for filesystem-based subscriptions. There may also be other dependencies, so all configs would have to be reloaded on a periodic basis, which doesn't sound like a very good option. The use case you are describing was solved in the gRPC-based transport protocol by introducing ADS and ensuring that the server only sends an update after the previous updates have been ACKed.

What I suggested was mimicking the same behavior in the script that writes the config to the filesystem, by probing Envoy for the latest update and ensuring it was successfully applied before moving on to the next config.
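A rough sketch of that sequencing, reusing the `cds_stats` probe sketched earlier in the thread (the file paths and the 5s timeout are assumptions):

```python
import os
import time

def write_atomically(path, contents):
    # Envoy's filesystem watch is driven by file moves, so write a temp file
    # and rename it into place rather than editing the file in place.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(contents)
    os.replace(tmp, path)

def push_update(cds_yaml, lds_yaml, timeout=5.0):
    before = cds_stats()  # the admin-endpoint probe sketched earlier
    write_atomically("/etc/envoy/cds.yaml", cds_yaml)

    # Mimic the ADS ACK flow: wait until Envoy reports the CDS update as
    # applied (version changed, no new rejection) before touching LDS.
    # (Assumes the CDS contents actually changed; otherwise the version
    # gauge stays the same and this times out.)
    deadline = time.time() + timeout
    while time.time() < deadline:
        now = cds_stats()
        if now.get("cluster_manager.cds.version") != before.get("cluster_manager.cds.version"):
            break
        if now.get("cluster_manager.cds.update_rejected", 0) > before.get("cluster_manager.cds.update_rejected", 0):
            raise RuntimeError("CDS update was rejected; not writing LDS")
        time.sleep(0.1)
    else:
        raise RuntimeError("timed out waiting for the CDS update to apply")

    write_atomically("/etc/envoy/lds.yaml", lds_yaml)
```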
You're saying anyone using filesystem xDS needs to implement their own probing, to work around the limitations in the protocol? That's pretty unfortunate. Why can't Envoy observe this invalid state and self-correct?
I am observing occasional situations where my application unexpectedly fails to come online correctly, as requests to backends that are up, healthy, and registered with Envoy via the filesystem xDS are rejected. Here's some simplified code from the startup program:
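(Sketched below in rough shape; the file paths and the `discover_services`/`render_*` helpers are placeholders standing in for the real discovery and templating logic, not the actual script.)

```python
import time

# Placeholders for the application's own discovery and config templating.
def discover_services(): return []
def render_cds(services): return "resources: []\n"
def render_lds(services): return "resources: []\n"

def write_config(path, contents):
    # (The real script would typically write a temp file and rename it into
    # place, since Envoy's filesystem watch reacts to file moves.)
    with open(path, "w") as f:
        f.write(contents)

while True:
    services = discover_services()
    # NOTE: the CDS file is written before the LDS file, because the routes
    # in the listener config reference clusters that must already be known.
    write_config("/etc/envoy/cds.yaml", render_cds(services))
    write_config("/etc/envoy/lds.yaml", render_lds(services))
    time.sleep(10)
```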
The note there might give it away, but I'm seeing fairly frequent `lds.update_rejected` events despite this ordering; i.e. Envoy will potentially process an LDS file update before an earlier CDS update. Generally this has proven benign because a subsequent loop's writes get read in the correct order and succeed, but occasionally the final time it discovers new services the writes race and Envoy ends up in a state where the LDS update is rejected, then the CDS update is accepted, leaving the application broken even though the configs on disk are compatible.

It's not shocking that these writes and reads could race, but it is very unfortunate that Envoy rejects LDS updates that are actually valid, with no clear way to recover from this state. Ideally Envoy would fail more gracefully when given an incompatible route, or re-evaluate the LDS config it rejected when it observes a new CDS config.
I'm very much open to suggestions for other ways to implement this, but IMO it's a bug in Envoy that you can enter (and become stuck in) this state.
Repro steps:
Write new clusters and routes to a pair of CDS and LDS files repeatedly, CDS first, and observe rejected updates even though the file contents on-disk are never incompatible with one another.
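To spot the failure while reproducing, a small poller like the following can run alongside the writer loop (the admin address and the full stat name `listener_manager.lds.update_rejected` are assumptions based on the standard xDS subscription statistics):

```python
import time
import urllib.request

ADMIN = "http://127.0.0.1:9901"  # assumed admin address/port

def counter(name):
    """Read a single counter from the admin /stats endpoint."""
    body = urllib.request.urlopen(f"{ADMIN}/stats?filter={name}").read().decode()
    for line in body.splitlines():
        if line.startswith(name + ":"):
            return int(line.split(":", 1)[1])
    return 0

# Flags the failure mode where an LDS update is rejected even though the
# files on disk are mutually consistent.
last = counter("listener_manager.lds.update_rejected")
while True:
    now = counter("listener_manager.lds.update_rejected")
    if now > last:
        print(f"LDS rejection observed (total {now}); "
              "check whether the paired CDS update had been applied yet")
        last = now
    time.sleep(1)
```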