envoyproxy / envoy


xDS ACK/NACK are not sufficient for data plane health monitoring #35675

Open kyessenov opened 2 months ago

kyessenov commented 2 months ago

The current xDS protocol feedback semantics are too limited to validate that a configuration is effective on the data plane.

It seems the flaw is that the inline ACK/NACK in the xDS protocol is not expressive enough to cover all the possible error states. The only value of the current ACK/NACK is in rejecting "syntactically" incorrect responses, which can be detected immediately but are quite rare. I believe a better architecture is needed, as mentioned in https://github.com/envoyproxy/envoy/issues/11396. We do not need the full complexity of reverse subscriptions, but the key is that the status is sent asynchronously, per resource, and independently from xDS requests.
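To make the "asynchronously, per resource, and independently from xDS requests" idea concrete, here is a minimal sketch of what a control plane might keep per resource. Everything in it is an assumption for illustration: `ResourceState`, `ResourceStatus`, `StatusStore`, and the state names are hypothetical and are not part of the xDS protocol or of Envoy.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Tuple


class ResourceState(Enum):
    # Hypothetical states; inline xDS feedback only distinguishes ACK from NACK.
    ACCEPTED = "accepted"  # passed validation, not yet serving traffic
    WARMING = "warming"    # waiting on dependencies before becoming active
    ACTIVE = "active"      # in effect on the data plane
    FAILED = "failed"      # accepted earlier, but could not be applied


@dataclass(frozen=True)
class ResourceStatus:
    type_url: str
    name: str
    version: str
    state: ResourceState
    detail: str = ""


@dataclass
class StatusStore:
    """Latest status per (type_url, name), updated outside the xDS request flow."""
    _latest: Dict[Tuple[str, str], ResourceStatus] = field(default_factory=dict)

    def report(self, status: ResourceStatus) -> None:
        # A client may report at any time, not only in reply to a response.
        self._latest[(status.type_url, status.name)] = status

    def is_effective(self, type_url: str, name: str, version: str) -> bool:
        # Effective means: this exact version is reported as ACTIVE.
        s = self._latest.get((type_url, name))
        return (
            s is not None
            and s.version == version
            and s.state is ResourceState.ACTIVE
        )
```

With a store like this, "is version v2 of cluster `backend` effective?" is a per-resource lookup rather than an inference from the last ACK on the stream.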

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions[bot] commented 1 month ago

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.

markdroth commented 3 weeks ago

I think that for inter-dependent resources, the ACK/NACK can't represent a per-resource state. For example, if there's a route config resource that might work with some listeners but not other listeners, then it's not a problem with that individual route config resource; it's a problem with the overall config that tried to put that route config together with a listener that it can't work for. But the route config itself is not invalid -- and in fact, it's entirely possible that there's also another listener using the same route config where the combination is totally valid.
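As a toy illustration of that point, the sketch below models a route config that is valid in isolation but only works with some listeners. The model is deliberately simplified and hypothetical: `required_filters`, `filters`, and `invalid_pairs` are made-up names and do not correspond to real Envoy fields or checks.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class RouteConfig:
    name: str
    required_filters: FrozenSet[str]  # hypothetical: filters the routes rely on


@dataclass(frozen=True)
class Listener:
    name: str
    filters: FrozenSet[str]           # hypothetical: filters the listener installs
    route_config_name: str


def route_config_valid_alone(rc: RouteConfig) -> bool:
    # Per-resource validation sees only the resource itself.
    return bool(rc.name)


def invalid_pairs(
    listeners: List[Listener], route_configs: List[RouteConfig]
) -> List[Tuple[str, str, FrozenSet[str]]]:
    """Return (listener, route config, missing filters) for each bad combination."""
    by_name = {rc.name: rc for rc in route_configs}
    bad = []
    for lis in listeners:
        rc = by_name.get(lis.route_config_name)
        if rc is None:
            continue  # unresolved reference: a different kind of problem
        missing = rc.required_filters - lis.filters
        if missing:
            bad.append((lis.name, rc.name, frozenset(missing)))
    return bad
```

Here a single route config can pass `route_config_valid_alone` and still show up in `invalid_pairs` for one listener but not another, so the error belongs to the pairing rather than to either resource.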

kyessenov commented 3 weeks ago

@markdroth Yes, that's one case where ACK/NACK does not reflect the overall config consistency. It seems we're using NACKs for dual purposes: a local one, where a NACK represents an individual resource's correctness, and a global one, where it indicates some consistency (or system dependency) issue. The problem is that NACKs cannot do the latter well, and that's what users of xDS assume they should do.

markdroth commented 3 weeks ago

In practice, I think that's a misunderstanding on the part of the user. The current mechanism actually doesn't attempt to reflect overall config health at all; it reflects only validity of individual resources in isolation.

So I think the ask here is for a mechanism to reflect overall config health. I think that would need to be something different than the current ACK/NACK mechanism.

kyessenov commented 3 weeks ago

Agreed, we need a data plane health signal that is not based on CSDS/xDS NACKs. This is very important for things like readiness probes (can traffic be shifted?), policy enforcement (when is my denial policy effective?), and authentication (is my key rotated?), all of which currently build upon NACKs (https://cloud.google.com/service-mesh/docs/service-routing/client-status). It is probably better to use a monitoring solution based on metrics here, and we should provide a recommended schema for xDS clients.
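One rough sketch of what such a metrics-based signal could look like, purely as an assumption: the metric name `xds_resource_active`, its labels, and the `ready` helper are hypothetical and are not an existing Envoy stat or an agreed schema.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class ResourceHealth:
    type_url: str
    name: str
    version: str
    active: bool  # True once this version is in effect on the data plane


def render_metrics(samples: Iterable[ResourceHealth]) -> List[str]:
    """Render one gauge line per resource in a Prometheus-style text form."""
    lines = []
    for s in samples:
        labels = f'type_url="{s.type_url}",name="{s.name}",version="{s.version}"'
        lines.append(f"xds_resource_active{{{labels}}} {1 if s.active else 0}")
    return lines


def ready(
    samples: Iterable[ResourceHealth],
    required: Iterable[Tuple[str, str, str]],
) -> bool:
    """Readiness check: every required (type_url, name, version) is active."""
    active = {(s.type_url, s.name, s.version) for s in samples if s.active}
    return all(r in active for r in required)
```

A readiness probe or a policy rollout check could then be phrased as a query over these gauges (for example, "is version v7 of my denial policy active on all clients?") instead of being derived from NACKs.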