envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0
24.76k stars 4.76k forks source link

Delta SDS incorrectly sends an error response for an empty response #32832

Closed howardjohn closed 4 months ago

howardjohn commented 6 months ago

Title: Delta SDS incorrectly sends an error response for an empty response

Description: Control plan sends resources:[], removed:[some resource]. Envoy rejects with 2024-03-11T23:35:13.641010Z warning envoy config external/envoy/source/extensions/config_subscription/grpc/grpc_subscription_impl.cc:138 gRPC config for type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.Secret rejected: Missing SDS resources for kubernetes://sds-credential in onConfigUpdate() thread=25

This comes from https://github.com/envoyproxy/envoy/blob/708fa7b4d8269372fdac39b11caf2a3bf7b18d53/source/common/secret/sds_api.cc#L157 which only considers SotW

Repro steps: I don't have a trivial Envoy SDS setup for this in pure envoy, but its pretty simple in Istio. Just create something like

apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: echo
spec:
  selector:
    istio: ingressgateway
  servers:
  - port:
      number: 443
      name: https
      protocol: HTTPS
    hosts:
    - "*"
    tls:
      credentialName: sds-credential
      mode: SIMPLE

With istio latest (which uses delta xds now)

nezdolik commented 6 months ago

cc @adisuissa @htuch

adisuissa commented 6 months ago

This seems to be a dup of #24373.

keithmattix commented 6 months ago

I can try to take this if it's not claimed

htuch commented 6 months ago

@keithmattix I'll close this as dupe and assign you https://github.com/envoyproxy/envoy/issues/24373

keithmattix commented 6 months ago

@howardjohn I think the Istio repro in the original description is expected. If the secret is not supplied by the SDS server, then the client should indeed error, no?

howardjohn commented 6 months ago

a NACK from lack of resource is not expected. it should stay warming.

but even if we had previously sent it, any removal is just entirely rejected

keithmattix commented 6 months ago

but even if we had previously sent it, any removal is just entirely rejected

Right, I've confirmed this and am iterating on the linked PR; I'm looking to verify the repro behavior with Istio

a NACK from lack of resource is not expected. it should stay warming.

Ah I see; from the spec (emphasis mine):

Resources names of resources that have be deleted and to be removed from the xDS Client. Removed resources for missing resources can be ignored.

So you're saying the desired behavior is:

  1. if cluster is still warming and secret is removed, the SDS subscription should ignore the removal (NOT NACK)
  2. If cluster is already ready (and serving traffic), the SDS subscription should also ignore that removal because the owning listener/cluster should be updated as well

I agree with 1 but on 2, how can the SDS subscription guarantee that the cluster/listener are being updated? What if the server is saying "this secret no longer exists" (e.g. a cert rotation failed); what should that do to the owning cluster/listener? Does the SDS subscription need to do anything at all (I think this is current behavior)?

howardjohn commented 6 months ago

IMO I think the delete should be ignored entirely -- will try to explain more in a bit or you can look at the recent ecds issue

On Wed, Mar 20, 2024, 8:54 PM Keith Mattix II @.***> wrote:

but even if we had previously sent it, any removal is just entirely rejected

Right, I've confirmed this and am iterating on the linked PR; I'm looking to verify the repro behavior with Istio

a NACK from lack of resource is not expected. it should stay warming.

Ah I see; from the spec https://www.envoyproxy.io/docs/envoy/latest/api-v3/service/discovery/v3/discovery.proto#envoy-v3-api-field-service-discovery-v3-deltadiscoveryresponse-removed-resources (emphasis mine):

Resources names of resources that have be deleted and to be removed from the xDS Client. Removed resources for missing resources can be ignored.

So you're saying the desired behavior is:

  1. if cluster is still warming and secret is removed, the SDS subscription should ignore the removal (NOT NACK)
  2. If cluster is already ready (and serving traffic), the SDS subscription should also ignore that removal because the owning listener/cluster should be updated as well

I agree with 1 but on 2, how can the SDS subscription guarantee that the cluster/listener are being updated? What if the server is saying "this secret no longer exists" (e.g. a cert rotation failed); what should that do to the owning cluster/listener? Does the SDS subscription need to do anything at all (I think this is current behavior)?

— Reply to this email directly, view it on GitHub https://github.com/envoyproxy/envoy/issues/32832#issuecomment-2010505469, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAEYGXIM7SHLFUTTAPOOBW3YZHSOPAVCNFSM6AAAAABERG5OGWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAMJQGUYDKNBWHE . You are receiving this because you were mentioned.Message ID: @.***>

keithmattix commented 6 months ago

I'm familiar with the ecds issue; I'm just not sure which is the more "correct" approach from a protocol perspective. What you're suggesting would mean any removal that isn't LDS or CDS essentially has no effect. From a pure protocol perspective, I would think that an xDS server should be able to indicate that dependent resources are missing/removed and have that mean something

howardjohn commented 6 months ago

@keithmattix IMO there are 2 usable implementations:

  1. There is update and remove semantics (as today), and remove of dependent resources is ignored
  2. There is update, remove, and "requested thing not present" semantics. Remove dependant resource is respected

In (2), we would just never send Remove for dependent types. But if we just do that today, then we cannot tell Envoy a resource it requested is not available, which means it will be warming instead of just failing which is unexpected.

keithmattix commented 6 months ago

Ah I see; unless we xDS adopted (2)'s semantics, we wouldn't be able to express "not available". Fair point; I ended up coding up "delete ignored" in #32961, so I think this will solve Istio's issue

kyessenov commented 6 months ago

Ignoring delete can be dangerous in case of compromised secrets. Does delete trigger an eventual release of the secret? Or will it potentially be used indefinitely after the deletion?

keithmattix commented 6 months ago

Wouldn't that require an update instead of a removal though? Of the cluster is not supposed to use a secret in the interim period before a new secret is available, modify the cluster right?

kyessenov commented 6 months ago

@keithmattix I think it depends. I am mostly dis-agreeing with the base assumption that traffic must not be disrupted at all costs. Sometimes, you do want to break traffic to mitigate an ongoing attack, or for compliance reasons.

keithmattix commented 6 months ago

Sure, but IIUC, not ignoring removal leaves control planes in a tough spot because they can't signal to the data plane that a requested secret is not/available without breaking traffic. If the CP wants to break traffic, it can just remove the secret from the cluster right?

htuch commented 5 months ago

My take here is that semantics today are as pointed out above, we don't support removal of singleton subscriptions (SDS, EDS, etc.) independent of the referencing resource (cluster, listener). To @kyessenov point about dangers, we can either rely on the control plane to enforce the correct behavior during secret removal or make a change in semantics as suggested by @howardjohn. It seems reasoable to rely on control plane updating cluster/listener in these cases? Reopening as this is not resolved.

github-actions[bot] commented 4 months ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions[bot] commented 4 months ago

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.