envoyproxy / envoy


After cluster NACK, reconnection never requests CDS again #34334

Open howardjohn opened 1 month ago

howardjohn commented 1 month ago

Title: After cluster NACK, reconnection never requests CDS again

Description: Originally reported to Istio in https://github.com/istio/istio/issues/50670.

What we are seeing is the following sequence: a cluster pushed over CDS is NACK'd, the connection to the control plane is later reset, and after Envoy reconnects it never sends another CDS request.

Repro steps: https://github.com/istio/istio/issues/50670 has steps with Istio. I tried to reproduce locally with go-control-plane and was unable to, for reasons I don't yet understand.
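For anyone who wants to try the go-control-plane route, a minimal server of the kind I was experimenting with might look like the sketch below. This is illustrative, not the exact code: the node ID, port, and the choice of an EDS cluster without an eds_cluster_config as the NACK trigger are all assumptions.

```go
package main

import (
	"context"
	"net"
	"time"

	cluster "github.com/envoyproxy/go-control-plane/envoy/config/cluster/v3"
	core "github.com/envoyproxy/go-control-plane/envoy/config/core/v3"
	endpoint "github.com/envoyproxy/go-control-plane/envoy/config/endpoint/v3"
	discovery "github.com/envoyproxy/go-control-plane/envoy/service/discovery/v3"
	"github.com/envoyproxy/go-control-plane/pkg/cache/types"
	cachev3 "github.com/envoyproxy/go-control-plane/pkg/cache/v3"
	resource "github.com/envoyproxy/go-control-plane/pkg/resource/v3"
	serverv3 "github.com/envoyproxy/go-control-plane/pkg/server/v3"
	"google.golang.org/grpc"
	"google.golang.org/protobuf/types/known/durationpb"
)

func main() {
	ctx := context.Background()
	snapshotCache := cachev3.NewSnapshotCache(true /* ADS */, cachev3.IDHash{}, nil)

	// A cluster that passes validation: STATIC with a single endpoint.
	good := &cluster.Cluster{
		Name:                 "good",
		ConnectTimeout:       durationpb.New(time.Second),
		ClusterDiscoveryType: &cluster.Cluster_Type{Type: cluster.Cluster_STATIC},
		LoadAssignment: &endpoint.ClusterLoadAssignment{
			ClusterName: "good",
			Endpoints: []*endpoint.LocalityLbEndpoints{{
				LbEndpoints: []*endpoint.LbEndpoint{{
					HostIdentifier: &endpoint.LbEndpoint_Endpoint{Endpoint: &endpoint.Endpoint{
						Address: &core.Address{Address: &core.Address_SocketAddress{
							SocketAddress: &core.SocketAddress{
								Address:       "127.0.0.1",
								PortSpecifier: &core.SocketAddress_PortValue{PortValue: 8080},
							},
						}},
					}},
				}},
			}},
		},
	}
	// An EDS cluster with no eds_cluster_config; Envoy rejects this,
	// so the CDS push containing it should be NACK'd.
	bad := &cluster.Cluster{
		Name:                 "bad",
		ConnectTimeout:       durationpb.New(time.Second),
		ClusterDiscoveryType: &cluster.Cluster_Type{Type: cluster.Cluster_EDS},
	}

	snap, err := cachev3.NewSnapshot("v1", map[resource.Type][]types.Resource{
		resource.ClusterType: {good, bad},
	})
	if err != nil {
		panic(err)
	}
	// "test-node" must match node.id in the Envoy bootstrap.
	if err := snapshotCache.SetSnapshot(ctx, "test-node", snap); err != nil {
		panic(err)
	}

	grpcServer := grpc.NewServer()
	discovery.RegisterAggregatedDiscoveryServiceServer(grpcServer, serverv3.NewServer(ctx, snapshotCache, nil))

	lis, err := net.Listen("tcp", ":18000")
	if err != nil {
		panic(err)
	}
	// Kill and restart this process to simulate the control-plane restart,
	// then watch whether Envoy sends a CDS request on the new stream.
	_ = grpcServer.Serve(lis)
}
```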

Logs: istiod.txt envoy.txt

howardjohn commented 1 month ago

cc @keithmattix, you may be interested

keithmattix commented 1 month ago

Yeah I'll take a look. I wonder if I can /assign

keithmattix commented 1 month ago

/assign @keithmattix

kyessenov commented 1 month ago

CC @adisuissa. This sounds like a reliability risk; config must always be live.

adisuissa commented 1 month ago

Definitely sounds like a bug, albeit a very strange one. On reconnection, Envoy should iterate over all the requested resources (including the wildcard ones), regardless of whether the previous response was NACK'd or ACK'd. @keithmattix thanks for looking into this. I suggest trying to repro using this test and modifying it accordingly.

keithmattix commented 1 month ago

I couldn't reproduce this bug in the ADS integration test @adisuissa linked, I think because the streams are fake (they can't send a GOAWAY). That said, this could still be an issue limited to Istio; I still need to do some investigation.
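For a real-server repro, the GOAWAY itself is easy to force with grpc-go, since GracefulStop sends one. A minimal sketch (the port and timing are arbitrary, and the ADS service from a repro server would be registered where the comment indicates):

```go
package main

import (
	"net"
	"time"

	"google.golang.org/grpc"
)

func main() {
	srv := grpc.NewServer()
	// ...register the ADS/xDS services here...

	lis, err := net.Listen("tcp", ":18000")
	if err != nil {
		panic(err)
	}
	go func() { _ = srv.Serve(lis) }()

	time.Sleep(30 * time.Second)
	// GracefulStop sends GOAWAY to connected clients and then waits for
	// in-flight streams to finish; Stop() would instead tear connections
	// down abruptly with no GOAWAY. This is the reconnect trigger the
	// fake integration-test streams can't produce.
	srv.GracefulStop()
}
```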

keithmattix commented 1 month ago

The test is here if folks want to double check me: https://github.com/keithmattix/envoy/commit/d3c3e48cc3f750c55bb291240e066798a6f11151#diff-527517d4e9ea7abd055b4e4f80e4d2ed3af812e8b983e175f2c3d784a5bb7cbbR582

adisuissa commented 1 month ago

> The test is here if folks want to double check me: keithmattix@d3c3e48#diff-527517d4e9ea7abd055b4e4f80e4d2ed3af812e8b983e175f2c3d784a5bb7cbbR582

Thanks, I think this is a valid test. I do wonder what makes CDS in the original bug so unique (I would've expected LDS to behave the same way). If it is possible to run the test mentioned in the first comment at Envoy's trace log level, that may shed some more light on what's going on.

keithmattix commented 1 month ago

I think the issue here is NACKs that result in permanently warming clusters. I'm still learning the xDS details, but here's my current theory of what's happening:

  1. This assumes that subscription objects are durable across xDS connection resets.
  2. On reconnect, NewGrpcMuxImpl::whoWantsToSendDiscoveryRequest loops through the subscriptions in order (Clusters, Endpoints, etc.) to see which requests to send, picking those where subscriptionUpdatePending returns true.
  3. Within subscriptionUpdatePending, no new names have been added or removed, because the client doesn't need to know about any more clusters. Furthermore, no requests have been sent yet on this mux stream, so we get to L123.
  4. Next, we check whether the requested resource state is empty. This is where my understanding is weakest, but in theory the requested_resource_state for the CDS subscription is still empty, because the cluster was NACK'd and stays permanently warming until another xDS update arrives.
  5. Lastly, L131 returns the value of in_initial_legacy_wildcard_. Again, assuming subscriptions (and their state) are durable across connection resets, this initial legacy wildcard will already have been set to false. Clusters are therefore never requested and are skipped on every run of that loop, so the control plane never receives a CDS request.

I'm only like 30% confident in this and I'm sure there are holes, but I do have a hunch that this has something to do with perpetually warming clusters.
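To make steps 2-5 easier to follow, here's the decision flow as I'm reading it, distilled into Go. The real code is C++ (DeltaSubscriptionState::subscriptionUpdatePending); these field names are my approximations, not Envoy's actual API.

```go
package xds

// subscriptionState approximates the per-type-URL state that (per the
// theory above) survives an xDS stream reset.
type subscriptionState struct {
	namesAdded                []string
	namesRemoved              []string
	anyRequestSentYetInStream bool
	requestedResourceState    map[string]struct{}
	inInitialLegacyWildcard   bool
}

func subscriptionUpdatePending(s subscriptionState) bool {
	// Step 3: nothing was added or removed after reconnect, so this
	// branch doesn't fire.
	if len(s.namesAdded) > 0 || len(s.namesRemoved) > 0 {
		return true
	}
	// Step 3 (cont.): no request has gone out on the new stream yet.
	if !s.anyRequestSentYetInStream {
		// Step 4: the NACK'd, permanently-warming cluster leaves this empty.
		if len(s.requestedResourceState) > 0 {
			return true
		}
		// Step 5: if the subscription survived the reset, this flag is
		// already false, so CDS is never requested again.
		return s.inInitialLegacyWildcard
	}
	return false
}
```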

keithmattix commented 1 month ago

/cc @adisuissa, have you had a chance to take a look at this yet?

howardjohn commented 4 weeks ago

More info -- this maybe needs to be its own issue, but I will start here.

The following occurs in both delta and state-of-the-world (SotW) xDS:

  1. Send 2 CDS resources, 1 valid and 1 invalid.
  2. Proxy NACKs, as expected.
  3. Restart the control plane; the proxy reconnects. It does subscribe to CDS.
  4. Proxy ACKs and marks the update as a success.

This seems wrong. The proxy should NACK and mark it as failed.

Stats at (2):

cluster_manager.cds.update_failure: 0
cluster_manager.cds.update_rejected: 1
cluster_manager.cds.update_success: 0

Stats at (4):

cluster_manager.cds.update_failure: 1
cluster_manager.cds.update_rejected: 1
cluster_manager.cds.update_success: 1
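
For anyone checking these counters themselves: they come from the admin endpoint's /stats, which accepts a filter regex. A trivial poller (sketch; the admin address of 127.0.0.1:15000 is the Istio sidecar default, adjust for your setup):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Fetch only the CDS update counters from the Envoy admin interface.
	resp, err := http.Get("http://127.0.0.1:15000/stats?filter=cluster_manager.cds.update")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Print(string(body)) // update_failure / update_rejected / update_success lines
}
```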

I am a bit confused about why I am now seeing different behavior, even for delta. I reproduced on master and back to Envoy 1.29, so I am pretty confused about why I am seeing this behavior now but saw "never request again" before.