Open howardjohn opened 6 months ago
cc @keithmattix you may be interested
Yeah I'll take a look. I wonder if I can /assign
/assign @keithmattix
CC @adisuissa This sounds like a reliability risk. Config must always be live.
Definitely sounds like a bug, albeit a very strange one... On reconnection Envoy should iterate over all the requested resources (including the wildcards) regardless of the previous one being NACK'd or ACK'd. @keithmattix thanks for looking into this. I suggest trying to repro using this test and modifying it accordingly.
I couldn't reproduce this bug in the ads integration test @adisuissa linked, I think because the streams are fake (can't send a goaway). That being said, this still could be an issue limited to istio; I still need to do some investigation
The test is here if folks want to double check me: https://github.com/keithmattix/envoy/commit/d3c3e48cc3f750c55bb291240e066798a6f11151#diff-527517d4e9ea7abd055b4e4f80e4d2ed3af812e8b983e175f2c3d784a5bb7cbbR582
The test is here if folks want to double check me: keithmattix@d3c3e48#diff-527517d4e9ea7abd055b4e4f80e4d2ed3af812e8b983e175f2c3d784a5bb7cbbR582
Thanks, I think this is a valid test. I do wonder what makes the CDS in the original bug so unique (I would've expected LDS to behave the same). If it is possible to run the test mentioned in the first comment with Envoy's trace log level, that may shed some more light on what's going on.
I think the issue here is NACKs that result in permanently warming clusters. I'm still learning the xDS details, but here's my current theory of what's happening:
NewGrpcMuxImpl::whoWantsToSendDiscoveryRequest
loops through the subscription order (Clusters, Endpoints, etc.) to see which requests to send, picking requests where subscriptionUpdatePending returns true.requested_resource_state
for the CDS subscription is still empty because the cluster is NACK'd and is permanently warming until another xDS update occurs. in_initial_legacy_wildcard_
. Again, assuming subscriptions (and their state) are durable across connection resets, this initial legacy wildcard would get set to false. Therefore, clusters are never requested and are skipped each time that loop is run, therefore causing the control plane to never receive a CDS request.I'm only like 30% confident in this and I'm sure there are holes, but I do have a hunch that this has something to do with perpetually warming clusters.
/cc @adisuissa have you had a chance to take a look at this yet?
More info -- maybe needs to be its own issue, but I will start here.
The following occurs in both delta and normal XDS:
This seems wrong. The proxy should NACK and mark as failed
Stats at (2):
cluster_manager.cds.update_failure: 0
cluster_manager.cds.update_rejected: 1
cluster_manager.cds.update_success: 0
Stats at (4):
cluster_manager.cds.update_failure: 1
cluster_manager.cds.update_rejected: 1
cluster_manager.cds.update_success: 1
I am a bit confused how I am now seeing different behavior even for delta. I reproduced on master back to Envoy 1.29 so I am pretty confused how I am this now but "never request again" before
Title: After cluster NACK, reconnection never requests CDS again
Description: Originally reported to Istio in: https://github.com/istio/istio/issues/50670.
What we are seeing is a sequence:
Repro steps: https://github.com/istio/istio/issues/50670 has steps with Istio. I tried to reproduce locally with go-control-plane and was unable to for some reason
Logs: istiod.txt envoy.txt