envoyproxy / go-control-plane

Go implementation of data-plane-api
Apache License 2.0
1.53k stars 515 forks source link

Order of responses is not guaranteed on ADS stream #218

Closed gnz00 closed 3 years ago

gnz00 commented 5 years ago

It is impossible in the current implementation to ensure the order of responses on an ADS stream. I believe it is due to the un-ordered nature of select in server.go: https://github.com/envoyproxy/go-control-plane/blob/master/pkg/server/server.go#L198-L248.

Is there a known solution to this? Currently, I'm introducing an artificial delay to ensure the select sends the responses in the proper order. I would like to have the server buffer a batch of responses and transactionally send them, waiting for an ACK between each response. This should be doable using the Cache and Callbacks implementations, but there might be a better way to do this through a callback mechanism in the server implementation.

kyessenov commented 5 years ago

This is mostly because of the order of requests is not guaranteed. E.g. if listeners have 5 route tables, then Envoy is going to ask for 1,2,3,4, and 5 tables in a sequence.

The plan originally was to make each snapshot transactionally consistent, wait until the entire snapshot is acked, and then make progress towards the next snapshot. This also allows safe rollouts since you can introduce a union snapshot in the middle than can be backtracked safely: e.g. snapshot S1, then S1 union S2, then S2. It also supports the incremental xDS which hopefully becomes a first class xDS in Envoy.

gnz00 commented 5 years ago

It was my understanding that the purpose of ADS was to allow for consistent ordering of responses:

https://www.envoyproxy.io/docs/envoy/latest/api-docs/xds_protocol#eventual-consistency-considerations

I agree that ordering should be scoped to a particular version, where version is a global version representing the underlying service discovery update, such that you can implement transactions. I haven’t thought much about unioning, but seems that also wouldn’t be supported in the existing implementation.

Is there work in progress already? I see a few original issues in GitHub from 2018 for this. Seems to me we could replace the existing select logic with a single channel that receives a global version of an update and invokes 4 callbacks to get each corresponding resource for that global version. Then the resources can be validated and sent in order as long as they are consistent.

Is incremental being implemented as well?

On Aug 22, 2019, at 3:45 PM, Kuat notifications@github.com wrote:

This is mostly because of the order of requests is not guaranteed. E.g. if listeners have 5 route tables, then Envoy is going to ask for 1,2,3,4, and 5 tables in a sequence.

The plan originally was to make each snapshot transactionally consistent, wait until the entire snapshot is acked, and then make progress towards the next snapshot. This also allows safe rollouts since you can introduce a union snapshot in the middle than can be backtracked safely: e.g. snapshot S1, then S1 union S2, then S2. It also supports the incremental xDS which hopefully becomes a first class xDS in Envoy.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or mute the thread.

kyessenov commented 5 years ago

Incremental is still being implemented, and it's probably usable. I haven't seen a production use of incremental xDS though, so please confirm with envoy developers whether it's ok to use.

ADS is a mechanism to enable transactional config update, but it doesn't give it for free. All it does is serialize all updates on a single pipe, and it's up to the server to send config so that it can support rollbacks. The union idea I mentioned is one such approach, since CDS/LDS are total sets and removing elements during the update from the set will cause a brief down-time (404s). So you have to have a state that has both old and new elements in the middle.

The transactional config update is not implemented in this repo. It's mostly due to lack of hands, and the perceived complexity of the solution on the management server which would have to coordinate all xDSes. There's been some improvements in Envoy-side to provide "warming" of clusters and listeners, so that config is stalled until all dependent resources are provided (search for warming in xDS). This enables an eventually consistent model, but also opens a possibility of locking the proxies that ask for deleted RDS/EDS.

kyessenov commented 5 years ago

Incremental xDS has finally merged into envoy, so it should be possible to push delta updates once that gets implemented here.

gnz00 commented 5 years ago

Thanks for the update! We have our own control plane that does some ordering of responses that we’re thinking about open sourcing. Guess it’s time to implement Delta Discovery service too

On Wed, Oct 2, 2019 at 3:10 PM Kuat notifications@github.com wrote:

Incremental xDS has finally merged into envoy, so it should be possible to push delta updates once that gets implemented here.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/envoyproxy/go-control-plane/issues/218?email_source=notifications&email_token=AAW6VM2CVMXI7ON7D7E2WS3QMT5Z7A5CNFSM4IOQORD2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEAGAO4I#issuecomment-537659249, or mute the thread https://github.com/notifications/unsubscribe-auth/AAW6VM5HX3SBOOUZJRQ7K3LQMT5Z7ANCNFSM4IOQORDQ .

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions[bot] commented 3 years ago

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.

AmitKatyal-Sophos commented 2 years ago

@gnz00 Could you please share the details of your control plane repo which handles the ordering.

gnz00 commented 2 years ago

@gnz00 Could you please share the details of your control plane repo which handles the ordering.

My apologies, I no longer have access to the source. It wasn't very well abstracted from our service discovery mechanism. I believe with Delta support you might not need to use ADS any more - or you could implement the single channel mechanism as described above.