envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0
24.67k stars 4.75k forks source link

unified format API of stateful session header/cookie value #30678

Open wbpcode opened 10 months ago

wbpcode commented 10 months ago

Title: format API of stateful session header/cookie value

Description: Envoy added new proto format session cookie in the https://github.com/envoyproxy/envoy/pull/26761. And gPRC also add a new format for the weighted clusters session support at https://github.com/grpc/proposal/blob/master/A60-xds-stateful-session-affinity-weighted-clusters.md

At the initial design of stateful session filter, I just want provide a simple way to override the upstream load balancing and I didn't thought it would be used in more complex scenarios. So, only simple address string is used in the session cookie in the initial design.

Considering latest changes to the stateful session filter (in both Envoy and gRPC), I think we should give a formal discussion about following questions:

  1. should a unified format be used for session header/cookie?
  2. should the format be part of our standard xDS API?

[optional Relevant Links:]

Any extra documentation required to understand the issue.

wbpcode commented 10 months ago

cc @cpakulski cc @markdroth

markdroth commented 10 months ago

For context, see discussion in https://github.com/envoyproxy/envoy/pull/30573#discussion_r1375666824.

markdroth commented 10 months ago

A couple of follow-ups to our earlier discussion, just to record them here:

First, I would like to make sure that going forward, we try to keep Envoy and gRPC in the loop on changes to this mechanism so that things stay in sync. We (gRPC) did talk with @adisuissa as we designed gRFC A60, and I believe his intent was to implement this feature in Envoy at some point, since I think we have some customers that are going to want that feature in Envoy. I guess we should have disseminated the info a little more broadly, though, and I'll try to make sure we do that in the future. I'd appreciate it if Envoy could try to keep us gRPC folks keep in the loop on changes to this mechanism in the future as well.

Second, to follow up on my question from the earlier discussion:

[gRPC] had intentionally implemented the same cookie format as Envoy, just to have consistent implementations. Now the two implementations are different. We'll need to consider whether or not this is important.

We've discussed this on our end, and this probably isn't a big deal, since the main case where we'd need interoperability is for service mesh customers that are switching between sidecar and proxyless gRPC, and since the state is stored on the client, none of the state would be retained in that kind of client restart anyway.

So I think the main issue here is the tension between the desire for clients to generate the initial session state directly vs. the ability for us to change the format of the session state as we add features.

wbpcode commented 10 months ago

We (gRPC) did talk with @adisuissa as we designed gRFC A60, and I believe his intent was to implement this feature in Envoy at some point, since I think we have some customers that are going to want that feature in Envoy.

Agree. Although it wouldn't be a simple PR because it's a little more complicated to override cluster in current Envoy code base.

Rather than proto, I prefer the format used by the gRPC. It's easy to read and friendly to test/mock by http clients (like curl) without proto. And it's completely compatible to the initial simple format. And it's also extendable.

206.12.3.6:8080;cluster:cloud-internal-istio:cloud_mp_635862331669_807571734533927564
markdroth commented 10 months ago

I agree that the non-protobuf format is easier to understand and debug, as well as being more performant. And as I mentioned in the earlier discussion, I really don't think the data plane should be responsible for working around a bug in the client application, so I would be in favor of reverting #26761 and not trying to fix #22475.

I should also mention that gRPC recently made another change to the session state format as part of adding dualstack endpoint support, as described in gRFC A61 (design still pending final approval, but already implemented in gRPC C-core). The idea of this design is that each endpoint can have multiple IP addresses, and for the purposes of SSA, we want to consider any of those endpoints to be the same, so we need to encode the list of all IP addresses in the session state. We discussed this with @adisuissa and @pradeepcrao, who are working on the Envoy implementation (initial code was in #27881), although I suspect that they haven't yet implemented the SSA piece of this. There is an Envoy-side design for this feature as well. The new cookie format in gRPC is a comma-delimited list of addresses, followed by a semicolon, and then the cluster name:

ip1:port,ip2:port,...;cluster

In any case, I think the core question here is still about whether we want to support the session state format as a public API. If we don't, then it really doesn't matter if gRPC and Envoy have the same session state format or not. But if we do, then we need to be consistent and provide some backward-compatibility guarantees.

adisuissa commented 10 months ago

As @markdroth there is going to be support for two known use-cases:

  1. Adding cluster-names - supports pinning to a specific cluster - such as weighted clusters.
  2. Dual/Multiple IP addresses per endpoint. and I'm assuming there will be other features that will need to be supported in the future.

In any case, I think the core question here is still about whether we want to support the session state format as a public API. If we don't, then it really doesn't matter if gRPC and Envoy have the same session state format or not. But if we do, then we need to be consistent and provide some backward-compatibility guarantees.

Indeed this is the core question. What's the value of having the same format - IMHO it allows some requests of a session to be proxied through an Envoy and some through a gRPC and expect seeing the same session affinity. But this is not a valid case IMO. The cost of having a unified format is much higher (for example, how does one handle different versions of proxies and such). If there were some known header with an RFC that describes the format, I'd probably be in favor of supporting that, but I am not aware of such.

It should be clear that the SSA is meant for a proxy to set some value in a cookie/header, and that proxy will be the one to parse that value. If the idea is to support downstream-provided header (e.g., x-envoy-prefer-host-endpoint) to change the load-balancing algorithms (no matter if a session is required), then that feature should be explicitly added by a proxy.

cpakulski commented 10 months ago

There are some very important points above which we should address separately:

I am afraid that such header must more or less follow Envoy's internal logic, so it must encode cluster to address weighted cluster issue, so eventually it will probably end up in exact form as internally generated cookie.

github-actions[bot] commented 9 months ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

markdroth commented 8 months ago

We recently realized that there's another small ambiguity in the SSA semantics: Is the TTL meant to indicate (1) the max length of the entire session or (2) the max time between requests within the session? The current behavior in Envoy is to not update the cookie unless the upstream host changes, which implies (1). Unfortunately, in the gRPC dualstack design (now finalized in gRFC A61; see the SSA section), we made it such that whenever we switch from one address to another address for the same endpoint, we will update the cookie, thus resetting the TTL, and that behavior implies (2). There are some performance implications to which way we go here, so we should carefully consider which behavior we actually want.

github-actions[bot] commented 7 months ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

costinm commented 7 months ago

We have 2 bug reports from Istio users about this - the public docs for the filter are saying 'ttl=0 means session cookie' ( i.e. last until the browser tab is closed, not persisted ). When used in browsers, session cookies are exempt from GDPR ( AFAIK) so it's very important to support them in browser.

I have no strong opinions about the format of the cookie - except that in ingress use cases it should be encrypted/signed for privacy/security reasons. But the behavior around expiration and 'session' should not diverge from common practices - including renewal of the cookie when it is used while still valid.