envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0

Support for resource readiness feedback in XDS #5174

Closed ivitjuk closed 5 years ago

ivitjuk commented 5 years ago

Title: Support for resource readiness feedback in XDS.

Description: We are experiencing temporary 503 responses from Envoy when adding a new route. We managed to link those to a scenario where the new route also introduces a new cluster. When the new cluster is added it enters the warming state. As we send the routes immediately after sending clusters, there is a short period of time when requests are being routed to the cluster that is still in the warming state. As a consequence, 503 responses are returned.

We understand that we don't respect the following requirement from [1]: "management plane must ensure that clusters referenced by a route are in place, before pushing the updates for a route." However, it's not clear to us what the correct protocol is for determining whether "clusters referenced by a route are in place".

In our experiments, it looks like XDS ACKs to cluster update messages were sent from Envoy before the cluster warming period was done, making them unsuitable for detecting cluster readiness. Can we get a confirmation or refutation of this behavior?

If this is indeed the case, we would like to propose an extension of the XDS protocol to support a readiness feedback for the resources that require warming up (e.g. CDS and LDS).

[1] https://github.com/envoyproxy/data-plane-api/blob/master/XDS_PROTOCOL.md#eventual-consistency-considerations

mattklein123 commented 5 years ago

@ivitjuk this sounds like a useful addition for the ACK process if you want to make a proposal. cc @htuch

htuch commented 5 years ago

A couple of things:

  1. If you sequence updates as [CDS, EDS, RDS] per https://github.com/envoyproxy/data-plane-api/blob/master/XDS_PROTOCOL.md#eventual-consistency-considerations, and everything ACKs, the clusters should be warm by the time the RDS update arrives. In practice this really requires ADS or cross-management-server coordination (a short sketch of this sequencing follows after this list).

  2. We could make an extension to allow resource status to be queried. I think we would want to build this into the incremental xDS variant, as it has better support for fine-grained resource addressing; a limitation of the current state-of-the-world updates is that status such as an ACK applies to an entire update, not to a specific resource.
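
For concreteness, here is a minimal Go sketch of the control-plane sequencing described in point 1. The Pusher interface and its method names are hypothetical stand-ins for whatever a management server uses to send state-of-the-world updates; this is a sketch under those assumptions, not a real go-control-plane API.

package controlplane

// Pusher abstracts a hypothetical ADS stream to a single Envoy; the method
// names are illustrative only.
type Pusher interface {
    PushCDS() error // send state-of-the-world clusters
    PushEDS() error // send endpoints for those clusters
    PushRDS() error // send routes referencing those clusters
}

// pushRouteWithNewCluster sequences a route addition as [CDS, EDS, RDS], per
// the eventual-consistency guidance referenced above.
func pushRouteWithNewCluster(p Pusher) error {
    // 1. Push the new cluster first; Envoy starts warming it on receipt.
    if err := p.PushCDS(); err != nil {
        return err
    }
    // 2. Push endpoints so the warming cluster can obtain its initial hosts.
    if err := p.PushEDS(); err != nil {
        return err
    }
    // 3. Only then push the routes that reference the new cluster.
    return p.PushRDS()
}

As the rest of the thread shows, for cluster types such as LOGICAL_DNS that warm asynchronously, this sequencing alone is not sufficient.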

ivitjuk commented 5 years ago

@htuch in point 1, when you say "...the clusters should be warm by the time the RDS update arrives", is this because there is explicit control in Envoy that ensures it, or do we just opportunistically rely on timing working in our favor?

The reason I'm asking is that we do have ADS, which tries to send resources in order, and from what I see in the Envoy logs, the order was correct. Nevertheless, it doesn't seem like Envoy ensures clusters are ready by the time it starts applying routes that depend on them.

Take a look at the following excerpt from an Envoy log. It's a single update cycle triggered by a route addition. The added route depends on a new cluster, so a CDS update was sent too. The order in the log seems to be correct (CDS, LDS, and finally RDS):

$ grep -E "starting warming|warming cluster.*complete|:status.*503|unknown cluster|Received gRPC message for type" /tmp/bad_envoy_log.txt | grep -A1 "2018-11-30 17:58:20" | cat -n
     1  [2018-11-30 17:58:20.333][22][debug][upstream] source/common/config/grpc_mux_impl.cc:168] Received gRPC message for type.googleapis.com/envoy.api.v2.Cluster at version 280
     2  [2018-11-30 17:58:20.337][22][info][upstream] source/common/upstream/cluster_manager_impl.cc:500] add/update cluster cds|egress|test_mesh|front_node|mid_node_b_2|http|8022|9 starting warming
     3  [2018-11-30 17:58:20.338][22][debug][upstream] source/common/config/grpc_mux_impl.cc:168] Received gRPC message for type.googleapis.com/envoy.api.v2.Listener at version 280
     4  [2018-11-30 17:58:20.350][22][debug][upstream] source/common/config/grpc_mux_impl.cc:168] Received gRPC message for type.googleapis.com/envoy.api.v2.RouteConfiguration at version 280
     5  [2018-11-30 17:58:20.401][66][debug][router] source/common/router/router.cc:241] [C26410][S12949414697162957557] unknown cluster 'cds|egress|test_mesh|front_node|mid_node_b_2|http|8022|9'
     6  ':status', '503'
     7  [2018-11-30 17:58:20.494][22][info][upstream] source/common/upstream/cluster_manager_impl.cc:512] warming cluster cds|egress|test_mesh|front_node|mid_node_b_2|http|8022|9 complete
     8  [2018-11-30 17:58:27.360][22][debug][upstream] source/common/config/grpc_mux_impl.cc:168] Received gRPC message for type.googleapis.com/envoy.api.v2.Cluster at version 281
htuch commented 5 years ago

@ivitjuk your log shows [CDS, LDS, RDS]; where is EDS? You will need an EDS update for the warming to complete. I think you need [CDS, EDS, RDS] to make this work.

ivitjuk commented 5 years ago

This specific setup doesn't use EDS; it uses LOGICAL_DNS. The following is the CDS config corresponding to the logs I pasted earlier:

name: "cds|egress|test_mesh|front_node|mid_node_b_2|http|8022|9"                                     
type: LOGICAL_DNS                                                                                      
connect_timeout {                                                                                      
  seconds: 1                                                                                           
}                                                                                                      
http2_protocol_options {                                                                               
}                                                                                                      
load_assignment {                                                                                      
  cluster_name: "cds|egress|test_mesh|front_node|mid_node_b_2|http|8022|9"                             
  endpoints {                                                                                          
    lb_endpoints {                                                                                     
      endpoint {                                                                                       
        address {                                                                                      
          socket_address {                                                                             
            address: "mid_b_2.test_mesh.local"                                                         
            port_value: 8022                                                                           
          }                                                                                            
        }                                                                                              
      }                                                                                                
    }                                                                                                  
  }                                                                                                    
}                                                                                                      
htuch commented 5 years ago

@ivitjuk yeah, logical DNS is going to take arbitrary time to warm, so sequencing alone won't cut it.

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

htuch commented 5 years ago

@ivitjuk do you want to try to extend the incremental xDS design to accommodate this? See https://github.com/envoyproxy/envoy/blob/master/api/XDS_PROTOCOL.md#incremental-xds and https://github.com/envoyproxy/envoy/blob/7c5badba34e8e4e5115b81776bafc7bcaa3a5834/api/envoy/api/v2/discovery.proto#L99.

ivitjuk commented 5 years ago

@htuch we need to fix the 503 issue we are seeing, and we will need a deeper dive into incremental xDS to understand whether it's a feasible approach for us. We would appreciate it if we could keep this issue open until we get this properly prioritized. I created a bug fix on our side in https://github.com/awslabs/aws-app-mesh-examples/issues/64

htuch commented 5 years ago

@ivitjuk Ack. Incremental xDS is going to play out over the next quarter or so, so I think you'll need an interim workaround. In general, we don't have a good story for better-than-eventual consistency if you use logical DNS rather than EDS, as things stand today.

As a super hack, you could poll the admin endpoint and look at /clusters to see what has been resolved, but I don't recommend this as a production solution; the admin endpoint is not designed to play a role as a sequencing step in the control plane.
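
To make the hack concrete, below is a rough Go sketch of such a poller. The admin address, the 200 ms poll interval, and the exact /clusters text format are assumptions; treat this as an illustration of the workaround rather than a recommended mechanism.

package controlplane

import (
    "fmt"
    "io"
    "net/http"
    "strings"
    "time"
)

// waitForCluster polls the Envoy admin /clusters endpoint until the named
// cluster shows up in the output or the timeout expires. The check is coarse:
// it only verifies the cluster is reported by the admin interface, not that
// its hosts are resolved or healthy.
func waitForCluster(adminAddr, cluster string, timeout time.Duration) error {
    deadline := time.Now().Add(timeout)
    for time.Now().Before(deadline) {
        resp, err := http.Get(fmt.Sprintf("http://%s/clusters", adminAddr))
        if err == nil {
            body, _ := io.ReadAll(resp.Body)
            resp.Body.Close()
            // The plain-text output prefixes each line with "<cluster>::",
            // so a match means the cluster is at least known to Envoy.
            if strings.Contains(string(body), cluster+"::") {
                return nil
            }
        }
        time.Sleep(200 * time.Millisecond)
    }
    return fmt.Errorf("cluster %q not visible in /clusters after %s", cluster, timeout)
}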

We can keep the issue open; the question really is what we want to do short term vs. building this into incremental xDS. Maybe we should have some service for this in any case? Our general approach so far, e.g. with ADS, has been to try to find low-cost solutions for better-than-eventual consistency, since Envoy is mostly geared towards eventual-consistency use cases today.

ivitjuk commented 5 years ago

@htuch Before listing the ideas that came to my mind, let me first lay out the problem we are trying to address in a single sentence:

When it receives a new route, Envoy should only start matching the route once the cluster the route depends on becomes ready.

Solutions that I had in mind are:

  1. Modify Envoy to internally take care of this constraint. Looking at the provided logs, it seems to me that Envoy has all the state needed for that.
  2. Change the xDS protocol to send CDS ACKs only when a cluster becomes warm. ADS would then have to make sure RDS is sent to Envoy only after the CDS ACK is received from Envoy. Looking at the logs, it seems like the ACK is today used only as an acknowledgement of message reception.
  3. Introduce some other channel in the xDS protocol for sending resource readiness state back to the ADS server. This could be an option if changing the ACK semantics is not desirable. Similarly to 2., ADS would have to wait for the CDS readiness notification to arrive before sending RDS to Envoy.

Option 1. sounds like the simplest approach to me. I can also see there is a similar validate_clusters option in the route config. However, it is only concerned with non-existent clusters. Could we extend this to non-ready clusters too? I think this new rule should only be applied when a route is initially added; 503 should still be returned later if the cluster transitions from the ready to the non-ready state.

Regarding your suggestion to extend the incremental xDS design, I believe that would require incorporating item 2. into the design in some way.

mattklein123 commented 5 years ago

1) is "easy" if you are OK with the RDS update failing (with a hopefully useful error message). Meaning, we can definitely check if a cluster is both ready and warm I think pretty easily, and fail if not, with a new validate type. Is this acceptable to you?

2) I think this also does have value, and would not require incremental xDS. I could see general value in this, for listeners also.

3) Agree this is possible, but it seems overkill to me and I think either 1 or 2 would work (or both).

@htuch thoughts?

htuch commented 5 years ago

I think (1) is the best option here, if @ivitjuk is good with the existing definition of warm; specifically it doesn't say that we have a valid set of hosts loaded by EDS, simply that one EDS update has occurred. We might have a warm cluster with an empty set of hosts returned by the server, or a set of hosts that are all unhealthy. Does this work?

ivitjuk commented 5 years ago

@mattklein123 in option 1., instead of returning an error message, I was actually thinking about accepting the RDS and delaying matching until the cluster warms up. That makes more sense to me because the fact that the cluster is in a warming-up state makes its status uncertain (neither healthy nor unhealthy), so we wait for that uncertainty to be resolved before starting to match the dependent RDS.

Regarding option 1. vs. 2., would it make sense to implement 1. first and then work on 2.? I think 1. would unblock us faster since it would not require any change on the ADS side. Then we could work on option 2.

@htuch That definition of warm is perfectly acceptable. In addition to EDS, would the same also apply to LOGICAL_DNS? LOGICAL_DNS led to this behavior in the first place.

mattklein123 commented 5 years ago

@mattklein123 in option 1., instead of returning an error message, I was actually thinking about accepting the RDS and delaying matching until the cluster warms up. That makes more sense to me because the fact that the cluster is in a warming-up state makes its status uncertain (neither healthy nor unhealthy), so we wait for that uncertainty to be resolved before starting to match the dependent RDS.

I think when you go to implement this you will realize this is going to be quite complicated, and I'm not sure the complexity is worth it. If this is the behavior you want, personally I would recommend doing (2). Granted, I haven't thought about (1) very much so it's possible I'm missing something, so feel free to sketch out an implementation if you think it's not that hard.

htuch commented 5 years ago

@ivitjuk yes, it applies to logical DNS as well, see https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/cluster_manager.html?highlight=warming#cluster-warming for the definition of cluster warming.

ivitjuk commented 5 years ago

@mattklein123 I had a look at the route management code, and I see what you meant when you said that option (1), accepting and holding RDS until CDS becomes warm, would not be simple. It would involve either weaving the ConnectionManager instance through the RouteMatcher, VirtualHost*, and RouteEntry* so they can check CDS warmth status, or changing the routing architecture to accept CDS warmth notifications from the ConnectionManager and update its internal routing data structures. Neither of these seems simple.

So your suggestion for option (1), returning an error code on an RDS update when CDS is not ready, makes sense.

The major difference between (1) and (2) is then the way ADS is implemented. Option (1) would essentially be a polling mechanism where ADS retries the RDS update until it succeeds or some retry limit is reached. Option (2) is a notification mechanism where Envoy sends the ACK when CDS is warm, after which the RDS update would follow.

I prefer the notification mechanism, hence option (2), but I would have to evaluate its implementation complexity before we make a final decision between (1) and (2).
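
To clarify the comparison, here is a rough Go sketch of the two control-plane loops. The ControlPlane interface and its methods are hypothetical, not a real go-control-plane API; they only stand in for "send an update" and "wait for the corresponding ACK".

package controlplane

import "time"

// ControlPlane abstracts a hypothetical ADS connection to one Envoy.
type ControlPlane interface {
    PushCDS() error     // send state-of-the-world clusters
    PushRDS() error     // send routes; assumed to return an error on NACK
    AwaitCDSAck() error // block until Envoy ACKs the last CDS push
}

// Option (1), polling: the RDS update is rejected while the referenced
// cluster is cold, so the server retries until it is accepted or the retry
// budget runs out.
func pushRoutesWithRetry(cp ControlPlane, retries int, backoff time.Duration) error {
    var err error
    for i := 0; i < retries; i++ {
        if err = cp.PushRDS(); err == nil {
            return nil
        }
        time.Sleep(backoff)
    }
    return err
}

// Option (2), notification: Envoy defers the CDS ACK until the new clusters
// are warm, so the server simply waits for that ACK before pushing the
// dependent routes.
func pushRoutesAfterWarmAck(cp ControlPlane) error {
    if err := cp.PushCDS(); err != nil {
        return err
    }
    if err := cp.AwaitCDSAck(); err != nil {
        return err
    }
    return cp.PushRDS()
}

With (2) the only server-side change is treating the CDS ACK as a warmth signal, whereas (1) leaves the burden of convergence (and the choice of retry limits and backoff) entirely on the server.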

htuch commented 5 years ago

I think you should just do (2). You don't really need to change the xDS wire protocol to do this; it's sufficiently underspecified, and the code is essentially on the CDS side. Since we already have warming support for CDS and LDS, just deferring the ACK seems not too hard to add.

tony-shi commented 5 years ago

We have experienced the same issue. We have configured a custom health checker whose first-round health check completion time varies from 10 ms to 30 ms depending on where the endpoints are located relative to the Envoy host. For now, (2) seems promising and easy to implement for our customized Envoy. By the way, I'd like to ask: will this be designed as a fundamental, implied part of the xDS protocol, or as an optional, configurable feature in the cluster configuration in the future?

ivitjuk commented 5 years ago

We decided to go with option (2) and will try to pull it into the next sprint.

@tony-shi It seems to me that we don't need a configuration option to enable this feature, as it should not break or significantly change the behavior of existing xDS server implementations. @htuch do you agree?

htuch commented 5 years ago

@ivitjuk yep; this should be transparent. Since Envoy has always had the freedom to delay an ACK by an arbitrary amount of time, correctly implemented xDS management servers should not be sensitive to this, and we're strengthening rather than weakening consistency.

htuch commented 5 years ago

There's also work in the MCP protocol, a sister protocol of xDS, which has first-class resource status reporting support. I actually think this would be a more systematic approach. CC @costinm @rshriram @ozevren (not sure what Jason's GH ID is...).

tony-shi commented 5 years ago

Do you guys mind attaching an MCP doc link to this issue? Option (2) seems to be diverging from first-class status reporting support.

costinm commented 5 years ago

@ayj (Jason)

Option 2 seems reasonable short term; we'll need to make small adjustments in Istio, since right now we don't wait for the ACK, we just use it for metrics. Any estimate on the schedule? I've been working on this but am not expecting to get it into Istio 1.1. It would be great to have a flag or some way to control it, even if it seems safe.

ayj commented 5 years ago

See the MCP Enhancement proposal for an overview. You'll need to join the istio-team-drive-access@ mailing list to access it.

rshriram commented 5 years ago

FWIW, there is already a validate_clusters option in the route configuration, so you have option 1 implemented in Envoy (if I read this long thread properly). Option 2 has some drawbacks to me: an EDS update being received or DNS being resolved does not imply they were successful or that one or more hosts were obtained.

ivitjuk commented 5 years ago

I was able to start working on option (2) yesterday. So far I have a sketch implementation on which I would like to get early feedback before I continue pursuing this idea further.

I put the implementation and testing details in the commit message. Let me know if that makes sense.

ivitjuk commented 5 years ago

@htuch can you check the POC I posted in the previous message? My concern with that approach is that pausing the whole of CDS might have a negative impact on resource update latency. If that's the case, it seems to me that api_state_ in GrpcMuxImpl would need to become aware of the in-flight updates so we don't process them in queueDiscoveryRequest().

htuch commented 5 years ago

@ivitjuk I had a look; what you are doing seems reasonable to me. I'm not sure whether modifying GrpcMuxImpl would change latency, but maybe I'm missing something. I'd be interested in seeing tests that validate things like two CDS updates arriving back-to-back with only partial warming completed after the first. So far we have only used pause/resume synchronously, so using them asynchronously is going to make reasoning here a bit trickier.

htuch commented 5 years ago

@ivitjuk also, you might be interested in tracking where https://github.com/envoyproxy/envoy/issues/5168#issuecomment-457985329 is heading. It's not the same as this issue, but there is common ground on cluster warming and possible changes to GrpcMuxImpl. CC @ramaraochavali

ivitjuk commented 5 years ago

Published a PR: #5887

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

htuch commented 5 years ago

I'm going to close this out based on #5887 merging; the more general problem remains, but let's reopen when someone wants to drive that work.

jwm commented 5 years ago

We came across this problem with Ambassador. We're doing blue/green deploys by creating a new namespace for each deploy, which introduces a new Cluster each time.

Ambassador's Ambex ADS service feeds the clusters, then listeners. When Envoy receives the listeners, the corresponding cluster isn't warm yet, causing 503s from attempts to route requests to the unknown cluster.

For now, we're using weights to work around this. We create the Services in the new namespace with a weight of 0, wait an arbitrary amount of time, then set the weight to 100. I'll give endpoint routing a shot if I have a chance.

Jul 1 14:09:06 ambassador-pqz4d ambassador debug [config] [source/common/config/grpc_mux_impl.cc:120] Received gRPC message for type.googleapis.com/envoy.api.v2.Cluster at version v2490
Jul 1 14:09:06 ambassador-pqz4d ambassador debug [config] [source/common/config/grpc_mux_impl.cc:97] Pausing discovery requests for type.googleapis.com/envoy.api.v2.ClusterLoadAssignment
Jul 1 14:09:06 ambassador-pqz4d ambassador 2019/07/01 18:09:06 notify: sh /ambassador/post_watt.sh http://localhost:8002/snapshots/2491
Jul 1 14:09:06 ambassador-pqz4d ambassador debug [config] [source/common/config/grpc_mux_impl.cc:97] Pausing discovery requests for type.googleapis.com/envoy.api.v2.Cluster
Jul 1 14:09:06 ambassador-pqz4d ambassador debug [config] [source/common/config/grpc_mux_subscription_impl.cc:61] gRPC config for type.googleapis.com/envoy.api.v2.Cluster accepted with 8 resources with version v2490
Jul 1 14:09:06 ambassador-pqz4d ambassador debug [config] [source/common/config/grpc_mux_impl.cc:120] Received gRPC message for type.googleapis.com/envoy.api.v2.Listener at version v2490
Jul 1 14:09:06 ambassador-pqz4d ambassador debug [config] [source/common/config/grpc_mux_impl.cc:97] Pausing discovery requests for type.googleapis.com/envoy.api.v2.RouteConfiguration
Jul 1 14:09:06 ambassador-pqz4d ambassador warning [config] [source/common/config/grpc_mux_subscription_impl.cc:72] gRPC config for type.googleapis.com/envoy.api.v2.Listener rejected: Error adding/updating listener(s) ambassador-listener-8443: route: unknown weighted cluster 'cluster_https___graphql_orinoco_05cd6e61-0'
Jul 1 14:09:06 ambassador-pqz4d ambassador debug [config] [source/common/config/grpc_mux_impl.cc:120] Received gRPC message for type.googleapis.com/envoy.api.v2.Listener at version v2490
Jul 1 14:09:06 ambassador-pqz4d ambassador debug [config] [source/common/config/grpc_mux_impl.cc:97] Pausing discovery requests for type.googleapis.com/envoy.api.v2.RouteConfiguration
Jul 1 14:09:06 ambassador-pqz4d ambassador warning [config] [source/common/config/grpc_mux_subscription_impl.cc:72] gRPC config for type.googleapis.com/envoy.api.v2.Listener rejected: Error adding/updating listener(s) ambassador-listener-8443: route: unknown weighted cluster 'cluster_https___graphql_orinoco_05cd6e61-0'