envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0

Add support for N levels of failover endpoints #1929

Closed surki closed 6 years ago

surki commented 6 years ago

I would like to use Envoy to load balance long-lived TCP connections under the following setup: say I have 2 machines, m1 and m2. I would like all traffic to go to m1. Traffic should go to m2 only if m1 fails (either the active or passive health check) or if m1 exceeds a certain number of connections defined per host.

This is similar to the "first" load balancer in HAProxy: https://cbonte.github.io/haproxy-dconv/1.7/configuration.html#balance

surki commented 6 years ago

So I did a quick PoC to see what it would take to add this feature; here are the changes:

Envoy changes: https://github.com/envoyproxy/envoy/compare/master...surki:load_balancer_standby
Envoy-api changes: https://github.com/envoyproxy/data-plane-api/compare/master...surki:load_balancer_standby

and the corresponding config file (it is a proxy for MySQL read-only traffic: it sends read-only queries to the "slave", and if the "slave" fails or reaches capacity it redirects connections to the "master". It also does an active health check by sending an auth packet (with username "proxyhealth") and an immediate quit command packet using MySQL protocol 4.1):

{
  "listeners": [{
    "address": "tcp://127.0.0.1:33060",
    "filters": [{
      "type": "read",
      "name": "tcp_proxy",
      "config": {
        "stat_prefix": "mysql",
        "route_config": {
          "routes": [{
            "cluster": "mysql"
          }]
        }
      }
    }]
  }],
  "admin": {
    "access_log_path": "/tmp/envoy.log",
    "address": "tcp://127.0.0.1:9901"
  },
  "cluster_manager": {
    "clusters": [{
      "name": "mysql",
      "connect_timeout_ms": 2500,
      "type": "strict_dns",
      "lb_type": "standby",
      "health_check": {
        "type": "tcp",
        "timeout_ms": 2000,
        "interval_ms": 10000,
        "unhealthy_threshold": 2,
        "healthy_threshold": 2,
        "service_name": "mysql",
        "send": [
          {"binary": "2d0000"},
          {"binary": "01"},
          {"binary": "00820000"},
          {"binary": "00000001"},
          {"binary": "21"},
          {"binary": "0000000000000000000000000000000000000000000000"},
          {"binary": "70726f78796865616c74680000"},
          {"binary": "010000"},
          {"binary": "00"},
          {"binary": "01"}
        ],
        "receive": [
          {"binary": "07000002000000"}
        ]
      },
      "hosts": [{
        "url": "tcp://mysql_slave:3306",
        "id": 0,
        "max_connections": 2048
      }, {
        "url": "tcp://mysql_master:3306",
        "id": 1,
        "max_connections": 1024
      }]
    }]
  }
}

Not sure if the above changes are in line with general Envoy architecture/good practice; a few questions:

PS: We are also planning to add a MySQL filter for exposing interesting stats (just like other filters such as DynamoDB do). I see that there is experimental Lua support; I may try that out for MySQL protocol parsing/stats extraction.

mattklein123 commented 6 years ago

Some quick questions:

1) Why is ID needed?
2) Do we really need max connections on a per host basis? Or can it be the same for all hosts in a cluster?

surki commented 6 years ago

ID: I need a way to deterministically pick hosts from first to last. Since DNS resolution happens asynchronously for all hosts concurrently, the values populated in LoadBalancerBase::hostset end up in a random order. Is there another way to achieve the ordering? (I need to choose hosts in the exact order specified in the JSON config file.)

max_connections on a per-host basis: I thought it would be more flexible since different hosts can have different throughput/capacity etc. We can move max_connections up to the cluster level if you think the host level is something we will not need (personally I do not need it per host, though I could see a use case for others).

mattklein123 commented 6 years ago

I think there is a general question here as to whether this should be a new load balancing policy, or actually built directly into the tcp_proxy code. For example, one could imagine adding the concept of a fallback cluster to tcp_proxy routing, and then having rules on when the fallback is actually used. I think @rshriram and @ggreenway had some thoughts on this. In some ways it fits a bit better in this way vs. a discrete balancing policy.

If we go with the LB built-in approach, I understand the need for ID. It seems unfortunate to have to add this to the Address structure, but I'm not sure how else to deal with this without doing a breaking change which we can't do at this point. @htuch any thoughts here?

Re: max connections per host, unless there is a specific need I would probably add this as a property of the cluster where we already have max_requests_per_connection.

I would like to get some additional thoughts on whether we should do this as a new LB type vs. building it into tcp_proxy. I think I am leaning towards doing what you have done here but would like some other opinions.

rshriram commented 6 years ago

I wish we could somehow use the max connections settings from the circuit breaker. This use case seems a bit niche to justify creating a special load balancer (if there are other scenarios where this bin-packing style LB is needed, I would love to hear them).

Even in this use case, what would you do if you have multiple slaves? It seems that the only requirement here is to use the master after the slaves are filled up, and among the slaves it doesn't matter how connections are assigned. Is this assumption correct? If so, we could imagine creating some form of fallback cluster option, where the fallback cluster contains the MySQL master and the slaves are in the main cluster. The fallback option seems like it would get more mileage across multiple use cases. You can set the circuit breaker in the main cluster to the total number of connections acceptable across the slaves, and failing that, the fallback cluster will be picked.

For the MySQL filter, I suggest doing it in C++, for performance reasons and the fact that Lua filters for TCP are not here yet. You could take a look at the Mongo filter for reference.

Another idea to satisfy the particular MySQL use case you have (assuming typical MySQL installations have the same needs): create the MySQL filter very much like the Redis filter. You can have first-class support for master and slave clusters in the filter. You will have control over when to fall back to the master.

surki commented 6 years ago

On 10/26/2017 04:53 AM, Matt Klein wrote:

I think there is a general question here as to whether this should be a new load balancing policy, or actually built directly into the tcp_proxy code. For example, one could imagine adding the concept of a fallback cluster to tcp_proxy routing, and then having rules on when the fallback is actually used. I think @rshriram and @ggreenway had some thoughts on this. In some ways it fits a bit better in this way vs. a discrete balancing policy.

If we go with the LB built-in approach, I understand the need for ID. It seems unfortunate to have to add this to the Address structure, but I'm not sure how else to deal with this without doing a breaking change which we can't do at this point. @htuch any thoughts here?

Yes, ID is the only remaining part (we can move out or drop max_connections, please see below), unless I can find another way to deterministically get the hosts sorted.

Re: max connections per host, unless there is a specific need I would probably add this as a property of the cluster where we already have max_requests_per_connection.

That should be fine. Or we can probably drop max_connections altogether; it can be just true standby load balancing (i.e., load balance only on active/passive health check failure). I was just trying to match the equivalent HAProxy config for no good reason.

I would like to get some additional thoughts on whether we should do this as a new LB type vs. building it into tcp_proxy. I think I am leaning towards doing what you have done here but would like some other opinions.


surki commented 6 years ago

On 10/26/2017 08:20 AM, Shriram Rajagopalan wrote:

I wish we could somehow use the max connections settings from the circuit breaker. This use case seems a bit niche to justify creating a special load balancer (if there are other scenarios where this bin-packing style LB is needed, I would love to hear them).

Thinking a bit more about this, we probably don't need max_connections (please see the above reply). All I want is: "if m1 fails, switch over to m2; if m2 fails, switch over to m3", etc.

Even in this use case, what would you do if you have multiple slaves? It seems that the only requirement here is to use the master after the slaves are filled up, and among the slaves it doesn't matter how connections are assigned. Is this assumption correct? If so, we could imagine creating some form of fallback cluster option, where the fallback cluster contains the MySQL master and the slaves are in the main cluster. The fallback option seems like it would get more mileage across multiple use cases. You can set the circuit breaker in the main cluster to the total number of connections acceptable across the slaves, and failing that, the fallback cluster will be picked.

There can be more than 2 machines. I am okay either way: LB or some kind of cluster fallback/tcp_proxy routing. The LB seemed simple enough to implement (and it could mimic our current HAProxy setup), so I could get to the other interesting bit (the MySQL filter).

For the MySQL filter, I suggest doing it in C++, for performance reasons and the fact that Lua filters for TCP are not here yet. You could take a look at the Mongo filter for reference.

Agree, C++ could be the right choice. I was not sure how easy it would be to maintain an out-of-tree Envoy filter, so I thought Lua could simplify things. Having taken a look at the other filters, the filter API is very nice/simple and shouldn't be too difficult to maintain.

Another idea to satisfy the particular MySQL use case you have (assuming typical MySQL installations have the same needs): create the MySQL filter very much like the Redis filter. You can have first-class support for master and slave clusters in the filter. You will have control over when to fall back to the master.

Interesting. Unfortunately, for us, the app currently makes this decision (I didn't show the full config above; we have multiple clusters with different combinations, like ClusterA: "m1, m2, m3", ClusterB: "m2, m1, m3", etc.). But that is something that could be moved to a filter at some point down the road.


mattklein123 commented 6 years ago

If you don't need max connections let's just drop that. Also, I just looked at the code, and the hosts actually should remain ordered for static/strict_dns clusters. For strict_dns this is only true if DNS for each "host" returns a single IP, which I'm guessing is your use case: https://github.com/envoyproxy/envoy/blob/master/source/common/upstream/upstream_impl.cc#L483

Assuming the above is OK for you, I think this is pretty simple: basically just write a new LB which assumes hosts are in order and always sends to the first host; if that fails, it sends to the next, etc. Would that work?
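A minimal sketch of that pick logic, using simplified stand-in types rather than Envoy's actual load balancer and host-set interfaces (the names and structures below are illustrative assumptions, not the real API):

#include <memory>
#include <vector>

// Illustrative stand-in for an Envoy upstream host; not the real type.
struct Host {
  bool healthy = false;
};
using HostPtr = std::shared_ptr<Host>;

// "Standby"/failover pick: walk the hosts in their configured order and
// return the first one that is passing health checks. If none are healthy,
// fall back to the first host so traffic still has somewhere to go.
HostPtr chooseHostInOrder(const std::vector<HostPtr>& ordered_hosts) {
  for (const HostPtr& host : ordered_hosts) {
    if (host->healthy) {
      return host;
    }
  }
  return ordered_hosts.empty() ? nullptr : ordered_hosts.front();
}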

Agree, C++ could be the right choice. I was not sure how easy it would be to maintain an out-of-tree Envoy filter

If you are going to do a MySQL filter and are willing to go through code review process, etc. please just contribute it! I think there are many people that would love a MySQL filter.

ggreenway commented 6 years ago

The more general solution is priority groups (at least that's what f5 called it).

Basically, you have multiple groups/pools of machines, and the groups are priority ordered. So you try to pick a host from the first group; if that fails, try from the next group, etc.

To represent that in envoy, I think it makes more sense to put it in the cluster/lb side than in the tcp_proxy. This could also be applicable to http easily enough.

I see two ways of doing it that could make sense:

1) Add a new LB type on clusters for this mode, and have some way for each host in the cluster to annotate which priority group it is in.

2) Create a cluster for each priority group, which could use any LB method and is one of the existing types. Then create another cluster of a new type and lb_type. The type would just point to the other clusters by name, in order of priority. The lb_type would logically try to lb-pick from each group in order until it got one that worked.
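A rough sketch of what the wrapper LB in option (2) could do, with hypothetical types rather than any existing Envoy interface: each priority group is reduced to the pick function of its sub-cluster's load balancer, and the wrapper walks the groups in priority order.

#include <functional>
#include <memory>
#include <vector>

struct Connection {};

// Hypothetical: the pick function of one priority group's underlying
// cluster LB (which can be any existing LB type).
using PickFn = std::function<std::shared_ptr<Connection>()>;

// Try each priority group in order; the first group whose LB returns a
// usable connection wins, otherwise fall through to the next group.
std::shared_ptr<Connection> pickFromPriorityGroups(
    const std::vector<PickFn>& groups_in_priority_order) {
  for (const PickFn& pick : groups_in_priority_order) {
    if (std::shared_ptr<Connection> conn = pick()) {
      return conn;
    }
  }
  return nullptr;  // No group could produce a healthy pick.
}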

mattklein123 commented 6 years ago

Agreed with @ggreenway that would be a nice general solution. I like (2). It's somewhat similar/related to what @zuercher did with the subset LB.

rshriram commented 6 years ago

You mean (1)? Using metadata is (1). You could have reserved metadata keywords that define priority.


mattklein123 commented 6 years ago

Yes I guess we could use metadata for (1), though this does not work with DNS which I think is going to be quite common for this. Personally I kind of like (2) which is a wrapper cluster and a wrapper LB which operate on sub-clusters.

I think there are several ways to implement this that are fine with me.

mattklein123 commented 6 years ago

FYI having a long convo with @alyssawilk about this offline. I think we are potentially converging on doing (1) above (adding priority group per endpoint), but will update this ticket after some more discussion.

alyssawilk commented 6 years ago

Really long update over here: https://github.com/envoyproxy/envoy/pull/1980#issuecomment-341436710

tl;dr I'll rewrite 1980 to address this use case as well :-)

surki commented 6 years ago

Sorry, I was on vacation for the last week+.

@alyssawilk @mattklein123 I see that we are going to do priority grouping. Is this being worked on currently? If it is, do let me know if I could help in any way (testing etc.). If it is not being implemented/planned yet, I could help with the implementation as well (though it looks like it is being clubbed with a larger refactoring, so I am not sure how much help I can be ...)

alyssawilk commented 6 years ago

I was planning on picking it up next week, but as I'm traveling I'm realistically more likely to get something out for review the week following. If you think you could do something faster, feel free to pick it up - just let me know!


surki commented 6 years ago

I will wait for your changes, but do let me know if you want me to test/try out your changes earlier.

alyssawilk commented 6 years ago

OK, I'm like 70% of the way done with the naive implementation, which, while sufficient to close this issue off, is not yet sufficient for my self-respect :-P

My plan A would be to mimic the load balancing code I'm most familiar with. It basically has an "overprovision factor" (default 1.4) and starts spilling traffic out of a priority level when fudge_factor * host-set-health-ratio < 1. I guess we'd have to adapt this to priority levels, spilling iff the sum of the health of the higher priority levels was less than 1.

Assuming you had 2 priority levels and 100% healthy backends, all traffic goes to P=0. As P=0 starts going unhealthy (80% health) there's initially no spillover. When P=0 hits ~70%* healthy, traffic starts trickling to P=1. By the time P=0 is 60% healthy, roughly 16%** of traffic is hitting P=1. People who want faster or more gentle failover can tweak the overprovision factor accordingly.
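A small sketch of that spillover arithmetic, assuming the traffic kept at a priority level is min(1, overprovision_factor * healthy_ratio); this reproduces the figures above (no spill at 80% health with a 1.4 factor, 16% spill at 60%):

#include <algorithm>
#include <cstdio>

// Fraction of traffic that spills past a priority level, assuming the level
// keeps min(1, factor * healthy_ratio) of the load (illustrative only).
double spillFraction(double overprovision_factor, double healthy_ratio) {
  const double kept = std::min(1.0, overprovision_factor * healthy_ratio);
  return 1.0 - kept;
}

int main() {
  std::printf("health 0.8 -> spill %.2f\n", spillFraction(1.4, 0.8));  // 0.00
  std::printf("health 0.6 -> spill %.2f\n", spillFraction(1.4, 0.6));  // 0.16
  return 0;
}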

I'd have to dig up the details on the ratio drop-off when P=1 starts going unhealthy as well, but I figure rather than doing it in code I'd poll for improvements/enhancements with enough time to dicker while I wrap up the naive implementation for ring hash etc.

mattklein123 commented 6 years ago

@alyssawilk ^ SGTM. The only complexity is that I think it would probably break the way the zone aware balancing algorithm works in terms of spillover calculations between zones (or make it substantially more complicated). I'm honestly not sure this matters. If failover is occurring we could potentially even disable zone aware balancing. I would keep it in mind though.

alyssawilk commented 6 years ago

@surki The behavior that you asked for should work now if you want to play around.

There are 2 more follow-up PRs, but they're for gentle failover in the case where there's more than one endpoint per priority set.