airbnb / nerve

A service registration daemon that performs health checks; companion to airbnb/synapse
MIT License

Nerve reports affinity if provided in configuration #81

Closed: scarletmeow closed this issue 8 years ago

scarletmeow commented 8 years ago

Haproxy Affinity Configuration in Synapse

Synapse registers service instances and assigns them equal weights in the pool. However in reality the cost of connecting to different instances is not equal: instances could be in different AWS availability zones or be part of different network topologies. There's no existing method for Synapse to support partitioning instances into separate pools based on affinity and to bias load balancing towards the pool with similar affinity.

NOTE: This is different from configuring 'weight' per service instance; the weight of a service instance should be computed relative to the affinity value that Synapse parses from its configuration.

This proposal describes how to add affinity support in Synapse through a new affinity_options field in Synapse's haproxy configuration. This configuration will define how to map service data (from Nerve or other data sources) to an affinity value, and how Synapse will use affinities to distribute traffic between the similar- and dissimilar-affinity pools. This design also requires Nerve to support reporting additional service data, so users can report affinity values per service instance.

Instead of adding another whitelisted parameter to the service data that is reported, Nerve should support a field extra_data that contains a bag of values and is configurable on a per-service basis. Advanced users of Nerve/Synapse can enable reporting of service instance affinity into the extra_data bag so that Synapse can parse the information to construct the appropriate HAProxy configuration.

Example configuration and data payloads

Nerve service configuration (my_service.json)

Nerve will now accept additional service data under the extra_data field.

{
  "host": "1.2.3.4",
  "port": 3000,
  "extra_data": {
    "affinity": "us-east-1",
    "custom_tag": "custom_value",
    ...
  },
  ...
}

Nerve zookeeper reporter payload (compact JSON written to ZK)

Nerve reporters will now publish the extra_data field in service data.

{"host":"1.2.3.4","port":3000,"name":"i-asdf1234","extra_data":{"affinity":"us-east-1","custom_tag":"custom_value"}}

Synapse configuration (my_service.json)

Synapse supports affinity_options in service configuration. Users should provide the affinity value of the main server pool, the path to the affinity value within the service instance data, and the affinity mode configuration. There would be a "weighted" mode such that when weight is set to 95, 95% of traffic is routed to the main pool of similar-affinity servers and the remaining 5% is routed to the other pool. The "backup" mode would set up the dissimilar-affinity pool as HAProxy backup servers.

{
  "default_servers": [ ... ],
  "discovery": { ... },
  "haproxy": {
    "server_options": ...,
    "listen": [ ... ],
    "backend": [ ... ],
    "affinity_options": {
      "affinity_path": "/extra_data/affinity",
      "affinity": "us-east-2",
      "mode": "weighted",
      "weight": 95
    }
  }
}

Synapse-generated HAProxy Configuration

(TODO: check if HAProxy supports fractional weights)

Assuming the host has similar affinity to 1.2.3.4, 1.2.3.5 and 1.2.3.6, this would be a possible HAProxy configuration given weight=95:

frontend my_service
        bind ::1:3597
        mode tcp
        option tcplog
        bind localhost:3597
        default_backend my_service

backend my_service
        mode tcp
        option tcplog
        server 1.2.3.4:3000_i-foo 1.2.3.4:3000 weight 38
        server 1.2.3.5:3000_i-bar 1.2.3.5:3000 weight 38
        server 1.2.3.6:3000_i-baz 1.2.3.6:3000 weight 38
        server 1.2.4.1:3000_i-zapp 1.2.4.1:3000 weight 3
        server 1.2.4.2:3000_i-brannigan 1.2.4.2:3000 weight 3

On a host with similar affinity to only i-zapp, given weight=95:

frontend my_service
        bind ::1:3597
        mode tcp
        option tcplog
        bind localhost:3597
        default_backend my_service

backend my_service
        mode tcp
        option tcplog
        server 1.2.3.4:3000_i-foo 1.2.3.4:3000 weight 1
        server 1.2.3.5:3000_i-bar 1.2.3.5:3000 weight 1
        server 1.2.3.6:3000_i-baz 1.2.3.6:3000 weight 1
        server 1.2.4.1:3000_i-zapp 1.2.4.1:3000 weight 76
        server 1.2.4.2:3000_i-brannigan 1.2.4.2:3000 weight 1
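
For reference, one way to arrive at the integer weights shown above (the exact rounding scheme is left open here, so this is only an illustration) is to split the configured weight evenly across the servers of each pool and then scale the two per-server shares to small integers:

weight=95, 3 similar-affinity servers, 2 others:
        similar: 95 / 3 ≈ 31.7 per server      other: 5 / 2 = 2.5 per server
        ratio ≈ 12.7  →  weights 38 : 3        (3 × 38 = 114 vs 2 × 3 = 6, i.e. a 95% / 5% split)

weight=95, 1 similar-affinity server, 4 others:
        similar: 95 / 1 = 95 per server        other: 5 / 4 = 1.25 per server
        ratio = 76    →  weights 76 : 1        (76 vs 4 × 1 = 4, i.e. a 95% / 5% split)
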
jolynch commented 8 years ago

Can you give some more context for what affinity is and what the plan is on the synapse side?

SonicWang commented 8 years ago

@jolynch Hi Joseph, nice to meet you through code review! :)

To give you a bit more context: @scarletmeow is working on a change to make smartstack respect affinity groups for better routing. The reason we want this kind of change is that currently no additional information is available about affinity groups (or, to be more specific in Airbnb's case, Availability Zones). Because of this, all backends are treated equally by haproxy for round robin routing. We did some preliminary study of our traffic and found that on average we make quite a large number of inter-service calls from our main web app to other services. Because we always set up our servers in multiple AZs, we are making a lot of cross-AZ http calls. On average there is an additional 0.8-1.0ms per service call, inter-AZ vs. intra-AZ. Inter-AZ latency between us-east-1a and us-east-1b is the worst according to our stats.

As you can see, if we have this kind of setup: service A -> service B -> service C, and each hop is an inter-AZ call, we could end up incurring multiple ms of extra latency. If processing a single request within service A incurs dozens of requests to B and then to C, we would end up with a very long latency. This is why AZ awareness will be helpful for lowering our network latency.

But AZ is specific to AWS, so we want to make this more generic. That's why we decided to name this an affinity group. Our whole plan is to: 1) make Nerve report this information, 2) make Synapse understand this information from Nerve nodes, 3) make Synapse write haproxy configs in a way that uses this info (weighted rr).

How Nerve gets the affinity information is outside Nerve's scope. This information will be generated by whatever writes the Nerve config; in our case, it's our Chef scripts that generate it.

Hope this gives enough information about our intention of this PR.

jolynch commented 8 years ago

@jnb This is relevant to your interests

Hello :-)

Yea we ran into this problem a few years ago and discussed a change like this. We ended up going the route of leaving nerve/synapse simple and putting the complexity into how we configured it. For example, we can configure services to operate at the AZ, region, or superregion (group of regions) levels which at the end of the day ends up getting compiled down to the proper nerve/synapse configs. We explain a little of this in the paasta docs and the implementation of the configuration logic is here and here. We thought this was more consistent with the unixy design of Smartstack and that strategy plus haproxy_server_options and multiple backends/frontends allows us to do cool things like automatic failover. At the time we made those calls though we didn't have commit access and the nerve/synapse projects were sorta ... not maintained ... so we should definitely reconsider what we're doing now!

I see pros and cons to this approach every day and I'd love to chat about ways we can make it cleaner by putting the complexity of multi-latency-zone intelligence into the appropriate places. You guys work at airbnb, right? Want to grab coffee/lunch sometime early next week so we can chat about the plans here? In the meantime, of course, ship what you gotta ship; I just want to sync up so we can bring what we do in line with the project's recommendations.

jolynch commented 8 years ago

@SonicWang The tldr of all those links is that an alternative way to achieve what you're talking about is:

  1. Have Nerve register services in various AZs in a different locations in zookeeper
  2. Have Synapse look in the right place (so 1a looks in /nerve/useast1a/ and 1b looks in /nerve/useast1b/).

If you need to have different levels you can just register in multiple places (this would be the extra_advertise we talk about) along with the environment_tools conversions from like az to region etc ...
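
For concreteness, a minimal nerve config sketch of that dual registration (assuming nerve's standard zookeeper reporter options such as zk_hosts/zk_path; service names, paths, and check details are illustrative):

{
  "instance_id": "i-foo",
  "services": {
    "my_service_az": {
      "host": "1.2.3.4",
      "port": 3000,
      "reporter_type": "zookeeper",
      "zk_hosts": [ ... ],
      "zk_path": "/nerve/useast1a/my_service",
      "check_interval": 2,
      "checks": [ ... ]
    },
    "my_service_region": {
      "host": "1.2.3.4",
      "port": 3000,
      "reporter_type": "zookeeper",
      "zk_hosts": [ ... ],
      "zk_path": "/nerve/useast1/my_service",
      "check_interval": 2,
      "checks": [ ... ]
    }
  }
}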

scarletmeow commented 8 years ago

@jolynch I think the use-case for affinity we're looking at is to build up a pool of services in the same AZ and a backup pool of services from the other AZs. In the layout you described, this seems to require a recursive subscription to load all the services from various affinities to build the main/backup pools. Should there be any concern about performance when doing these watches in ZK?

SonicWang commented 8 years ago

@jolynch Thank you very much for the links and suggestions. I carefully studied them this afternoon. Yeah, this is a good alternative that works around nerve/synapse to build region awareness.

This solution may not be easy for Airbnb to adopt, though. The way Paasta does it is basically to register backends in different regions under different znodes. We are currently registering all backends under the same prefix.

After some discussion offline with @igor47, we think it still has value to let Nerve pass arbitrary data to zookeeper so that on the synapse side we can have more flexibility. At this point I think this PR's title should be slightly changed to say that it lets nerve pass arbitrary config data to ZK. This is not specific to AZ-awareness, but it will make doing region-aware routing on the synapse side easier.

jolynch commented 8 years ago

I apologize ahead of time for the book of a reply here ... I'm super interested in sitting down to chat with you guys about this stuff if you have time (don't block on it, but let's sync up if possible).

In the layout you described, this seems to require a recursive subscription to load all the services from various affinities to build the main/backup pools.

You end up requiring N registrations per service where N is the number of levels you want to care about. So for example, let's say you have az and region: you would have 2 registrations per service, one in the az path and one in the region path. Presumably you want to prefer instances in the az but fall back to the region in case of an outage of all instances in an az. Nerve would register each instance in both that node's AZ and in its region (/nerve/useast1a/<service> and /nerve/useast1/<service>). Then in Synapse you have a bare backend (no port) for the useast1 pool with a well-defined backend_name (this is why I added backend_name) and a frontend/backend for the service itself that looks in the useast1a registrations. Finally you have ACLs on the frontend as shown in Farm Failover to fail over to the useast1 pool in case of failure. You can also use nerve's haproxy_server_options option to register certain instances as backups by passing backup.

Basically you can achieve full farm failover with backup nodes via usage of nerve's haproxy_server_options and synapse's backend_name, backend and frontend sections.
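
A rough synapse config sketch of that setup (the exact keys are from memory and may differ; the pieces described above are backend_name, the portless backend, and ACL-based failover in the frontend lines):

{
  "services": {
    "my_service": {
      "discovery": { "method": "zookeeper", "path": "/nerve/useast1a/my_service", "hosts": [ ... ] },
      "haproxy": {
        "port": 3597,
        "server_options": "check inter 2s rise 3 fall 2",
        "frontend": [
          "acl az_pool_down nbsrv(my_service) eq 0",
          "use_backend my_service_useast1 if az_pool_down"
        ]
      }
    },
    "my_service_region": {
      "discovery": { "method": "zookeeper", "path": "/nerve/useast1/my_service", "hosts": [ ... ] },
      "haproxy": {
        "backend_name": "my_service_useast1",
        "server_options": "check inter 2s rise 3 fall 2"
      }
    }
  }
}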

Should there be any concern about performance when doing these watches in ZK?

Actually the watches are fine, it's the connections that killed us. That's why we contributed connection pooling. The watcher churn (having to send so many watch fires when nerve bounces) is what's looking like the next potential bottleneck for us now, which is why I'm working on #80.

This solution may not be easy for Airbnb to adopt, though. The way Paasta does it is basically to register backends in different regions under different znodes. We are currently registering all backends under the same prefix.

This exact reason is one of the biggest pros to doing this kind of logic at the configuration layer. We've "moved" our registrations between Zookeeper clusters and paths within those clusters many times by dual registering from nerve (registering in the old and new locations) and then switching synapse over from discovering at the old location to the new location.

After some discussion offline with @igor47, we think it still has value to let Nerve pass arbitrary data to zookeeper so that on the synapse side we can have more flexibility.

Definitely. I have been trying to figure out what data should be in the registrations and so far the best description we have is in synapse. We already added weight and haproxy_server_options to give synapse more control, but I definitely think we need to figure out a good way to do this going forward. Basically we can either keep adding "tags" at the top level, or we can try to group them into some kind of hierarchy, or we can do something else. I dig it, we need to do something, just not sure what.

SonicWang commented 8 years ago

@jolynch Thank you for such an insightful reply! Now I see how one would move from one scheme to another with multiple registrations. Let's figure out what to pass in to nerve then. We are thinking about not adding more top level tags. Instead, we are thinking about passing an extra_data top level tag which will group all extra tags under it. As an example we can then pass in affinity tags (az, region, etc.) under it. Wanted to get your thoughts about this.

scarletmeow commented 8 years ago

Updated the description with the proposed plan for affinity configuration in Synapse and pushed a new patchset for Nerve to support reporting the extra_args parameter.

@jolynch , @SonicWang let me know your thoughts on this proposal!

jolynch commented 8 years ago

Just curious, what's going to differentiate metadata that goes at the top level from that which goes in extra_args? Maybe the differentiation should be: if it's smartstack machinery not related to a particular ServiceWatcher in Synapse, we put it at the top level, and if it's for a custom or specific ServiceWatcher, we put it in extra_args? Basically it's an escape hatch for custom behavior on the Watcher side.

As for the affinity idea, I grok it, but I wonder if we can achieve the goal more intuitively, and with a less invasive Synapse change, via a fallback list to other watchers' backend name(s) (and any options on how to do that, like using the other watchers' backends as backups, or weights, or automatically failing over with ACLs, or whatever). I think we can express much more powerful affinity and fallback modes that way because we can chain fallbacks. With affinity as I understand it here we can only go down one level (so we can filter a region down to an AZ, but can't, for example, say "prefer local AZ instances, if those are unavailable go to the region, and if those are unavailable go to another region").
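
Purely as a hypothetical illustration of that fallback-list idea (none of these keys exist in Synapse today; the shape, the fallback key, and the mode values are made up):

{
  "haproxy": {
    "port": 3597,
    "fallback": [
      { "backend_name": "my_service_region", "mode": "failover" },
      { "backend_name": "my_service_global", "mode": "backup" }
    ]
  }
}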

scarletmeow commented 8 years ago

Just curious, what's going to differentiate metadata that goes at the top level from that which goes in extra_args? Maybe the differentiation should be: if it's smartstack machinery not related to a particular ServiceWatcher in Synapse, we put it at the top level, and if it's for a custom or specific ServiceWatcher, we put it in extra_args? Basically it's an escape hatch for custom behavior on the Watcher side.

I agree, and I like your summary of the distinction between top level/custom meta data. Adding a field to support custom data payloads will enable more powerful plugin behavior in Synapse.

As for the affinity idea, I grok it, but I wonder if we can achieve the goal more intuitively, and with a less invasive Synapse change, via a fallback list to other watchers' backend name(s) (and any options on how to do that, like using the other watchers' backends as backups, or weights, or automatically failing over with ACLs, or whatever).

I'm open to suggestions. I think you raise a great point here that the current proposal doesn't support multi-tier fallback and we could do better to make this syntax more extensible.

What I wanted to emphasize in my proposal was the ability for Synapse to configure instance-specific groupings of the available instances based on reported service data: for example, being able to weight traffic in favor of instances in the same AZ. This configuration should be possible with minimal logic on the Nerve reporter side.

I think we can express much more powerful affinity and fallback modes that way because we can chain fallbacks. With affinity as I understand it here we can only go down one level (so we can filter a region down to an AZ, but can't, for example, say "prefer local AZ instances, if those are unavailable go to the region, and if those are unavailable go to another region").

I think the proposal could be extended to support generic tiers of pools. We could think of the ordered configuration as a series of case statements to sort service instances into backend pools.

Let's assume that the service has the following service data reported by Nerve:

{
  "extra_args": {
    "affinity": {
      "az": "us-east-1a",
      "region": "us-east-1",
      "dc": "us"
    }
  }
}

Putting aside the details of what the generated HAProxy configuration would look like, what if the Synapse configuration were defined as follows:

{
  "default_servers": [ ... ],
  "discovery": { ... },
  "haproxy": {
    "server_options": ...,
    "listen": [ ... ],
    "backend": [ ... ],
    "affinity_options": [
        { "affinity_group": "/extra_args/affinity/az", "affinity_value": "us-east-1a", "mode": "weighted", "weight": "95" },
        { "affinity_group": "/extra_args/affinity/region", "affinity_value": "us-east-1", "mode": "backup" },
        { "affinity_group": "/extra_args/affinity/dc", "affinity_value": "us", "mode": "backup" }
    ]
  }
}

This configuration would achieve the following traffic pattern:

95% of traffic goes to the service instances in us-east-1a. 5% of traffic will be routed to other instances in us-east-1 (i.e. us-east-1b, us-east-1c, ...), which form a backend pool without a frontend. If there are no service instances available to serve traffic in us-east-1, then traffic will fall back to instances found in the us datacenters.

Each service instance will belong to the first matching affinity pool from the provided sequence of affinity groups.
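
Purely as an illustration of the resulting traffic split (server names, addresses, and counts below are made up, and the integer rounding is not part of this proposal), the generated backend might look roughly like:

backend my_service
        mode tcp
        option tcplog
        # tier 1: /extra_args/affinity/az == us-east-1a, weighted mode at 95%
        server 1.2.3.4:3000_i-foo 1.2.3.4:3000 weight 38
        server 1.2.3.5:3000_i-bar 1.2.3.5:3000 weight 38
        # tier 2: /extra_args/affinity/region == us-east-1, shares the remaining 5%
        server 1.2.4.1:3000_i-zapp 1.2.4.1:3000 weight 2
        server 1.2.4.2:3000_i-brannigan 1.2.4.2:3000 weight 2
        # tier 3: /extra_args/affinity/dc == us, only receives traffic if the servers above are down
        server 1.3.0.1:3000_i-kif 1.3.0.1:3000 backup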

SonicWang commented 8 years ago

Kindly pinging @jolynch @igor47. I think @scarletmeow's proposal is a good strategy for combining chained failover configuration with weighted routing.

jolynch commented 8 years ago

I think I slightly prefer re-using the 'label'/'tag' abstraction that Kubernetes has popularized instead of extra_data, but it's basically the same thing. Let's also make sure to make it clear in the README that this is arbitrary k-v data that anyone can add for any reason.

I like the idea of Nerve providing more info to Synapse, I always have, and this PR solves that, shipit. However, I'm still pretty worried about the Synapse-side affinity proposal, because any configuration language we can come up with will be a strict subset of HAProxy's ACLs, and we'll have to support it for any possible frontend that Synapse might want (e.g. someone is working on an nginx frontend PR right now). I'm also worried about how these are going to interact with the weight and haproxy_server_options options that already exist (for example, if the affinity weight conflicts). I'm all for making it easier to work with HAProxy, but I think we should definitely tread lightly if we're going to define our own language for failover. I'm going to chat with @jnb tomorrow and see what we can come up with.

SonicWang commented 8 years ago

@jolynch I think the difference between this and labels in Kubernetes is that extra_data can have an arbitrary payload (though I don't think it's best practice to attach a huge payload). Kubernetes specifically says not to allow arbitrary structured data in labels, so they are purely used for tagging. But yeah, I see your point regarding tagging.

I think right now making advanced use of synapse requires significant effort in setting up tooling outside smartstack (such as using Paasta) and assuming topology outside smartstack (such as multi-registering). Our goal is to provide a reasonable solution within Synapse to address that. I think weight right now is not even used directly inside Synapse, so I guess it's aimed at helping with your tooling, right? Our proposal can still respect the weight even when we do affinity-based ACLs. For example, within a tier we can use the weight passed in from backend servers instead of assigning weights uniformly across machines. As for haproxy_server_options, we currently don't have such usage within Airbnb, so I'm not sure what the plan is there. But as a conservative measure we can change our proposal a bit so that if haproxy_server_options is present we ignore the affinity directives.

igor47 commented 8 years ago

specifically on the topic of how to configure synapse to make use of this data, the idea @SonicWang and I discussed was a filtering mechanism. in other words, when creating a service with a set of backends, we should allow synapse to filter which backends make it into the service based on arbitrary filters over the tag data.

in this way, a single pool of backends can be broken up into several haproxy services based on the tags. for instance, for all backends under service service1 you can create 3 haproxy services: service1-us-east-1a, service1-us-east-1b, and service1-us-east-1c.

the actual routing can still be controlled by writing pure haproxy configuration using the synapse-provided mechanisms (the haproxy section)

does that make sense? i haven't thought deeply enough about whether this is viable.
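
as a purely hypothetical sketch of that filtering idea (no such filter key exists yet; the label_filter key and its shape below are made up for illustration), each haproxy service could be declared with its own filter over the label data:

{
  "services": {
    "service1-us-east-1a": {
      "discovery": {
        "method": "zookeeper",
        "path": "/nerve/services/service1",
        "hosts": [ ... ],
        "label_filter": { "label": "az", "value": "us-east-1a", "condition": "equals" }
      },
      "haproxy": { "port": 3700, ... }
    },
    "service1-us-east-1b": {
      "discovery": {
        "method": "zookeeper",
        "path": "/nerve/services/service1",
        "hosts": [ ... ],
        "label_filter": { "label": "az", "value": "us-east-1b", "condition": "equals" }
      },
      "haproxy": { "port": 3701, ... }
    }
  }
}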

igor47 commented 8 years ago

this PR LGTM. i agree with @jolynch that using something like labels vs extra_args seems nice -- do we actually anticipate wanting deep data structures in there? we probably should avoid that. i would support renaming extra_args to labels as a UI improvement to better communicate to engineers how to use this new power.

after reading the entire discussion, i don't think synapse should have notions of affinity. i think synapse should create services containing discovered backends, and we should leave routing logic to the load balancer (haproxy, nginx, whatever). hopefully this will inform @scarletmeow 's follow-up work on synapse to start making use of these additional labels.

jolynch commented 8 years ago

@igor47, @SonicWang, @scarletmeow

I dig the idea of filtering backends out based on tags/labels on the Synapse watcher side and using this to generate multiple backends with their own routing configurations. The additional watches are really not a big deal in our experience and this design would reduce writes (because we don't have the same info duplicated in N paths) which has been a big deal.

scarletmeow commented 8 years ago

@igor47, @SonicWang, @jolynch

Thank you all for the feedback. It seems like affinity is too loaded a term to introduce to Nerve/Synapse. I've pushed a new patchset with extra_args now renamed to labels. If it looks good, please merge it!

To summarize the thread, we will add support in Synapse for users to set up multiple backends for each service. Each backend will contain servers grouped by user-defined filters that use service tag/label data to split service instances across backends. Users will also be responsible for constructing the Synapse frontend configuration that describes how traffic is routed to each custom backend. We still need to discuss the filter configuration language, but let's tackle that in the next PR.

@igor47 , is the following HAProxy configuration what you had in mind in your example of failing over from us-east1 -> us-east2 -> us-east3?

frontend service1
        bind ::1:80
        # User-defined configuration in service/haproxy/frontend:
        acl service1-us-east1-down nbsrv(service1-us-east1) eq 0
        acl service1-us-east2-down nbsrv(service1-us-east2) eq 0
        # if no services up in east1, failover to east2
        use_backend service1-us-east2 if service1-us-east1-down
        # if no services up in east1 or east2, failover to east3
        use_backend service1-us-east3 if service1-us-east1-down service1-us-east2-down
        default_backend service1-us-east1

# Synapse generated backends, filtered by the value at /labels/az
# TBD: Synapse filter configuration needed to generate this set of backends 
backend service1-us-east1
        server 1.2.1.4:3000_i-foo 0.0.0.0:1000 check inter 60s rise 3 fall 2

backend service1-us-east2
        server 1.2.2.4:3000_i-bar 0.0.0.0:1000 check inter 60s rise 3 fall 2

backend service1-us-east3
        server 1.2.3.4:3000_i-baz 0.0.0.0:1000 check inter 60s rise 3 fall 2
jolynch commented 8 years ago

This looked good to me so I went ahead and merged it. I'll start rolling this out in our infra tomorrow :-)

SonicWang commented 8 years ago

@jolynch haha you are so fast! Thanks for merging.

@scarletmeow I had the exact same discussion with Igor this afternoon. You already figured the idea out :) (the acl can be tweaked a bit more, but overall it's something like this). Let's chat tomorrow about the synapse piece.

scarletmeow commented 8 years ago

@jolynch Thanks for the quick merge!

@SonicWang Sounds good to me :). I'm currently reading about Kubernetes' label expressions and I think their approach is a good start (although I prefer their API syntax to the comma-delimited requirements syntax).