containous / traefik-extra-service-fabric

Traefik extra: Service Fabric Provider
Apache License 2.0

Load Shedding with SF and Traefik #43

Open lawrencegripper opened 5 years ago

lawrencegripper commented 5 years ago

I've recently had some questions by email about the following problem statement. I wanted to open the conversation up as an issue to get opinions and thoughts, and to allow others to contribute and learn.

Given the problem statement: "Each service should be able to self-report that it's full, at this point only existing sessions should be routed to it. New sessions should go to other nodes". I think this is possible with the existing Traefik code, and if that fails, it should be possible with some code changes.

Proposed solution:

Currently all the servers in a backend have a default weight of 1 set in the configuration template, which uses this function to get the weight or return a default of 1. The template is used to output the configuration from the SF plugin to Traefik. The template can be customized for individual uses, so you could take a copy of it and edit it.

The code supports picking up dynamically set labels, which can be set by calling the SF API, as explained here. So services can set labels themselves using this method.
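As a rough illustration of a service setting its own label, here is a minimal Go sketch that builds (but does not send) the HTTP request. It assumes the label is stored as a string property through the Service Fabric REST "Put Property" call; the endpoint shape, the cluster URL, the service name `MyApp/MyService` and the instance ID `1234` are all assumptions to check against the SF documentation and your own cluster.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// propertyValue and propertyDescription model the JSON body assumed for the
// Service Fabric "Put Property" REST call.
type propertyValue struct {
	Kind string `json:"Kind"`
	Data string `json:"Data"`
}

type propertyDescription struct {
	PropertyName string        `json:"PropertyName"`
	Value        propertyValue `json:"Value"`
}

// newSetLabelRequest builds the PUT request that would store a label as a
// string property on the given service name. It does not send anything.
// The URL layout is an assumption, not taken from the issue text.
func newSetLabelRequest(clusterURL, serviceName, label, value string) (*http.Request, error) {
	body, err := json.Marshal(propertyDescription{
		PropertyName: label,
		Value:        propertyValue{Kind: "String", Data: value},
	})
	if err != nil {
		return nil, err
	}
	url := fmt.Sprintf("%s/Names/%s/$/GetProperty?api-version=6.0", clusterURL, serviceName)
	req, err := http.NewRequest(http.MethodPut, url, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	// Hypothetical values: a "full" instance marks itself with weight 0.
	req, err := newSetLabelRequest("http://localhost:19080", "MyApp/MyService",
		"traefik.servicefabric.instance.1234.weight", "0")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.Path)
}
```

Sending the request with `http.DefaultClient.Do(req)` is left out on purpose, since authentication and TLS setup depend on the cluster.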

So the aim is to have the service make a call to set a label that sets its particular weight, then have the template pick this up and make the change appear in Traefik (on all the instances in the cluster).

To do this we'd change this line "weight = {{ getWeight $service }}" to something like this:

        {{ $instanceLabel := printf "traefik.servicefabric.instance.%s.weight" $instance.ID }}
        {{ $hasPerServerWeight := hasLabel $service $instanceLabel }}
        {{ if $hasPerServerWeight }}
        weight = {{ getLabelValue $service $instanceLabel "1" }}
        {{ else }}
        weight = {{ getWeight $service }}
        {{ end }}

This would check for a label in the form of "traefik.servicefabric.instance.INSTANCEIDHERE.weight" and set its value as the server's weight. The InstanceID is the ID of the instance in SF, so it should be accessible by the code running in the service.

When the node is "full" it can set its weight to 0. I think this will not affect existing sessions, but this will need to be checked in a test.
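The per-instance lookup can be tried outside Traefik with Go's `text/template`. This is only a sketch: the `service` and `instance` types and the three helper functions are hypothetical stand-ins named after the ones the template calls, and the provider's real implementations and signatures may differ.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// service and instance are hypothetical stand-ins for the provider's types.
type service struct {
	Labels map[string]string
}

type instance struct {
	ID string
}

// Stub helpers with the same names the template uses. These are assumptions
// made for the sketch, not the provider's code.
var funcs = template.FuncMap{
	"hasLabel": func(s service, name string) bool {
		_, ok := s.Labels[name]
		return ok
	},
	"getLabelValue": func(s service, name, def string) string {
		if v, ok := s.Labels[name]; ok {
			return v
		}
		return def
	},
	"getWeight": func(s service) int { return 1 },
}

// weightTmpl uses the per-instance label when present, else the service weight.
const weightTmpl = `
{{- $label := printf "traefik.servicefabric.instance.%s.weight" .Instance.ID -}}
{{- if hasLabel .Service $label -}}
weight = {{ getLabelValue .Service $label "1" }}
{{- else -}}
weight = {{ getWeight .Service }}
{{- end -}}`

// renderWeight renders the weight line for one instance of a service.
func renderWeight(s service, inst instance) (string, error) {
	t, err := template.New("weight").Funcs(funcs).Parse(weightTmpl)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	data := struct {
		Service  service
		Instance instance
	}{s, inst}
	err = t.Execute(&buf, data)
	return buf.String(), err
}

func main() {
	svc := service{Labels: map[string]string{
		"traefik.servicefabric.instance.42.weight": "0", // instance 42 reports full
	}}
	for _, id := range []string{"42", "7"} {
		line, err := renderWeight(svc, instance{ID: id})
		if err != nil {
			panic(err)
		}
		fmt.Printf("instance %s: %s\n", id, line)
	}
}
```

With these stubs, instance 42 renders `weight = 0` and instance 7 falls back to `weight = 1`. Whether Traefik keeps existing sessions on a zero-weighted server is a separate question that this sketch does not answer.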

What could go wrong with this:

  1. It relies on getLabelValue, which is meant to retrieve a string value, so it won't be as "safe" as it could be. If you set a non-integer value on the label, the config update will fail rather than fall back to the default. (Fixable with a one-line PR to expose getLabelValueInt.)
  2. The template code I've written should be roughly right, but I've not had a chance to test it, so it may have some syntax errors.
  3. Existing sessions may be routed away from zero-weighted servers (the routing behavior is simple to test).
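The first risk comes down to parsing the label defensively. A minimal Go sketch of that fallback behavior follows; `labelIntOrDefault` is a hypothetical helper written for illustration, not the provider's getLabelValueInt, and rejecting negative weights is an extra assumption of this sketch.

```go
package main

import (
	"fmt"
	"strconv"
)

// labelIntOrDefault returns the named label parsed as an int. It returns def
// when the label is missing, is not a valid integer, or is negative, so a bad
// value never breaks the generated configuration.
func labelIntOrDefault(labels map[string]string, name string, def int) int {
	raw, ok := labels[name]
	if !ok {
		return def
	}
	n, err := strconv.Atoi(raw)
	if err != nil || n < 0 {
		return def
	}
	return n
}

func main() {
	labels := map[string]string{
		"traefik.servicefabric.instance.1.weight": "0",    // valid: instance is full
		"traefik.servicefabric.instance.2.weight": "full", // invalid: not an integer
	}
	for _, id := range []string{"1", "2", "3"} {
		name := "traefik.servicefabric.instance." + id + ".weight"
		fmt.Println(id, labelIntOrDefault(labels, name, 1))
	}
}
```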

Hopefully this makes sense. I'd be interested to hear how you get on if you want to give this a test.

Out of an abundance of caution: while I don't see any reason it won't work, I want to flag that this approach is not something we've used before, and I'd strongly recommend testing it well before using it in a production system.

Docs

Server weight in Traefik

Other notes

This approach could be combined with the Retry option in Traefik to handle any requests that still get routed to the node. However, this would need to be tested to see what is classed as a retryable error.