cloudfoundry / diego-notes

Diego Notes

Alternative Placement proposal #4

Closed fraenkel closed 9 years ago

fraenkel commented 9 years ago

Placing a Task or an LRP can be difficult, and the result isn't always correct or optimal. The ever-changing environment makes that a reality.

We can enumerate all the rules for placement into two simple categories: required and optional. Required rules are a given and cannot be ignored. Optional rules would be great to satisfy, but under certain circumstances you are willing to do without them. The list of optional rules is ordered by priority, which just means you have a higher tolerance for dropping some before others.

If we use our current Task/LRP scenario, we have a bunch of required rules, e.g., disk, memory, stack, tags (placement pools), etc. We also have a list of optional rules in priority order: anti-Cell co-location, then anti-AZ co-location. I am willing to give up anti-AZ co-location before anti-Cell co-location.

Placement becomes nothing more than applying the rules to produce a set of possibilities. The items in the set of possibilities can also carry a preference or priority; in our case it would be memory and/or disk. When there are many to choose from, any can be selected by whatever means. If the set of possibilities is empty after applying the rules, you remove the lowest-priority optional rule and retry placement.
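
To make that retry loop concrete, here's a minimal Go sketch of the relaxation described above. `Cell`, `Rule`, and `Place` are illustrative names for this proposal, not existing Diego types:

```go
package placement

// Cell stands in for a candidate execution location; the real fields
// (capacity, stack, zone, tags) are elided.
type Cell struct {
	ID string
}

// A Rule filters candidate cells. Required rules can never be dropped;
// optional rules are ordered highest-priority-first.
type Rule func(Cell) bool

// apply returns the cells satisfying every rule.
func apply(cells []Cell, rules []Rule) []Cell {
	var out []Cell
	for _, c := range cells {
		ok := true
		for _, r := range rules {
			if !r(c) {
				ok = false
				break
			}
		}
		if ok {
			out = append(out, c)
		}
	}
	return out
}

// Place applies every required rule plus as many optional rules as
// possible. If the possibility set comes back empty, it drops the
// lowest-priority optional rule and retries.
func Place(cells []Cell, required, optional []Rule) []Cell {
	for n := len(optional); n >= 0; n-- {
		rules := append(append([]Rule{}, required...), optional[:n]...)
		if possibilities := apply(cells, rules); len(possibilities) > 0 {
			return possibilities
		}
	}
	return nil // even the required rules alone can't be satisfied
}
```

The caller then picks among the returned possibilities using its own preference (memory and/or disk), which keeps the selection policy out of the rule machinery.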

For this placement to work, we don't really care how many AZs you have.

Thoughts?

jchesterpivotal commented 9 years ago

Sounds like a multi-objective optimisation problem (MOOP). I was taught that these are Hard To Solve (tm).

That said, there's lots of literature around solving MOOPs, including MOOPs with ranked objectives, including classifying objectives into those which cause a fatal error if they can't be satisfied, vs those which can be dropped or even "repaired".

The thing is that optimality is usually either unachievable; or, if achievable, the search space is too large to enumerate; or, if the search space is tractable, you find a "Pareto front" of alternative solutions which are indistinguishable after the scores for sub-objectives have been aggregated.

Then if you add the network fallacies, you're trying to perform optimisation with noisy inputs to a selection / ranking function. This can't really be made to go away, which is where you start to head into throwing your hands in the air and accepting heuristic and, if you like cool stuff, stochastic approaches.

Amit-PivotalLabs commented 9 years ago

Here's a blog post about the Diego auction as a multi-objective constrained optimization problem.

http://blog.pivotal.io/cloud-foundry-pivotal/products/app-placement-in-cloud-foundry-diego-a-classical-optimization-problem

fraenkel commented 9 years ago

I am assuming that the divergence between the article and the code is either due to functionality that hasn't arrived yet or to things we have learned. What I propose and what you have described in your article and the code are very similar. There are, however, some differences when it comes to what the constraints are and how to compute the objective function.

Amit-PivotalLabs commented 9 years ago

What are your different constraints and how is your objective function different?

The article is a few months old. It's also only partially describing how Diego worked at the time, and partially speculating on how it could work.

jchesterpivotal commented 9 years ago

What will the policy be when required constraints can't be satisfied? Do we return an error to the client?

fraenkel commented 9 years ago

That is really outside the scope of placement. You can decide to retry for some period of time, fail fast (Tasks), or retry indefinitely (LRPs). The goal is to return a set of possibilities and let the caller decide what actions to take.

Maybe my math is weak, but I don't know how to represent in a single expression the requirement that Cell co-location > AZ co-location > the rest. You reach a point where a mathematical expression isn't sufficient and you need to recompute based on different inputs.

jchesterpivotal commented 9 years ago

My math is weak too :)

As I understand it, there are two things at work here. The first is to determine whether a proposed solution is valid; the second is to rank the valid solutions.

Mandatory requirements behave like booleans joined by a chain of &&s. If any mandatory requirement can't be satisfied, you abort and bail out.

Otherwise you proceed to ranking, which in Amit's blog post is a sum of floats, highest score wins.
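
In code, that two-phase shape might look like the following Go sketch. The names and the highest-score-wins convention are mine for illustration, not Diego's actual auction code:

```go
package placement

import "math"

// Candidate stands in for a cell under consideration.
type Candidate struct{ ID string }

// pick filters by the mandatory requirements (the chain of &&s), then
// ranks the survivors by score, highest score winning. ok is false if
// no candidate satisfies every mandatory requirement.
func pick(candidates []Candidate, mandatory []func(Candidate) bool, score func(Candidate) float64) (best Candidate, ok bool) {
	bestScore := math.Inf(-1)
	for _, c := range candidates {
		valid := true
		for _, m := range mandatory {
			valid = valid && m(c)
		}
		if s := score(c); valid && s > bestScore {
			best, bestScore, ok = c, s, true
		}
	}
	return best, ok
}
```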

The problem is that some requirements will never be satisfiable because of logical inconsistencies, so no matter how many requests are made, they fail.

That said, I guess this is the basic problem with Computer Science anyhow. Enter, stage left, programmers responsible for working out why their request for 2000TB of RAM, a pony, and not-a-pony won't work.

onsi commented 9 years ago

@fraenkel what we have today and what is described in the doc is essentially exactly what you've described:

Required: stack, placement pools, enough capacity
Optional (in rank order): anti-cell-affinity, (memory, disk, num containers)

One can add anti-az-affinity without much more trouble. But we're adding a wrinkle: we don't want actuals to flood if a zone goes down. That's the only reason we inject the # of zones, and I can't see a robust way to do this without injecting the # of zones.

@jchesterpivotal MOOPs are hard if you want to find a best solution. This is almost impossible for us because we cannot control the order in which applications arrive - we've got to fill buckets optimally but have no insight into the distribution of rocks/grains beforehand and no control over when we're handed rocks/grains.

Fortunately, we don't need to find the best solution. Just good enough solutions. Turns out that this is a much easier problem and one that we've basically solved at this point.

amitkgupta commented 9 years ago

@jchesterpivotal I don't see how we allow a client to request anything logically inconsistent. Right now, the constraints for a cell to be viable are that it has at least so much memory and disk, and that it has the right stack. We don't give clients free rein to define a boolean expression representing their constraints; it's a very specific expression we compute, which we can see can't be inconsistent.

The way to represent multiple priorities within the same objective function is easy in this case, and it's a standard trick in the study of MOOPs. It leverages the fact that all factors except the least important can be scored in discrete units, and all factors except the most important have bounded scores. Before applying the trick, say we have three factors f1, f2, f3, in increasing order of importance, where f1 is continuous but bounded (below 3.14, say), and f2 and f3 are scored in discrete units, with f2 bounded in terms of the number of AZs.

Then:

f3 * 3.14 * (numAZs - 1) + f2 * 3.14 + f1

is a single score that makes f3 the most significant factor, with ties then broken by f2, and, if there are still ties, by the least significant factor f1.
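
A minimal Go sketch of that folding, assuming the bounds just described (combinedScore is an illustrative name, not Diego code):

```go
package placement

// combinedScore folds three ranked factors into a single float using
// the formula above: f3 carries the most weight, f2 breaks ties among
// equal f3 values, and f1 breaks any remaining ties. The assumed
// bounds (f1 < 3.14, f2 ranging over the AZ count) are what keep the
// lower-priority terms from overpowering the higher-priority ones.
func combinedScore(f1 float64, f2, f3, numAZs int) float64 {
	return float64(f3)*3.14*float64(numAZs-1) + float64(f2)*3.14 + f1
}
```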

fraenkel commented 9 years ago

@onsi There is one major difference between what is there and what I propose. Given your function, the placement of instances into zones follows a fixed order. With a larger number of zones, your placement will always pick the next highest zone, which may not have the best availability of resources. I don't see how this prevents floods from happening; the current function doesn't prevent flooding.

There are other ways to prevent or reduce flooding. We already have a means to prioritize who should get placed first. You are just describing additional constraints on the order of processing rather than how it should be placed.

onsi commented 9 years ago

Let me define "flood":

Consider two zones, Z1 and Z2, each with capacity "100". Now let's start "150" units of work. If the auction works correctly we'll end up with:

Z1: 75 running, 25 available
Z2: 75 running, 25 available

(of course, we won't quite nail it but we should be close).

Now. A meteor strikes Z2. What do we do? It would be great to avoid a situation where we flood all the missing instances in Z2 over to Z1:

Z1: 100 running
Z2: borked

When Z2 eventually comes up we'll end up with:

Z1: 100 running
Z2: 50 running

which isn't great and is hard to fix (rebalancing after the fact, while doable, is not something we should really rely on). What's worse is that now any new work will have to go into Z2 and will not be HA.

What the doc proposes is that only single or two-instance applications will flow from a borked Z2 to Z1. On big public installations like PWS/Bluemix this isn't a huge win as most things are single-instance demo apps.

However in enterprise environments running production workloads I imagine most applications will have many instances. With instance pinning for index > 1, we prevent these sorts of floods. Instances that should be on Z2 simply won't be allocated on Z1.

To use your parlance, the AZ is sometimes a Required rule and sometimes an Optional rule: if index <= 1 it's Optional; if index > 1 it's Required.
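
A sketch of that rule in Go. The modulo zone assignment is my guess at a pinning scheme; the doc may pin differently:

```go
package placement

// zoneRule reports whether the AZ constraint is Required for a given
// instance index and, when it is, which zone the instance is pinned to.
func zoneRule(index, numZones int) (pinnedZone int, required bool) {
	if index <= 1 {
		// Optional: prefer spreading across AZs, but place anywhere
		// rather than fail.
		return -1, false
	}
	// Required: the instance belongs to exactly one zone, so a borked
	// zone's instances can't flood into the survivors.
	return index % numZones, true
}
```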

onsi commented 9 years ago

And to clarify, there are some assumptions here, chiefly that all zones have equal size.

onsi commented 9 years ago

But sigh... maybe I don't care about avoiding these floods and should just simplify this so that we can finish the feature off sooner.

onsi commented 9 years ago

I'm going to validate the "no-flood" behavior with some others. If we agree that we don't need it, I'll simplify the doc and switch to something closer to what you're describing, Michael.

jchesterpivotal commented 9 years ago

@onsi

> Fortunately, we don't need to find the best solution. Just good enough solutions.

Agreed! I definitely bend people's ears about it.

jchesterpivotal commented 9 years ago

@amitkgupta

I'm a bit lost. Putting aside the question of ranking functions: if a user asks for an impossible requirement (e.g. more RAM than any cell can provide), what happens?

I guess I got lost because the impression I got from @fraenkel's original proposal was that users would be able to specify their non-negotiable requirements, which would introduce the possibility of the pony/not-pony problem.

As for the grandstanding which followed on my part: a little knowledge is a dangerous thing.

jchesterpivotal commented 9 years ago

@onsi, flooding sounds like a cascading failure scenario, is that right? I am not familiar with that literature, but I believe there are some strategies that can be tried: for example, simply dropping everyone who died and letting the user select a strategy, or selectively ordering relocation by some QoS metric.

As I understand it, a cascading failure typically comes down to this: while a combined system's aggregate capacity exceeds current demand, any one node can be overwhelmed by the load it inherits when a given neighbour fails. It's more about local slack than global slack.

I think this happens in networking a lot? And electrical grids.

Amit-PivotalLabs commented 9 years ago

@jchesterpivotal That's not a logical inconsistency, that's just a constraint that can't be met. If no cells meet your constraints, the app fails to place. There's a separate issue about how to represent when we don't have sufficient resources to place an instance.

jchesterpivotal commented 9 years ago

Right, I got confused about the scope of the proposed change -- I was under the impression that users would be allowed to specify arbitrary mandatory requirements.

onsi commented 9 years ago

Shopped it around a bit and the new behavior around avoiding floods seems to be useful. So let's leave the stories as they are.

fraenkel commented 9 years ago

Do you want to move the rest of this discussion to vcap-dev?

The current design has similarities, at least for indices 0 and 1. Here are a couple of concerns, and these will apply to any placement algorithm:

  1. AZs do not always have equal capacity. See http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html#concepts-regions-availability-zones. It mentions that AZs may have limited capacity or be removed over time.
  2. Stacks and Placement Pools may not exist in all AZs. Some use cases around these are to target specific hardware or setup which won't be replicated as widely.

I am not saying we have to solve these now but we do need to consider them.

jbayer commented 9 years ago

@fraenkel i agree we should have the discussion in 1 place wherever that happens to be.

@littleidea had some private comments as well, and having 3 different conversations covering the same ground isn't great, because the same ground keeps getting re-covered.

@onsi where do you prefer to have the 1 conversation? with vcap-dev or this issue it seems like we'll lose track of things over time as they scroll past. the other option is to create a gdoc with comments that can be resolved, so it's less cluttered as it evolves.

onsi commented 9 years ago

Let's keep the discussion here. I'd like to avoid the too many cooks problem by taking curated well-posed questions to the mailing list and I'd like to tie issues to specific changes in the doc.

onsi commented 9 years ago

After some more thought I'm inclined to back out of this particular endeavor and just maintain the existing behavior in the DEAs. There are a number of edge cases that we'll have to be careful with if we go down this road and I see the complexity exploding.

I think Diego's in a better place than the DEAs to handle the sorts of AZ floods that I'm trying to avoid with this approach. And if we don't change things then an operator can respond by simply scaling up the surviving zone.

Any thoughts? I can switch things over to how Fraenkel has described them and update vcap-dev once we get some agreement on this thread.

fraenkel commented 9 years ago

After a day of skiing, I want to emphasize the separation of placement from what to place, which really also includes detecting scenarios that may require de-placement, i.e. killing existing instances. I was trying to keep placement somewhat simple and generic; we layer what to place on top of it. We already know that in a "working" environment we have a prioritization policy that orders the instances to place. Upon detecting "bad" scenarios, whether losing resources or running out of capacity, we can easily determine that instances need to be temporarily stopped.

I realized that the relationship between placement pools and AZs may have to be somewhat explicit in the definition of PPs. I haven't yet thought long and hard about this, since you include stacks, which are already somewhat odd given Linux vs Docker.

onsi commented 9 years ago

I'm going to close this issue and move the discussion over to the PR I just opened.