camsas / firmament

The Firmament cluster scheduling platform

Overcommit due to max flow priority over min cost by solver #71

Open pratikmeher44 opened 5 years ago

pratikmeher44 commented 5 years ago

We ran some tests with the CPU-memory cost model and observed a scenario in which Firmament overcommits a pod to a resource. After analysing the output flow graph from the solver (CS2), we found that the solver gives max flow higher priority than min cost when it has to choose between the two. In the process it overcommits a pod to a resource, and the pod fails with "OutOfcpu" status. If min cost were given higher priority than max flow, the overcommit could be avoided.

Since this is part of the solver's implementation, could you please suggest whether there is any way to make the solver solve the flow graph giving min cost higher priority than max flow?

@ms705 @ICGog

pratikmeher44 commented 5 years ago

@deepak-vij @shivramsrivastava

ms705 commented 5 years ago

Hi @pratikmeher44,

I'm not sure I follow your argument about giving max-flow higher priority. The capacity on an arc is a hard constraint -- the flow on the arc cannot exceed it, even if the alternatives lead to a higher-cost solution. The only situation in which the solver "chooses" between pushing more flow on one arc vs. another is when the arcs have identical cost; if they don't, the solver will always saturate the lower-cost arc first.

On the graph level, it is true that the solver "prioritises" max-flow in the sense that any solution to min-cost, max-flow (MCMF) must be a max-flow. The solver's task is to find a minimum-cost max-flow solution; it can never return a solution that isn't a max-flow solution over the graph.
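To make that concrete, here is a tiny sketch (using networkx as a stand-in for CS2; the node names, capacities and costs are made up) of why a min-cost max-flow solver routes flow over a more expensive arc rather than leave flow unrouted:

```python
# Minimal illustration, not Firmament/CS2 code: two identical tasks leave a
# task EC; the cheap machine arc only has capacity for one of them.
import networkx as nx

G = nx.DiGraph()
G.add_edge("source", "task_ec", capacity=2, weight=0)      # two tasks to place
G.add_edge("task_ec", "machine_A", capacity=1, weight=1)   # cheap, room for one
G.add_edge("task_ec", "machine_B", capacity=2, weight=10)  # expensive alternative
G.add_edge("machine_A", "sink", capacity=1, weight=0)
G.add_edge("machine_B", "sink", capacity=2, weight=0)

flow = nx.max_flow_min_cost(G, "source", "sink")
print(flow["task_ec"])           # {'machine_A': 1, 'machine_B': 1}
print(nx.cost_of_flow(G, flow))  # 11
```

Routing only one unit over machine_A would cost 1 instead of 11, but that is not a maximum flow, so the solver never considers it; the cost only picks between solutions that already route all the flow. In Firmament's flow graphs, the arc to the unscheduled aggregator (with a suitably high cost) is what lets a task legitimately remain unplaced instead of being forced onto an overcommitted machine.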

However, I don't believe that solving your problem requires solver changes. Instead, you'll want to adapt the cost model so that the cumulative resource demands on all incoming arcs into a machine do not exceed the machine's capacity. The solver can push flow up to the capacity of every incoming arc at the same time, so the cost model has to anticipate that all of the arcs may be used. Alternatively, if you support preemption, you can resolve the overcommit situation after the fact by evicting low-priority tasks, but since it sounds like you're referring to a Kubernetes/Poseidon setting, that option may not be available to you (I don't think Kubernetes supports preemption yet?).
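A rough sketch of that cost-model change (a hypothetical helper, not Firmament's actual cost-model API): bound the capacity of each Task EC → machine EC arc by how many copies of the task actually fit into the machine's remaining resources, so the solver cannot push more flow into a machine than it can hold, regardless of the arc costs.

```python
# Hypothetical sketch: cap the Task EC -> machine EC arc capacity by the number
# of task instances that fit in the machine's remaining CPU/memory, making
# overcommit impossible no matter which arcs the solver saturates.
from dataclasses import dataclass

@dataclass
class Resources:
    cpu_millicores: int
    memory_mb: int

def arc_capacity(request: Resources, available: Resources) -> int:
    """How many copies of this task fit into the remaining machine capacity."""
    if request.cpu_millicores <= 0 or request.memory_mb <= 0:
        return 0
    by_cpu = available.cpu_millicores // request.cpu_millicores
    by_mem = available.memory_mb // request.memory_mb
    return min(by_cpu, by_mem)

# A machine with 1.5 CPUs and 2 GiB free fits only one 1-CPU / 1-GiB task, so
# the arc capacity is 1 and a second task must flow elsewhere (or stay
# unscheduled) rather than overcommit this machine.
print(arc_capacity(Resources(1000, 1024), Resources(1500, 2048)))  # -> 1
```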

pratikmeher44 commented 5 years ago

Thanks a lot for the explanation. Just to make sure I have a clear understanding of how we can solve the overcommit issue, here is a more detailed explanation of the issue we are facing.

The CPU-memory cost model was built with reference to the net cost model and is quite similar. We assign a cost to each arc based on the CPU and memory availability and the request for each resource node. The intent is that flow goes through the least-cost arc, and we set the cost of each arc from the task aggregated EC to the machine ECs accordingly. For example, if a machine has plenty of resources available and the resource request is small, the arc from the Task EC to that machine EC gets a lower cost.
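For reference, here is a rough sketch (made-up scaling, not the actual CPU-memory cost model code) of the kind of cost assignment described above, where more headroom on a machine yields a cheaper arc:

```python
# Illustrative only: the arc cost grows with the fraction of a machine's free
# resources that the task's request would consume.
MAX_COST = 1000

def arc_cost(req_cpu_m, req_mem_mb, avail_cpu_m, avail_mem_mb):
    if req_cpu_m > avail_cpu_m or req_mem_mb > avail_mem_mb:
        return MAX_COST  # does not fit: make this arc as unattractive as possible
    cpu_frac = req_cpu_m / avail_cpu_m
    mem_frac = req_mem_mb / avail_mem_mb
    return int(MAX_COST * max(cpu_frac, mem_frac))

print(arc_cost(1000, 1024, 4000, 8192))  # lots of headroom -> 250
print(arc_cost(1000, 1024, 1200, 2048))  # tight fit -> 833
```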

From a number of tests, what I observed is that when the solver does not have to choose between min cost and max flow (i.e., mapping flow to the maximum number of arcs from the Task EC to the machine ECs), it picks all the min-cost arcs. But when it does have to choose between min cost and max flow, it routes the maximum possible task flow from the Task EC to the machine ECs without picking the least-cost incoming arc into a machine EC.

As explained above, the CPU-memory cost model assigns the cost of each arc into the machine ECs assuming the flow will always follow the min-cost arc, and builds the flow graph based on that. But the way the solver solves the flow graph, it allows flow on as many arcs as possible and maps flow onto higher-cost incoming arcs into the machine ECs, not only the least-cost arc.

Attaching the flow-graph image output from the solver below; I hope it gives a clear picture of the scenario. The details of the flow graph are shown in the image. (Attached image: pod_overcommit)

Based on this observation, could you please suggest the best solution to avoid this overcommit issue?