Overcommit due to max flow priority over min cost by solver

pratikmeher44 commented 6 years ago

We ran some tests with cpu memory cost model where we observed a scenario in which firmament over commits pod to a resource. After analysing the output flow graph from the solver(CS2), it was found that solver gives max flow more priority than min cost in a situation where it has to make a choice between the two. In that process it over commits pod to a resource and the pod fails with "OutOfcpu" status. If min cost was given higher priority than max flow the overcommit issue could be avoided.

As it is part of solver code implementation, can you please suggest if there is any way to make the solver solve the flow graph giving min cost more priority than max flow.

@ms705 @ICGog

pratikmeher44 commented 6 years ago

@deepak-vij @shivramsrivastava

ms705 commented 6 years ago

Hi @pratikmeher44,

I'm not sure I follow your argument about giving max-flow higher priority. Max-flow on an arc is a hard constraint -- the flow on an arc cannot exceed it, even if the alternatives lead to a higher-cost solution. The only situation in which the solver "chooses" between pushing more flow on an arc vs. pushing flow on another arc is when the arcs have identical cost; if they don't, the solver will always saturate the lower-cost arc first.

On the graph level, it is true that the solver "prioritises" max-flow in the sense that any solution to min-cost, max-flow (MCMF) must be a max-flow. The solver's task is to find a minimum-cost max-flow solution; it can never return a solution that isn't a max-flow solution over the graph.

However, I don't believe that solving your problem requires solver changes. Instead, you'll want to adopt the cost model so that the cumulative resource demands on all incoming arcs into a machine do not exceed the machine capacity. The solver can push flow up to the capacity of any incoming arc at the same time, so the cost model has to anticipate that the arcs may all be used. Alternatively, if you support preemption, you can resolve the overcommit situation after the fact by evicting low-priority tasks, but since it sounds like you're referring to a Kubernetes/Poseidon setting, that option may not be available to you (I don't think Kubernetes supports preemption yet?).

pratikmeher44 commented 6 years ago

Thanks a lot for the explanation. Just to make sure I have a clear understanding on how we can solve the overcommit issue, I am giving a more detailed explanation on what issue we are facing.

CPU Memory cost model was built with reference to net cost model and is quite similar. We assign cost to an arc based on cpu and memory availability and request for each resource node. So we intent to make the flow go through the least cost arc and accordingly we set the cost for each arc from Task aggregated EC to machine ECs. For example, if a machine has more availability of resource and the resource request is low then the arc from Task EC to machine EC will have a lower cost.

From a number of test what I observed is in a situation where the solver need not have to make a choice between min cost and max flow(mapping flow to maximum number of arcs from Task EC to machine ECs) it chooses all the min cost arcs. But in a situation where it has to make a choice between min cost and max flow, it chooses to select maximum number of flow of tasks from Task EC to machine ECs without considering to choose the least cost incoming arc to a machine EC.

As explained above CPU Memory cost model assigns cost on each arc to machine ECs assuming the flow will always happen based on min cost arc and creates the flow graph based on that. But the way solver solves the flow graph is it allows flow for max possible arcs and maps the flow of incoming arcs to machine ECs with higher cost and not the least cost arc.

Attaching the flow graph image output below from solver. I hope this will provide a clear understanding of the scenario. Listing out the details for the flow graph shown in the image. pod_overcommit dm

Resource Availability: Res...a34 = 16000 milliCPU Res...023 = 8000 milliCPU
Resource request from tasks: Tas...246 = 4500 milliCPU Tas...683 = 6000 milliCPU Tas...348 = 6000 milliCPU Tas...816 = 6000 milliCPU
Task aggregated ECs: EC...949 -> aggregated EC to group tasks with resource request of 4500 milliCPU. EC...487 -> aggregated EC to group tasks with resource request of 6000 milliCPU.
Res...a34 has a capacity of 16000 milliCPU so it can fit 3 tasks of 4500 milliCPU and 2 tasks of 6000 milliCPU. Hence it has 3 machine ECs created for it, out of which all 3 machine ECs are connected to EC...949 and first 2 machine ECs are connected to EC...487.
Res...023 has a capacity of 8000 milliCPU so it can fit 1 task of 4500 milliCPU and 1 task of 6000 milliCPU. Hence it has 1 machine EC created for it, and that machine EC is connected to both EC...949 and EC...487.
Observation: For example, in case of machine EC EC...943 it has two incoming arcs. One arc from Task EC EC...949 with cost 2265 and other arc from EC_...487 with cost 2327. The solver decides to map the flow through higher cost arc instead of lower cost arc. Hence here we achieve maximum flow of tasks but considering the aggregated CPU request from each task we do not have enough resource for all 4 tasks to be scheduled. If solver had taken the decision based on min cost, it would have respected the resource availability and this issue could have been avoided. Only thing that will happen is it will schedule 3 tasks instead of all 4 tasks if solver considers min cost as more priority.

Based on this observation can you please suggest what could be the best solution to avoid this over commit issue.

camsas / firmament

Overcommit due to max flow priority over min cost by solver #71